Skip to content

RATIS-1709 Support specify ThreadGroup for Daemon threads#733

Merged
szetszwo merged 30 commits intoapache:masterfrom
jiacheliu3:store-error-in-daemon
Sep 21, 2022
Merged

RATIS-1709 Support specify ThreadGroup for Daemon threads#733
szetszwo merged 30 commits intoapache:masterfrom
jiacheliu3:store-error-in-daemon

Conversation

@jiacheliu3
Copy link
Contributor

@jiacheliu3 jiacheliu3 commented Sep 3, 2022

What changes were proposed in this pull request?

  1. Updates the RaftServer API so the application can pass in an ThreadGroup to control Ratis threads consistently with the application. The app can specify ThreadGroup.uncaughtException() to catch uncaught exceptions from ratis.
  2. The most critical Daemon threads will be created with the app-specified ThreadGroup, so the application will decide what to do when RaftServer is seeing an unhandled error
  3. Only the most important Daemon threads will have this ThreadGroup and others will not, so unimportant transient errors will stay inside Ratis abstraction

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-1709

How was this patch tested?

(Please explain how this patch was tested. Ex: unit tests, manual tests)
(If this patch involves UI changes, please attach a screen-shot; otherwise, remove this)

public void setError(Throwable t) {
throwable = t;
LOG.error("Server transitioning to EXCEPTION state due to", t);
// TODO(jiacheng): will the server keep serving or just die itself?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I'm new to Ratis code in general so not sure what's the recommended design. Will this be a good place for the server to exit, or it should check this state and exit somewhere else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should let the application decide. Therefore, we should pass an UncaughtExceptionHandler from the application.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a larger question than this particular fix but, would an UncaughtExceptionHandler be too specific? I imagine the application prefers something like a pre-crash hook that's applied to all places where the RaftServer is going to crash or be set to EXCEPTION lifecycle state.

I'm good with just an UncaughtExceptionHandler for now of course because that's what my application needs.

try {
getServerRpc().close();
} catch(IOException ignored) {
// TODO(jiacheng): transition state to EXCEPTION here?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the existing code I don't think we transition to EXCEPTION very often. In many places like this, we might wanna transition the state and expose the exception to the outside. And the Ratis user can just observe the RaftServer lifecycle and choose to restart the RaftServer or do whatever else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it is in CLOSING state, it must transit to CLOSED but won't transit to EXCEPTION. CLOSING must handle the exceptions, if there are any.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this logic is gone tgt with the ErrorRecorded interface. I no longer have this default behavior in the RaftServerProxy because it now takes an uncaughtExceptionHandler from the application. Is that okay? @szetszwo

Executor(Object name, int size) {
Preconditions.assertTrue(size > 0);
// TODO(jiacheng): intentionally ignoring the server ref here? Why?
executor = Executors.newFixedThreadPool(size, r -> new Daemon(r, name + "-" + count.incrementAndGet()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentionally not passing the RaftServerImpl reference to the Daemon thread so a dead thread doesn't set the RaftServerImpl state to EXCEPTION. My intuition is that a failed prevote doesn't matter that much because it will be retried. Could someone confirm?

this.setName(name);
}

public Daemon(Runnable runnable, String name, ErrorRecorded server) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ignored Daemon threads created in JvmPauseMonitor and TimeoutScheduler because my intuition is even if they crash, the RaftServer can still live.

@jiacheliu3
Copy link
Contributor Author

@codings-dan PTAL if the design meets expectation and pls feel free to add any reviewers to the discussion. If you are all okay with the idea and design, I will move on to add unit tests.

On unit tests, I'm honestly not sure what you would expect me to add. I mean, this PR adds UnexpectedExceptionHandlers then I guess I should add tests to trigger unhandled exceptions to each Daemon thread? Or just having one test on the Daemon thread class itself would be acceptable? I'm new to Ratis repo so I'd like to conform to how you normally test. Any advice is appreciated!

@codings-dan
Copy link
Contributor

@jiacheliu3 Thanks for working on this, I will review the change recently @szetszwo Could you help take a look at this pull request?

@szetszwo
Copy link
Contributor

szetszwo commented Sep 5, 2022

@codings-dan , sure, will review this.

@jiacheliu3 , thanks a lot for working this! There were some failures in https://github.com/apache/ratis/actions/runs/2984287869 . Could you take a look?

@jiacheliu3
Copy link
Contributor Author

Is this one flaky? This UT passes on my local build.

Error:  Tests run: 16, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 42.736 s <<< FAILURE! - in org.apache.ratis.server.simulation.TestLeaderElectionWithSimulatedRpc
Error:  testRemoveListener(org.apache.ratis.server.simulation.TestLeaderElectionWithSimulatedRpc)  Time elapsed: 1.711 s  <<< ERROR!
java.lang.IllegalStateException: Found duplicated element s0|rpc:localhost:38565|admin:localhost:42183|client:localhost:37945|dataStream:localhost:42749|priority:0|startupRole:FOLLOWER in [s0|rpc:localhost:38565|admin:localhost:42183|client:localhost:37945|dataStream:localhost:42749|priority:0|startupRole:FOLLOWER, s2|rpc:localhost:40489|admin:localhost:40079|client:localhost:44937|dataStream:localhost:40043|priority:0|startupRole:FOLLOWER, s0|rpc:localhost:38565|admin:localhost:42183|client:localhost:37945|dataStream:localhost:42749|priority:0|startupRole:FOLLOWER]
	at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:73)
	at org.apache.ratis.util.Preconditions.assertUnique(Preconditions.java:128)
	at org.apache.ratis.util.Preconditions.assertUnique(Preconditions.java:122)
	at org.apache.ratis.protocol.SetConfigurationRequest$Arguments.<init>(SetConfigurationRequest.java:59)
	at org.apache.ratis.protocol.SetConfigurationRequest$Arguments.<init>(SetConfigurationRequest.java:36)
	at org.apache.ratis.protocol.SetConfigurationRequest$Arguments$Builder.build(SetConfigurationRequest.java:144)
	at org.apache.ratis.client.api.AdminApi.setConfiguration(AdminApi.java:55)
	at org.apache.ratis.client.api.AdminApi.setConfiguration(AdminApi.java:40)
	at org.apache.ratis.server.impl.LeaderElectionTests.testRemoveListener(LeaderElectionTests.java:353)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.lang.Thread.run(Thread.java:829)

@codings-dan
Copy link
Contributor

Is this one flaky? This UT passes on my local build.

Error:  Tests run: 16, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 42.736 s <<< FAILURE! - in org.apache.ratis.server.simulation.TestLeaderElectionWithSimulatedRpc
Error:  testRemoveListener(org.apache.ratis.server.simulation.TestLeaderElectionWithSimulatedRpc)  Time elapsed: 1.711 s  <<< ERROR!
java.lang.IllegalStateException: Found duplicated element s0|rpc:localhost:38565|admin:localhost:42183|client:localhost:37945|dataStream:localhost:42749|priority:0|startupRole:FOLLOWER in [s0|rpc:localhost:38565|admin:localhost:42183|client:localhost:37945|dataStream:localhost:42749|priority:0|startupRole:FOLLOWER, s2|rpc:localhost:40489|admin:localhost:40079|client:localhost:44937|dataStream:localhost:40043|priority:0|startupRole:FOLLOWER, s0|rpc:localhost:38565|admin:localhost:42183|client:localhost:37945|dataStream:localhost:42749|priority:0|startupRole:FOLLOWER]
	at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:73)
	at org.apache.ratis.util.Preconditions.assertUnique(Preconditions.java:128)
	at org.apache.ratis.util.Preconditions.assertUnique(Preconditions.java:122)
	at org.apache.ratis.protocol.SetConfigurationRequest$Arguments.<init>(SetConfigurationRequest.java:59)
	at org.apache.ratis.protocol.SetConfigurationRequest$Arguments.<init>(SetConfigurationRequest.java:36)
	at org.apache.ratis.protocol.SetConfigurationRequest$Arguments$Builder.build(SetConfigurationRequest.java:144)
	at org.apache.ratis.client.api.AdminApi.setConfiguration(AdminApi.java:55)
	at org.apache.ratis.client.api.AdminApi.setConfiguration(AdminApi.java:40)
	at org.apache.ratis.server.impl.LeaderElectionTests.testRemoveListener(LeaderElectionTests.java:353)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.lang.Thread.run(Thread.java:829)

let me re-run the ci test

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiacheliu3 , thanks again for working on this. It is better to pass an UncaughtExceptionHandler so that the application can decide what to do. Please see the comments inlined.

try {
getServerRpc().close();
} catch(IOException ignored) {
// TODO(jiacheng): transition state to EXCEPTION here?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it is in CLOSING state, it must transit to CLOSED but won't transit to EXCEPTION. CLOSING must handle the exceptions, if there are any.

public static class Builder {
private final String name;
private Runnable runnable;
private ErrorRecorded statedServer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's pass an UncaughtExceptionHandler instead of adding ErrorRecorded.

public void setError(Throwable t) {
throwable = t;
LOG.error("Server transitioning to EXCEPTION state due to", t);
// TODO(jiacheng): will the server keep serving or just die itself?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should let the application decide. Therefore, we should pass an UncaughtExceptionHandler from the application.

@jiacheliu3
Copy link
Contributor Author

@szetszwo Thanks for the comments! What do you think about exposing the exception on the RaftServer? I come from Alluxio and we use Ratis RaftServer to manage the file system metadata.
https://github.com/Alluxio/alluxio/blob/aacf14507174029fac4a997b6037207fae92ea91/core/server/common/src/main/java/alluxio/master/journal/raft/RaftJournalSystem.java#L429
It'll be very helpful if the lifecycle state on the RaftServer is more up-to-date and more ideally, we can see what error caused RaftServer to fail. It is fair if you say we don't expose the exception outside RaftServer (because the external party normally can't do better recovery than the RaftServer is capable of), but it could be more extensible to future use cases if we do expose the exception out of RaftServer. WDYT?

@szetszwo
Copy link
Contributor

szetszwo commented Sep 5, 2022

@jiacheliu3 , What do you think about exposing the exception on the RaftServer?

It is a good idea. We can do it by passing an UncaughtExceptionHandler; see https://issues.apache.org/jira/secure/attachment/13048963/733_review.patch

private static final Method NEW_RAFT_SERVER_METHOD = initNewRaftServerMethod();

private static Method initNewRaftServerMethod() {
// TODO(jiacheng): backward compatibility?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider how to keep the most backward compatibility as possible

return this;
}

public Builder setUncaughtExceptionHandler(UncaughtExceptionHandler exceptionHandler) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how much UT would be necessary. If necessary, I'll try writing something similar to RaftStateMachineExceptionTests using a MiniRaftCluster which creates a RaftServer with my user defined UncaughtExceptionHandler, and when I throw an exception internally, check if the handler is able to catch that.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiacheliu3 , thanks for the update! After checked the Java API in more details, it seems better to pass a ThreadGroup instead a UncaughtExceptionHandler (sorry that it was a bad idea); see https://issues.apache.org/jira/secure/attachment/13049347/733_review.patch

Comment on lines 26 to 28
static final Logger LOG = LoggerFactory.getLogger(Daemon.class);
public static final Thread.UncaughtExceptionHandler LOG_EXCEPTION =
(t, e) -> LOG.error(t.getName() + " threw an uncaught exception", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JVM already has a default UncaughtExceptionHandler (which prints out the exception to the System.err). Also, it has an API https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html#setDefaultUncaughtExceptionHandler-java.lang.Thread.UncaughtExceptionHandler- . We probably should not have a different default. If we do that, applications using Thread.setDefaultUncaughtExceptionHandler will not work.

@jiacheliu3 jiacheliu3 changed the title RATIS-1709 Support passing UncaughtExceptionHandler from app to RaftServer RATIS-1709 Support specify ThreadGroup for Ratis threads Sep 16, 2022
@jiacheliu3 jiacheliu3 changed the title RATIS-1709 Support specify ThreadGroup for Ratis threads RATIS-1709 Support specify ThreadGroup for Daemon threads Sep 16, 2022
@jiacheliu3
Copy link
Contributor Author

@jiacheliu3 , thanks for the update! After checked the Java API in more details, it seems better to pass a ThreadGroup instead a UncaughtExceptionHandler (sorry that it was a bad idea); see https://issues.apache.org/jira/secure/attachment/13049347/733_review.patch

@szetszwo Thanks for the suggestion that makes sense to me. One subtle implication I'm not sure about is, if the ThreadGroup is left null when creating a thread, that thread uses the current thread's ThreadGroup: code

I'm not 100% sure what that suggests. Specifically I'm not sure if threads from TimeoutScheduler.newExecutor() code will implicitly also inherit this ThreadGroup and uncaught exceptions there will propagate to the application in the same way. Is that true and is that desirable? In other words I imagine there are critical threads whose uncaught exceptions should crash the RaftServer, whereas there are unimportant threads whose crashes should be somehow tolerated and just generate a warning at most.

In contrast if we specify an UncaughtExceptionHandler in the old way, it is possible to distinguish the critical threads from the others. Not saying setUncaughtExceptionHandler() doesn't have downsides, just saying it's easier to manually configure.

@jiacheliu3 jiacheliu3 requested a review from szetszwo September 16, 2022 06:31
Copy link
Contributor

@codings-dan codings-dan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiacheliu3 Thanks a lot for working on this , overall lgtm, left some comments for ref

private final LogAppender logAppender;

LogAppenderDaemon(LogAppender logAppender) {
LogAppenderDaemon(LogAppender logAppender, RaftServer.Division server) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to pass in a ThreadGroup here instead of passing in a RaftServer.Division

@jiacheliu3 jiacheliu3 requested review from codings-dan and szetszwo and removed request for codings-dan and szetszwo September 16, 2022 14:57
Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiacheliu3 , thanks for the update! Some comments inlined.

private final ExecutorService executor;

private final JvmPauseMonitor pauseMonitor;
@Nullable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's create a default ThreadGroup and make it non-nullable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I figure a null here is fine. doc specifies that is the ThreadGroup arg is null, the current thread's group is inherited. That behavior can be better than we guess a default here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly... OpenJDK does not really respect a null ThreadGroup here and throws NPE on me. I will change this to non-null.

public static RaftServerProxy newRaftServer(
RaftPeerId id, RaftGroup group, RaftStorage.StartupOption option, StateMachine.Registry stateMachineRegistry,
RaftProperties properties, Parameters parameters) throws IOException {
RaftProperties properties, Parameters parameters, ThreadGroup threadGroup)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reorder the parameters:

ThreadGroup threadGroup, RaftProperties properties, Parameters parameters)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rearranged together with all the places that have the same argument list

final TimeDuration leaderStepDownWaitTime = RaftServerConfigKeys.LeaderElection.leaderStepDownWaitTime(properties);
this.pauseMonitor = new JvmPauseMonitor(id,
extraSleep -> handleJvmPause(extraSleep, rpcSlownessTimeout, leaderStepDownWaitTime));
this.threadGroup = threadGroup == null ? new ThreadGroup("raft-server-proxy") : threadGroup;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i figure it's safer to do the null check here than in the Builder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I'm just applying a ThreadGroup that has no unhandledException(). In the future if it's proved to be needed I can submit another PR for a better default ThreadGroup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No default uncaughtException(..) probably is okay since it will use the system default.

@jiacheliu3 jiacheliu3 requested a review from szetszwo September 19, 2022 15:16
@jiacheliu3
Copy link
Contributor Author

@szetszwo PTAL if the changes look good to me. I suppose the comments are mostly resolved so it'll be a quick one :) Thanks!

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiacheliu3 , thanks for the update! Just some minor comments inlined.

final TimeDuration leaderStepDownWaitTime = RaftServerConfigKeys.LeaderElection.leaderStepDownWaitTime(properties);
this.pauseMonitor = new JvmPauseMonitor(id,
extraSleep -> handleJvmPause(extraSleep, rpcSlownessTimeout, leaderStepDownWaitTime));
this.threadGroup = threadGroup == null ? new ThreadGroup("raft-server-proxy") : threadGroup;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No default uncaughtException(..) probably is okay since it will use the system default.

@jiacheliu3
Copy link
Contributor Author

This test passes on my local PC, I believe this one is flaky.

Error:  Failures: 
Error:    TestLeaderElectionWithSimulatedRpc>LeaderElectionTests.testEnforceLeader:284->LeaderElectionTests.enforceLeader:298 expected:<s[1]> but was:<s[3]>

@jiacheliu3 jiacheliu3 requested a review from szetszwo September 21, 2022 06:33
@codings-dan
Copy link
Contributor

Let me re-run the failed-job.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

@szetszwo szetszwo merged commit 31eff22 into apache:master Sep 21, 2022
codings-dan pushed a commit to codings-dan/incubator-ratis that referenced this pull request Sep 21, 2022
codings-dan pushed a commit that referenced this pull request Sep 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants