RATIS-1709 Support specify ThreadGroup for Daemon threads#733
RATIS-1709 Support specify ThreadGroup for Daemon threads#733szetszwo merged 30 commits intoapache:masterfrom
Conversation
ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
Outdated
Show resolved
Hide resolved
| public void setError(Throwable t) { | ||
| throwable = t; | ||
| LOG.error("Server transitioning to EXCEPTION state due to", t); | ||
| // TODO(jiacheng): will the server keep serving or just die itself? |
There was a problem hiding this comment.
Sorry I'm new to Ratis code in general so not sure what's the recommended design. Will this be a good place for the server to exit, or it should check this state and exit somewhere else?
There was a problem hiding this comment.
We should let the application decide. Therefore, we should pass an UncaughtExceptionHandler from the application.
There was a problem hiding this comment.
This might be a larger question than this particular fix but, would an UncaughtExceptionHandler be too specific? I imagine the application prefers something like a pre-crash hook that's applied to all places where the RaftServer is going to crash or be set to EXCEPTION lifecycle state.
I'm good with just an UncaughtExceptionHandler for now of course because that's what my application needs.
| try { | ||
| getServerRpc().close(); | ||
| } catch(IOException ignored) { | ||
| // TODO(jiacheng): transition state to EXCEPTION here? |
There was a problem hiding this comment.
In the existing code I don't think we transition to EXCEPTION very often. In many places like this, we might wanna transition the state and expose the exception to the outside. And the Ratis user can just observe the RaftServer lifecycle and choose to restart the RaftServer or do whatever else.
There was a problem hiding this comment.
When it is in CLOSING state, it must transit to CLOSED but won't transit to EXCEPTION. CLOSING must handle the exceptions, if there are any.
There was a problem hiding this comment.
FYI this logic is gone tgt with the ErrorRecorded interface. I no longer have this default behavior in the RaftServerProxy because it now takes an uncaughtExceptionHandler from the application. Is that okay? @szetszwo
| Executor(Object name, int size) { | ||
| Preconditions.assertTrue(size > 0); | ||
| // TODO(jiacheng): intentionally ignoring the server ref here? Why? | ||
| executor = Executors.newFixedThreadPool(size, r -> new Daemon(r, name + "-" + count.incrementAndGet())); |
There was a problem hiding this comment.
Intentionally not passing the RaftServerImpl reference to the Daemon thread so a dead thread doesn't set the RaftServerImpl state to EXCEPTION. My intuition is that a failed prevote doesn't matter that much because it will be retried. Could someone confirm?
| this.setName(name); | ||
| } | ||
|
|
||
| public Daemon(Runnable runnable, String name, ErrorRecorded server) { |
There was a problem hiding this comment.
I ignored Daemon threads created in JvmPauseMonitor and TimeoutScheduler because my intuition is even if they crash, the RaftServer can still live.
|
@codings-dan PTAL if the design meets expectation and pls feel free to add any reviewers to the discussion. If you are all okay with the idea and design, I will move on to add unit tests. On unit tests, I'm honestly not sure what you would expect me to add. I mean, this PR adds UnexpectedExceptionHandlers then I guess I should add tests to trigger unhandled exceptions to each Daemon thread? Or just having one test on the Daemon thread class itself would be acceptable? I'm new to Ratis repo so I'd like to conform to how you normally test. Any advice is appreciated! |
|
@jiacheliu3 Thanks for working on this, I will review the change recently @szetszwo Could you help take a look at this pull request? |
|
@codings-dan , sure, will review this. @jiacheliu3 , thanks a lot for working this! There were some failures in https://github.com/apache/ratis/actions/runs/2984287869 . Could you take a look? |
|
Is this one flaky? This UT passes on my local build. |
let me re-run the ci test |
szetszwo
left a comment
There was a problem hiding this comment.
@jiacheliu3 , thanks again for working on this. It is better to pass an UncaughtExceptionHandler so that the application can decide what to do. Please see the comments inlined.
| try { | ||
| getServerRpc().close(); | ||
| } catch(IOException ignored) { | ||
| // TODO(jiacheng): transition state to EXCEPTION here? |
There was a problem hiding this comment.
When it is in CLOSING state, it must transit to CLOSED but won't transit to EXCEPTION. CLOSING must handle the exceptions, if there are any.
| public static class Builder { | ||
| private final String name; | ||
| private Runnable runnable; | ||
| private ErrorRecorded statedServer; |
There was a problem hiding this comment.
Let's pass an UncaughtExceptionHandler instead of adding ErrorRecorded.
ratis-common/src/main/java/org/apache/ratis/util/TimeoutScheduler.java
Outdated
Show resolved
Hide resolved
ratis-common/src/main/java/org/apache/ratis/util/ErrorRecorded.java
Outdated
Show resolved
Hide resolved
| public void setError(Throwable t) { | ||
| throwable = t; | ||
| LOG.error("Server transitioning to EXCEPTION state due to", t); | ||
| // TODO(jiacheng): will the server keep serving or just die itself? |
There was a problem hiding this comment.
We should let the application decide. Therefore, we should pass an UncaughtExceptionHandler from the application.
ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
Outdated
Show resolved
Hide resolved
|
@szetszwo Thanks for the comments! What do you think about exposing the exception on the |
|
@jiacheliu3 , What do you think about exposing the exception on the It is a good idea. We can do it by passing an |
| private static final Method NEW_RAFT_SERVER_METHOD = initNewRaftServerMethod(); | ||
|
|
||
| private static Method initNewRaftServerMethod() { | ||
| // TODO(jiacheng): backward compatibility? |
There was a problem hiding this comment.
Consider how to keep the most backward compatibility as possible
| return this; | ||
| } | ||
|
|
||
| public Builder setUncaughtExceptionHandler(UncaughtExceptionHandler exceptionHandler) { |
There was a problem hiding this comment.
Not sure how much UT would be necessary. If necessary, I'll try writing something similar to RaftStateMachineExceptionTests using a MiniRaftCluster which creates a RaftServer with my user defined UncaughtExceptionHandler, and when I throw an exception internally, check if the handler is able to catch that.
szetszwo
left a comment
There was a problem hiding this comment.
@jiacheliu3 , thanks for the update! After checked the Java API in more details, it seems better to pass a ThreadGroup instead a UncaughtExceptionHandler (sorry that it was a bad idea); see https://issues.apache.org/jira/secure/attachment/13049347/733_review.patch
| static final Logger LOG = LoggerFactory.getLogger(Daemon.class); | ||
| public static final Thread.UncaughtExceptionHandler LOG_EXCEPTION = | ||
| (t, e) -> LOG.error(t.getName() + " threw an uncaught exception", e); |
There was a problem hiding this comment.
JVM already has a default UncaughtExceptionHandler (which prints out the exception to the System.err). Also, it has an API https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html#setDefaultUncaughtExceptionHandler-java.lang.Thread.UncaughtExceptionHandler- . We probably should not have a different default. If we do that, applications using Thread.setDefaultUncaughtExceptionHandler will not work.
@szetszwo Thanks for the suggestion that makes sense to me. One subtle implication I'm not sure about is, if the I'm not 100% sure what that suggests. Specifically I'm not sure if threads from In contrast if we specify an |
codings-dan
left a comment
There was a problem hiding this comment.
@jiacheliu3 Thanks a lot for working on this , overall lgtm, left some comments for ref
| private final LogAppender logAppender; | ||
|
|
||
| LogAppenderDaemon(LogAppender logAppender) { | ||
| LogAppenderDaemon(LogAppender logAppender, RaftServer.Division server) { |
There was a problem hiding this comment.
Is it possible to pass in a ThreadGroup here instead of passing in a RaftServer.Division
ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerProxy.java
Outdated
Show resolved
Hide resolved
szetszwo
left a comment
There was a problem hiding this comment.
@jiacheliu3 , thanks for the update! Some comments inlined.
ratis-server-api/src/main/java/org/apache/ratis/server/RaftServer.java
Outdated
Show resolved
Hide resolved
ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
Outdated
Show resolved
Hide resolved
ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
Outdated
Show resolved
Hide resolved
ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
Show resolved
Hide resolved
ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerProxy.java
Outdated
Show resolved
Hide resolved
| private final ExecutorService executor; | ||
|
|
||
| private final JvmPauseMonitor pauseMonitor; | ||
| @Nullable |
There was a problem hiding this comment.
Let's create a default ThreadGroup and make it non-nullable.
There was a problem hiding this comment.
Actually I figure a null here is fine. doc specifies that is the ThreadGroup arg is null, the current thread's group is inherited. That behavior can be better than we guess a default here?
There was a problem hiding this comment.
Interestingly... OpenJDK does not really respect a null ThreadGroup here and throws NPE on me. I will change this to non-null.
| public static RaftServerProxy newRaftServer( | ||
| RaftPeerId id, RaftGroup group, RaftStorage.StartupOption option, StateMachine.Registry stateMachineRegistry, | ||
| RaftProperties properties, Parameters parameters) throws IOException { | ||
| RaftProperties properties, Parameters parameters, ThreadGroup threadGroup) |
There was a problem hiding this comment.
Please reorder the parameters:
ThreadGroup threadGroup, RaftProperties properties, Parameters parameters)There was a problem hiding this comment.
Rearranged together with all the places that have the same argument list
| final TimeDuration leaderStepDownWaitTime = RaftServerConfigKeys.LeaderElection.leaderStepDownWaitTime(properties); | ||
| this.pauseMonitor = new JvmPauseMonitor(id, | ||
| extraSleep -> handleJvmPause(extraSleep, rpcSlownessTimeout, leaderStepDownWaitTime)); | ||
| this.threadGroup = threadGroup == null ? new ThreadGroup("raft-server-proxy") : threadGroup; |
There was a problem hiding this comment.
i figure it's safer to do the null check here than in the Builder
There was a problem hiding this comment.
Here I'm just applying a ThreadGroup that has no unhandledException(). In the future if it's proved to be needed I can submit another PR for a better default ThreadGroup.
There was a problem hiding this comment.
No default uncaughtException(..) probably is okay since it will use the system default.
|
@szetszwo PTAL if the changes look good to me. I suppose the comments are mostly resolved so it'll be a quick one :) Thanks! |
szetszwo
left a comment
There was a problem hiding this comment.
@jiacheliu3 , thanks for the update! Just some minor comments inlined.
ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerProxy.java
Outdated
Show resolved
Hide resolved
| final TimeDuration leaderStepDownWaitTime = RaftServerConfigKeys.LeaderElection.leaderStepDownWaitTime(properties); | ||
| this.pauseMonitor = new JvmPauseMonitor(id, | ||
| extraSleep -> handleJvmPause(extraSleep, rpcSlownessTimeout, leaderStepDownWaitTime)); | ||
| this.threadGroup = threadGroup == null ? new ThreadGroup("raft-server-proxy") : threadGroup; |
There was a problem hiding this comment.
No default uncaughtException(..) probably is okay since it will use the system default.
ratis-server/src/test/java/org/apache/ratis/server/impl/MiniRaftCluster.java
Outdated
Show resolved
Hide resolved
|
This test passes on my local PC, I believe this one is flaky. |
|
Let me re-run the failed-job. |
szetszwo
left a comment
There was a problem hiding this comment.
+1 the change looks good.
(cherry picked from commit 31eff22)
(cherry picked from commit 31eff22)
What changes were proposed in this pull request?
ThreadGroupto control Ratis threads consistently with the application. The app can specifyThreadGroup.uncaughtException()to catch uncaught exceptions from ratis.Daemonthreads will be created with the app-specifiedThreadGroup, so the application will decide what to do whenRaftServeris seeing an unhandled errorDaemonthreads will have thisThreadGroupand others will not, so unimportant transient errors will stay inside Ratis abstractionWhat is the link to the Apache JIRA
https://issues.apache.org/jira/browse/RATIS-1709
How was this patch tested?
(Please explain how this patch was tested. Ex: unit tests, manual tests)
(If this patch involves UI changes, please attach a screen-shot; otherwise, remove this)