Added timing data and more granular stages to SegmentReplicationState#4222
Conversation
Gradle Check (Jenkins) Run Completed with:
|
|
Looks unrelated, refiring. |
Gradle Check (Jenkins) Run Completed with:
|
|
Looks like a legit failure. @kartg you may want to look |
Gradle Check (Jenkins) Run Completed with:
|
|
Looks like two flaky test failures: and one legitimate test failure related to segment replication - |
Gradle Check (Jenkins) Run Completed with:
|
|
One failing test, unrelated to segment replication, but i can consistently repro the failure locally with the given seed |
@kartg could you please create an issue for this flaky test, I suspect it has been introduced in [1], first time spotted there [2] [1] #4183 |
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
This change introduces instrumentation logging that measures the latency of the various stages of segment replication as seen by each replica. Logs have also been added to the source node for checkpoint publishing and checkpoint metadata responses. All logging is currently at the TRACE level. Signed-off-by: Kartik Ganesh <gkart@amazon.com>
Signed-off-by: Kartik Ganesh <gkart@amazon.com>
Signed-off-by: Kartik Ganesh <gkart@amazon.com>
Signed-off-by: Kartik Ganesh <gkart@amazon.com>
9b6cbff to
c395ac1
Compare
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
|
@reta @dreamer-89 @mch2 test issues resolved, ready for your 👀 |
| * IllegalStateException to fail the shard | ||
| */ | ||
| if (diff.different.isEmpty() == false) { | ||
| getFilesListener.onFailure( |
There was a problem hiding this comment.
Certainly an existing code, but it seems like there is a bug here: we don't thrown an exception but call the listener direclty and move on ... @kartg wdyt?
There was a problem hiding this comment.
@reta Good catch! We shouldn't be proceeding if the failure listener is invoked. Would you mind if i incorporated the fix for this as a follow-up PR? I'd like to keep the scope of this PR focused on the seg-rep stages and timing data.
| state.setStage(SegmentReplicationState.Stage.GET_CHECKPOINT_INFO); | ||
| source.getCheckpointMetadata(getId(), checkpoint, checkpointInfoListener); | ||
|
|
||
| checkpointInfoListener.whenComplete(checkpointInfo -> getFiles(checkpointInfo, getFilesListener), listener::onFailure); |
There was a problem hiding this comment.
I think there is an issue between stage demarcation and handlng the unhappy path. @kartg please correct me if I am wrong but in case of failure checkpointInfoListener (and others) would delegate to listener::onFailure without marking the current stage as "finished", correct?
There was a problem hiding this comment.
There was a problem hiding this comment.
That's fine (I think), the dangling stage is what concerns me: since we publish timing on success only, we do nothing on failure (we don't even know if stage was run, which could be even more important than success for troubleshooting). In any case, leaving the decision up to you.
…#4222) * Added timing data and more granular stages to SegmentReplicationState This change introduces instrumentation logging that measures the latency of the various stages of segment replication as seen by each replica. Logs have also been added to the source node for checkpoint publishing and checkpoint metadata responses. All logging is currently at the TRACE level. Signed-off-by: Kartik Ganesh <gkart@amazon.com> * Fixing SegmentReplicationTarget tests Signed-off-by: Kartik Ganesh <gkart@amazon.com> * Incorporated PR feedback Signed-off-by: Kartik Ganesh <gkart@amazon.com> * Fixing SegmentReplicationTargetService tests Signed-off-by: Kartik Ganesh <gkart@amazon.com> Signed-off-by: Kartik Ganesh <gkart@amazon.com> (cherry picked from commit a2ba3a8)
…eplicationState (#4367) * Added timing data and more granular stages to SegmentReplicationState (#4222) * Added timing data and more granular stages to SegmentReplicationState This change introduces instrumentation logging that measures the latency of the various stages of segment replication as seen by each replica. Logs have also been added to the source node for checkpoint publishing and checkpoint metadata responses. All logging is currently at the TRACE level. Signed-off-by: Kartik Ganesh <gkart@amazon.com> * Fixing SegmentReplicationTarget tests Signed-off-by: Kartik Ganesh <gkart@amazon.com> * Incorporated PR feedback Signed-off-by: Kartik Ganesh <gkart@amazon.com> * Fixing SegmentReplicationTargetService tests Signed-off-by: Kartik Ganesh <gkart@amazon.com> Signed-off-by: Kartik Ganesh <gkart@amazon.com> (cherry picked from commit a2ba3a8) * Update changelog for backport pr. Signed-off-by: Marc Handalian <handalm@amazon.com> Signed-off-by: Marc Handalian <handalm@amazon.com> Co-authored-by: Kartik Ganesh <gkart@amazon.com> Co-authored-by: Marc Handalian <handalm@amazon.com>
Description
This change introduces instrumentation logging that measures the latency of the various stages of segment replication as seen by each replica. Logs have also been added to the source node for checkpoint publishing and checkpoint metadata responses. All logging is currently at the TRACE level.
Signed-off-by: Kartik Ganesh gkart@amazon.com
Issues Resolved
Part of #2583
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.