feat: migrate CI from SSH to SSM with SSH fallback #20555
Merged
Conversation
force-pushed 3fbf4e2 to 05e27d1
force-pushed bcdc854 to 785532a
force-pushed 5895b43 to 504b146
force-pushed 504b146 to 08afd61
force-pushed 08afd61 to 49e3f7e
charlielye requested changes on Mar 5, 2026
force-pushed 49e3f7e to 3adba1f
force-pushed 3adba1f to f8f4907
force-pushed da34f09 to 9102a45
charlielye approved these changes on Mar 6, 2026
ludamad reviewed on Mar 6, 2026
force-pushed f504929 to db765a8
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
force-pushed db765a8 to dfa23c2
PhilWindle pushed a commit that referenced this pull request on Mar 10, 2026
## Summary

Fixes a race condition in the `duplicate_attestation_slash` and `duplicate_proposal_slash` e2e tests that caused intermittent timeouts waiting for slashing offenses to be detected.

## Root Cause

Investigated CI failure [2abee794200ae6a7](http://ci.aztec-labs.com/2abee794200ae6a7) (commit `f71c235a670d`, merge queue for PR #20555). This run **did include** the fix from #20990, yet the `duplicate_attestation_slash` test still failed with a timeout after ~478 seconds.

The failure sequence from the logs:

1. `awaitEpochWithProposer` found the malicious proposer at **slot 14** (epoch 7)
2. The function returned, having warped L1 time to the start of epoch 7
3. Sequencers were then started (`await Promise.all(nodes.map(n => n.getSequencer()!.start()))`)
4. By the time sequencers were ready to build blocks, L1 time had advanced past epoch 7 into **epoch 8** (first block built at slot 16)
5. The malicious proposer was never selected in epoch 8+, so no duplicate proposals/attestations were created
6. The slasher had nothing to detect, and the test timed out

The core issue: `awaitEpochWithProposer` warped to the target epoch and returned, but starting sequencers takes real time. With only 2 slots per epoch (48s total), the epoch passed before the sequencers could act.

## Fix

Renamed `awaitEpochWithProposer` to `advanceToEpochBeforeProposer` and changed the approach to a two-phase pattern:

1. **Find phase**: The function now checks the **next** epoch's slots (N+1) instead of the current epoch's (N). When the target proposer is found, it returns `{ targetEpoch }` while staying at epoch N, one full epoch before the target.
2. **Start + warp phase** (in the test): After the function returns, sequencers are started while still one epoch before the target. Only then does the test warp to the target epoch via `advanceToEpoch(targetEpoch)`.

This eliminates the race because sequencers are already running when the target epoch begins. The function can safely query future epoch slots because `epochCache.getProposerAttesterAddressInSlot` works for any slot within the `lagInEpochsForValidatorSet` window (typically 2 epochs ahead), and we only look 1 epoch ahead.

## Changes

- **`shared.ts`**: Renamed `awaitEpochWithProposer` -> `advanceToEpochBeforeProposer`. Now checks the next epoch's slots and returns `{ targetEpoch: EpochNumber }` instead of `void`.
- **`duplicate_attestation_slash.test.ts`**: Updated to start sequencers before warping to the target epoch.
- **`duplicate_proposal_slash.test.ts`**: Same change. Also filtered offense assertions to only check `DUPLICATE_PROPOSAL` offenses, since the two malicious nodes sharing the same key each self-attest to their own (different) checkpoint proposals, causing honest nodes to also detect a `DUPLICATE_ATTESTATION` offense.

Fixes A-632
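A minimal TypeScript sketch of the find phase described above; it is not the actual `shared.ts` implementation. Only `advanceToEpochBeforeProposer`, `targetEpoch`, and `epochCache.getProposerAttesterAddressInSlot` come from the commit message; `EpochCache`, `advanceToEpoch`, `slotsPerEpoch`, and `maxEpochsToSearch` are hypothetical stand-ins for the e2e harness.

```typescript
// Sketch only: every name except advanceToEpochBeforeProposer, targetEpoch,
// and getProposerAttesterAddressInSlot is a hypothetical stand-in.
type EpochNumber = bigint;

interface EpochCache {
  getCurrentEpoch(): Promise<EpochNumber>;
  /** Valid for slots up to the lagInEpochsForValidatorSet window (~2 epochs) ahead. */
  getProposerAttesterAddressInSlot(slot: bigint): Promise<string>;
}

/**
 * Find phase: scan the NEXT epoch's slots for the target proposer and return
 * while still one full epoch early, so callers can start sequencers first.
 */
async function advanceToEpochBeforeProposer(
  epochCache: EpochCache,
  targetProposer: string,
  slotsPerEpoch: bigint,
  advanceToEpoch: (epoch: EpochNumber) => Promise<void>,
  maxEpochsToSearch = 100,
): Promise<{ targetEpoch: EpochNumber }> {
  for (let tries = 0; tries < maxEpochsToSearch; tries++) {
    const next = (await epochCache.getCurrentEpoch()) + 1n;
    for (let slot = next * slotsPerEpoch; slot < (next + 1n) * slotsPerEpoch; slot++) {
      if ((await epochCache.getProposerAttesterAddressInSlot(slot)) === targetProposer) {
        return { targetEpoch: next }; // still at epoch next - 1
      }
    }
    await advanceToEpoch(next); // not in the next epoch: advance one and retry
  }
  throw new Error(`proposer ${targetProposer} not found within ${maxEpochsToSearch} epochs`);
}
```

The second phase then belongs to the tests: start every sequencer while still one epoch early, and only then call `advanceToEpoch(targetEpoch)`, so the sequencers are already live when the target epoch's first slot begins.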
AztecBot pushed a commit that referenced this pull request on Mar 19, 2026
Summary

Migrates all CI build orchestration from SSH to AWS Systems Manager (SSM) as the default, with SSH preserved as a fallback via `CI_USE_SSH=1`. OIDC federation replaces static AWS credentials. The CI dashboard now reads logs exclusively from Redis + S3 (disk reads removed).
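On the last point, the dashboard itself is Python (`rk.py` with `boto3`); purely to illustrate the Redis-then-S3 read-through it now performs, here is a sketch in TypeScript, with the key layout and bucket name invented:

```typescript
import Redis from 'ioredis';
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';

// Illustrative sketch of a Redis-then-S3 log lookup; the real dashboard is
// Python/boto3, and the `log:` key prefix and bucket name here are invented.
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
const s3 = new S3Client({});

async function readLog(logId: string): Promise<string | undefined> {
  // Hot path: recent logs live in Redis.
  const cached = await redis.get(`log:${logId}`);
  if (cached !== null) return cached;

  // Cold path: fall back to the S3 archive; no disk read anywhere.
  try {
    const obj = await s3.send(
      new GetObjectCommand({ Bucket: 'hypothetical-ci-logs', Key: `logs/${logId}` }),
    );
    return await obj.Body?.transformToString();
  } catch {
    return undefined; // not found in either store
  }
}
```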
Changes by file

- `.github/workflows/ci3.yml`: `aws-actions/configure-aws-credentials@v4`, `ci-ssm` label override, `CI_USE_SSH` from repo variable
- `.github/ci3.sh`: `CI_USE_SSH=1`; validates `CI3_INSTANCE_PROFILE_NAME` and `CI3_SECURITY_GROUP_ID` in SSM mode
- `ci.sh`: `bootstrap_ssm_with_link`; SSH fallback via `CI_USE_SSH=1`; SSM shell commands; pre-generated `CI_LOG_ID`
- `ci3/bootstrap_ssm` (cf. `bootstrap_ec2`): launches EC2 without key pair, waits for SSM agent, sends command via `send-command`, polls for completion, handles spot eviction (see the sketch after this list)
- `ci3/aws_request_instance`: instance type, `KeyName`, `IamInstanceProfile`, `SecurityGroupIds` handling; skip SSH wait when no key
- `ci3/source_cache`: `cache_persistent` always writes Redis + S3 (no flags); S3 transfer functions added; disk functions kept as legacy dead code
- `ci3/source_redis`: `CI_SSM_MODE=1`
- `ci3/cache_upload`: `CI_SSM_MODE=1`
- `ci3/log_ci_run`: uses `CI_LOG_ID` if set
- `ci3/dashboard/rk.py`: `list_available_flows` reads S3; `/logs-disk` volume mount removed
- `ci3/dashboard/requirements.txt`: `boto3`
- `ci3/dashboard/deploy-test.sh`
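`ci3/bootstrap_ssm` itself is a bash script driving the AWS CLI; as an illustration of the same send-and-poll flow, here is a sketch using `@aws-sdk/client-ssm`. The instance id, shell command, and poll interval are placeholders, and the real script's wait for the SSM agent to register is omitted:

```typescript
import {
  SSMClient,
  SendCommandCommand,
  GetCommandInvocationCommand,
} from '@aws-sdk/client-ssm';

// Illustrative sketch of the send-command + poll pattern that bootstrap_ssm
// drives via the AWS CLI; instance id, command, and interval are placeholders.
async function runViaSsm(instanceId: string, shellCommand: string): Promise<void> {
  const ssm = new SSMClient({});
  const sent = await ssm.send(
    new SendCommandCommand({
      InstanceIds: [instanceId],
      DocumentName: 'AWS-RunShellScript',
      Parameters: { commands: [shellCommand] },
    }),
  );
  const commandId = sent.Command?.CommandId;
  if (!commandId) throw new Error('send-command returned no CommandId');

  for (;;) {
    await new Promise(resolve => setTimeout(resolve, 5_000));
    const inv = await ssm.send(
      new GetCommandInvocationCommand({ CommandId: commandId, InstanceId: instanceId }),
    );
    if (inv.Status === 'Success') return;
    if (inv.Status === 'Failed' || inv.Status === 'Cancelled' || inv.Status === 'TimedOut') {
      // Spot eviction surfaces as a failed invocation; the real script re-requests.
      throw new Error(`SSM command ${inv.Status}: ${inv.StandardErrorContent ?? ''}`);
    }
    // Otherwise Pending/InProgress/Delayed: keep polling.
  }
}
```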
How to test

- Add the `ci-ssm` label to this PR to force SSM mode (label/variable precedence is sketched below)
- Set the `CI_USE_SSH` repo variable in Settings > Variables > Actions
- Dashboard: ci.aztec-labs.com:8082
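The precedence between the label and the repo variable, distilled into a hypothetical helper (the real checks live in `.github/workflows/ci3.yml` and `ci.sh`; the function name is invented):

```typescript
// Hypothetical distillation of the transport choice: the ci-ssm label forces
// SSM for one PR, CI_USE_SSH=1 flips the global default back to SSH, and SSM
// is the default otherwise.
function resolveTransport(hasCiSsmLabel: boolean, ciUseSsh: string | undefined): 'ssm' | 'ssh' {
  if (hasCiSsmLabel) return 'ssm'; // per-PR override
  if (ciUseSsh === '1') return 'ssh'; // global rollback switch
  return 'ssm'; // new default
}

// e.g. resolveTransport(prLabels.includes('ci-ssm'), process.env.CI_USE_SSH)
```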
Rollback

- Set `CI_USE_SSH=1` in repo variables (Settings > Variables > Actions) to revert globally
- Remove the `ci-ssm` label to use the global setting

🤖 Generated with Claude Code