
feat: migrate CI from SSH to SSM with SSH fallback #20555

Merged
ludamad merged 1 commit into next from rq/ssm-migration on Mar 7, 2026

Conversation

@randyquaye randyquaye commented Feb 16, 2026

## Summary

Migrates all CI build orchestration from SSH to AWS Systems Manager (SSM) as the default, with SSH preserved as a fallback via CI_USE_SSH=1. OIDC federation replaces static AWS credentials. The CI dashboard now reads logs exclusively from Redis + S3 (disk reads removed).
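
The dispatch in `ci.sh` reduces to an environment-variable branch. A minimal sketch, assuming the helper names from the table below (`bootstrap_ssm_with_link`, `bootstrap_ec2`); the log-id generation and argument handling are illustrative, not the literal diff:

```bash
# Sketch of the SSM/SSH dispatch in ci.sh (illustrative).

# Pre-generate the log id up front: SSM cannot return values mid-run,
# so log_ci_run reuses this id instead of minting its own.
export CI_LOG_ID=${CI_LOG_ID:-$(openssl rand -hex 8)}

if [ "${CI_USE_SSH:-0}" = "1" ]; then
  # Fallback path: existing keys, secrets, and bastion stay intact.
  bootstrap_ec2 "$@"
else
  # Default path: no key pair; commands are delivered via SSM.
  export CI_SSM_MODE=1
  bootstrap_ssm_with_link "$@"
fi
```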

## Changes by file

| File | Change | Consequence |
| --- | --- | --- |
| `.github/workflows/ci3.yml` | OIDC permissions, `aws-actions/configure-aws-credentials@v4`, `ci-ssm` label override, `CI_USE_SSH` from repo variable | SSM mode is default; SSH via variable toggle; per-PR override via label |
| `.github/ci3.sh` | AWS creds conditional on `CI_USE_SSH=1`; validates `CI3_INSTANCE_PROFILE_NAME` and `CI3_SECURITY_GROUP_ID` in SSM mode | No static AWS keys needed in SSM mode |
| `ci.sh` | All CI modes use `bootstrap_ssm_with_link`; SSH fallback via `CI_USE_SSH=1`; SSM shell commands; pre-generated `CI_LOG_ID` | Single entry point switches between SSM and SSH |
| `ci3/bootstrap_ssm` | New file — SSM equivalent of `bootstrap_ec2`: launches EC2 without a key pair, waits for the SSM agent, sends the command via `send-command`, polls for completion, handles spot eviction | Replaces SSH for remote build execution (sketched below) |
| `ci3/aws_request_instance_type` | Conditional `KeyName`, `IamInstanceProfile`, `SecurityGroupIds`; skip SSH wait when no key | Supports both SSH and SSM launch modes |
| `ci3/source_cache` | `cache_persistent` always writes Redis + S3 (no flags); S3 transfer functions added; disk functions kept as legacy dead code | Logs always land in S3 regardless of mode |
| `ci3/source_redis` | Skip SSH tunnel when `CI_SSM_MODE=1` | Direct Redis access via security group |
| `ci3/cache_upload` | Skip AWS key check when `CI_SSM_MODE=1` | Uses instance profile (IMDS) for credentials |
| `ci3/log_ci_run` | Use pre-generated `CI_LOG_ID` if set | SSM can't return values mid-run |
| `ci3/dashboard/rk.py` | Removed all disk-based reads; Redis -> S3 only; `list_available_flows` reads S3 | Dashboard no longer needs `/logs-disk` volume mount |
| `ci3/dashboard/requirements.txt` | Added `boto3` | S3 SDK dependency |
| `ci3/dashboard/deploy-test.sh` | New file — deploys a test dashboard instance via SSM on ports 8082/8083 | Test dashboard without SSH access |
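
The core of `ci3/bootstrap_ssm` is a send-and-poll loop against the SSM API. A minimal sketch using the AWS CLI, assuming the instance is already launched and `$instance_id` and `$build_cmd` are set; the real script's spot-eviction recovery and output streaming are omitted:

```bash
# Sketch of the send/poll core of ci3/bootstrap_ssm (illustrative).
# Wait for the SSM agent to register before sending anything.
while [ "$(aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$instance_id" \
    --query 'InstanceInformationList[0].PingStatus' --output text)" != "Online" ]; do
  sleep 5
done

# Deliver the build command; unlike SSH, this returns immediately with an id.
command_id=$(aws ssm send-command \
  --instance-ids "$instance_id" \
  --document-name "AWS-RunShellScript" \
  --parameters "commands=[\"$build_cmd\"]" \
  --query 'Command.CommandId' --output text)

# Poll until the invocation leaves the in-flight states.
while :; do
  status=$(aws ssm get-command-invocation \
    --command-id "$command_id" --instance-id "$instance_id" \
    --query 'Status' --output text)
  case "$status" in
    Success) exit 0 ;;
    Failed|Cancelled|TimedOut) exit 1 ;;   # spot eviction surfaces here too
    *) sleep 10 ;;
  esac
done
```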

## How to test

- Add the `ci-ssm` label to this PR to force SSM mode
- Toggle the `CI_USE_SSH` repo variable in Settings > Variables > Actions
- Test dashboard at ci.aztec-labs.com:8082
- Verify logs appear in S3 and on the dashboard after a CI run

## Rollback

- Set `CI_USE_SSH=1` in repo variables (Settings > Variables > Actions) to revert globally
- All SSH infrastructure (keys, secrets, bastion) remains intact as a fallback
- Per-PR: remove the `ci-ssm` label to use the global setting

🤖 Generated with Claude Code

@randyquaye randyquaye marked this pull request as ready for review February 19, 2026 13:03
@randyquaye randyquaye force-pushed the rq/ssm-migration branch 6 times, most recently from 3fbf4e2 to 05e27d1 Compare February 19, 2026 15:54
@randyquaye randyquaye force-pushed the rq/ssm-migration branch 2 times, most recently from bcdc854 to 785532a Compare March 5, 2026 12:15
@randyquaye randyquaye changed the title feat: migrate CI from SSH to SSM feat: migrate CI from SSH to SSM with SSH fallback Mar 5, 2026
@randyquaye randyquaye force-pushed the rq/ssm-migration branch 3 times, most recently from 5895b43 to 504b146 Compare March 5, 2026 13:03
@randyquaye randyquaye marked this pull request as draft March 5, 2026 13:17
@randyquaye randyquaye marked this pull request as ready for review March 5, 2026 13:30
@randyquaye randyquaye added the ci-ssm (CI: Use SSM mode for CI3) label Mar 5, 2026
@randyquaye randyquaye removed the ci-ssm (CI: Use SSM mode for CI3) label Mar 6, 2026
@randyquaye randyquaye force-pushed the rq/ssm-migration branch 5 times, most recently from da34f09 to 9102a45 Compare March 6, 2026 12:18
@randyquaye randyquaye requested a review from charlielye March 6, 2026 12:36
@randyquaye randyquaye added this pull request to the merge queue Mar 6, 2026
@randyquaye randyquaye removed this pull request from the merge queue due to a manual request Mar 6, 2026
@ludamad ludamad added the ci-release-pr (Creates a development tag and runs the release suite) label Mar 6, 2026
@ludamad ludamad enabled auto-merge March 6, 2026 19:42
@AztecBot AztecBot removed the ci-release-pr (Creates a development tag and runs the release suite) label Mar 6, 2026
@ludamad ludamad force-pushed the rq/ssm-migration branch from f504929 to db765a8 Compare March 6, 2026 22:45
@ludamad ludamad added the ci-release-pr (Creates a development tag and runs the release suite) label Mar 6, 2026
@AztecBot AztecBot removed the ci-release-pr (Creates a development tag and runs the release suite) label Mar 6, 2026
@ludamad ludamad added this pull request to the merge queue Mar 7, 2026
Merged via the queue into next with commit f71c235 Mar 7, 2026
18 checks passed
@ludamad ludamad deleted the rq/ssm-migration branch March 7, 2026 00:57
PhilWindle pushed a commit that referenced this pull request Mar 10, 2026
## Summary

Fixes a race condition in the `duplicate_attestation_slash` and
`duplicate_proposal_slash` e2e tests that caused intermittent timeouts
waiting for slashing offenses to be detected.

## Root Cause

Investigated CI failure
[2abee794200ae6a7](http://ci.aztec-labs.com/2abee794200ae6a7) (commit
`f71c235a670d`, merge queue for PR #20555). This run **did include** the
fix from #20990, yet the `duplicate_attestation_slash` test still failed
with a timeout after ~478 seconds.

The failure sequence from the logs:
1. `awaitEpochWithProposer` found the malicious proposer at **slot 14**
(epoch 7)
2. The function returned, having warped L1 time to the start of epoch 7
3. Sequencers were then started (`await Promise.all(nodes.map(n =>
n.getSequencer()!.start()))`)
4. By the time sequencers were ready to build blocks, L1 time had
advanced past epoch 7 into **epoch 8** (first block built at slot 16)
5. The malicious proposer was never selected in epoch 8+, so no
duplicate proposals/attestations were created
6. The slasher had nothing to detect, and the test timed out

The core issue: `awaitEpochWithProposer` warped to the target epoch and
returned, but starting sequencers takes real time. With only 2 slots per
epoch (48s total), the epoch passed before sequencers could act.

## Fix

Renamed `awaitEpochWithProposer` to `advanceToEpochBeforeProposer` and
changed the approach to a two-phase pattern:

1. **Find phase**: The function now checks the **next** epoch's slots
(N+1) instead of the current epoch's (N). When the target proposer is
found, it returns `{ targetEpoch }` while staying at epoch N -- one full
epoch before the target.

2. **Start + warp phase** (in the test): After the function returns,
sequencers are started while still one epoch before the target. Only
then does the test warp to the target epoch via
`advanceToEpoch(targetEpoch)`.

This eliminates the race because sequencers are already running when the
target epoch begins.

The function can safely query future epoch slots because
`epochCache.getProposerAttesterAddressInSlot` works for any slot within
the `lagInEpochsForValidatorSet` window (typically 2 epochs ahead), and
we only look 1 epoch ahead.
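
In test code, the two-phase pattern looks roughly like this (a sketch, not the literal diff; `advanceToEpochBeforeProposer`, `advanceToEpoch`, and the sequencer-start line come from this PR's description, while `maliciousProposer` and the surrounding setup are illustrative):

```ts
// Phase 1: find an upcoming epoch whose slots select the malicious
// proposer, without warping past it. On return we are still one full
// epoch before targetEpoch.
const { targetEpoch } = await advanceToEpochBeforeProposer(maliciousProposer);

// Phase 2: start the sequencers while there is still a whole epoch of
// headroom, so they are ready the moment the target epoch begins.
await Promise.all(nodes.map(n => n.getSequencer()!.start()));

// Only now warp L1 time to the epoch where the duplicate proposals and
// attestations should be produced.
await advanceToEpoch(targetEpoch);
```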

## Changes

- **`shared.ts`**: Renamed `awaitEpochWithProposer` ->
`advanceToEpochBeforeProposer`. Now checks next epoch's slots and
returns `{ targetEpoch: EpochNumber }` instead of `void`.
- **`duplicate_attestation_slash.test.ts`**: Updated to start sequencers
before warping to target epoch.
- **`duplicate_proposal_slash.test.ts`**: Same change. Also filtered
offense assertions to only check `DUPLICATE_PROPOSAL` offenses, since
the two malicious nodes sharing the same key each self-attest to their
own (different) checkpoint proposals, causing honest nodes to also
detect a `DUPLICATE_ATTESTATION` offense.

Fixes A-632
AztecBot pushed a commit that referenced this pull request Mar 19, 2026