
feat: migrate CI from SSH to SSM with SSH fallback #20555

Merged
ludamad merged 1 commit into next from rq/ssm-migration on Mar 7, 2026

Conversation

@randyquaye randyquaye commented Feb 16, 2026

## Summary

Migrates all CI build orchestration from SSH to AWS Systems Manager (SSM) as the default, with SSH preserved as a fallback via CI_USE_SSH=1. OIDC federation replaces static AWS credentials. The CI dashboard now reads logs exclusively from Redis + S3 (disk reads removed).
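
The dispatch in `ci.sh` reduces to an environment-variable branch. A minimal sketch, assuming the helper names from the table below (`bootstrap_ssm_with_link`, `bootstrap_ec2`); the log-id generation and argument handling are illustrative, not the literal diff:

```bash
# Sketch of the SSM/SSH dispatch in ci.sh (illustrative).

# Pre-generate the log id up front: SSM cannot return values mid-run,
# so log_ci_run reuses this id instead of minting its own.
export CI_LOG_ID=${CI_LOG_ID:-$(openssl rand -hex 8)}

if [ "${CI_USE_SSH:-0}" = "1" ]; then
  # Fallback path: existing keys, secrets, and bastion stay intact.
  bootstrap_ec2 "$@"
else
  # Default path: no key pair; commands are delivered via SSM.
  export CI_SSM_MODE=1
  bootstrap_ssm_with_link "$@"
fi
```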

## Changes by file

| File | Change | Consequence |
| --- | --- | --- |
| `.github/workflows/ci3.yml` | OIDC permissions, `aws-actions/configure-aws-credentials@v4`, `ci-ssm` label override, `CI_USE_SSH` from repo variable | SSM mode is default; SSH via variable toggle; per-PR override via label |
| `.github/ci3.sh` | AWS creds conditional on `CI_USE_SSH=1`; validates `CI3_INSTANCE_PROFILE_NAME` and `CI3_SECURITY_GROUP_ID` in SSM mode | No static AWS keys needed in SSM mode |
| `ci.sh` | All CI modes use `bootstrap_ssm_with_link`; SSH fallback via `CI_USE_SSH=1`; SSM shell commands; pre-generated `CI_LOG_ID` | Single entry point switches between SSM and SSH |
| `ci3/bootstrap_ssm` | New file — SSM equivalent of `bootstrap_ec2`: launches EC2 without a key pair, waits for the SSM agent, sends the command via `send-command`, polls for completion, handles spot eviction | Replaces SSH for remote build execution (sketched below) |
| `ci3/aws_request_instance_type` | Conditional `KeyName`, `IamInstanceProfile`, `SecurityGroupIds`; skip SSH wait when no key | Supports both SSH and SSM launch modes |
| `ci3/source_cache` | `cache_persistent` always writes Redis + S3 (no flags); S3 transfer functions added; disk functions kept as legacy dead code | Logs always land in S3 regardless of mode |
| `ci3/source_redis` | Skip SSH tunnel when `CI_SSM_MODE=1` | Direct Redis access via security group |
| `ci3/cache_upload` | Skip AWS key check when `CI_SSM_MODE=1` | Uses instance profile (IMDS) for credentials |
| `ci3/log_ci_run` | Use pre-generated `CI_LOG_ID` if set | SSM can't return values mid-run |
| `ci3/dashboard/rk.py` | Removed all disk-based reads; Redis -> S3 only; `list_available_flows` reads S3 | Dashboard no longer needs `/logs-disk` volume mount |
| `ci3/dashboard/requirements.txt` | Added `boto3` | S3 SDK dependency |
| `ci3/dashboard/deploy-test.sh` | New file — deploys a test dashboard instance via SSM on ports 8082/8083 | Test dashboard without SSH access |
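
The core of `ci3/bootstrap_ssm` is a send-and-poll loop against the SSM API. A minimal sketch using the AWS CLI, assuming the instance is already launched and `$instance_id` and `$build_cmd` are set; the real script's spot-eviction recovery and output streaming are omitted:

```bash
# Sketch of the send/poll core of ci3/bootstrap_ssm (illustrative).
# Wait for the SSM agent to register before sending anything.
while [ "$(aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$instance_id" \
    --query 'InstanceInformationList[0].PingStatus' --output text)" != "Online" ]; do
  sleep 5
done

# Deliver the build command; unlike SSH, this returns immediately with an id.
command_id=$(aws ssm send-command \
  --instance-ids "$instance_id" \
  --document-name "AWS-RunShellScript" \
  --parameters "commands=[\"$build_cmd\"]" \
  --query 'Command.CommandId' --output text)

# Poll until the invocation leaves the in-flight states.
while :; do
  status=$(aws ssm get-command-invocation \
    --command-id "$command_id" --instance-id "$instance_id" \
    --query 'Status' --output text)
  case "$status" in
    Success) exit 0 ;;
    Failed|Cancelled|TimedOut) exit 1 ;;   # spot eviction surfaces here too
    *) sleep 10 ;;
  esac
done
```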

## How to test

- Add the `ci-ssm` label to this PR to force SSM mode
- Toggle the `CI_USE_SSH` repo variable in Settings > Variables > Actions
- Test dashboard at ci.aztec-labs.com:8082
- Verify logs appear in S3 and on the dashboard after a CI run

## Rollback

- Set `CI_USE_SSH=1` in repo variables (Settings > Variables > Actions) to revert globally
- All SSH infrastructure (keys, secrets, bastion) remains intact as a fallback
- Per-PR: remove the `ci-ssm` label to use the global setting

🤖 Generated with Claude Code

@randyquaye randyquaye marked this pull request as ready for review February 19, 2026 13:03
@randyquaye randyquaye force-pushed the rq/ssm-migration branch 6 times, most recently from 3fbf4e2 to 05e27d1 Compare February 19, 2026 15:54
@randyquaye randyquaye force-pushed the rq/ssm-migration branch 2 times, most recently from bcdc854 to 785532a Compare March 5, 2026 12:15
@randyquaye randyquaye changed the title feat: migrate CI from SSH to SSM feat: migrate CI from SSH to SSM with SSH fallback Mar 5, 2026
@randyquaye randyquaye force-pushed the rq/ssm-migration branch 3 times, most recently from 5895b43 to 504b146 Compare March 5, 2026 13:03
@randyquaye randyquaye marked this pull request as draft March 5, 2026 13:17
@randyquaye randyquaye marked this pull request as ready for review March 5, 2026 13:30
@randyquaye randyquaye added the ci-ssm (CI: Use SSM mode for CI3) label Mar 5, 2026
@randyquaye randyquaye removed the ci-ssm (CI: Use SSM mode for CI3) label Mar 6, 2026
@randyquaye randyquaye force-pushed the rq/ssm-migration branch 5 times, most recently from da34f09 to 9102a45 Compare March 6, 2026 12:18
@randyquaye randyquaye requested a review from charlielye March 6, 2026 12:36
@randyquaye randyquaye added this pull request to the merge queue Mar 6, 2026
@randyquaye randyquaye removed this pull request from the merge queue due to a manual request Mar 6, 2026
@ludamad ludamad added the ci-release-pr (Creates a development tag and runs the release suite) label Mar 6, 2026
@ludamad ludamad enabled auto-merge March 6, 2026 19:42
@AztecBot AztecBot removed the ci-release-pr (Creates a development tag and runs the release suite) label Mar 6, 2026
@ludamad ludamad force-pushed the rq/ssm-migration branch from f504929 to db765a8 Compare March 6, 2026 22:45
@ludamad ludamad added the ci-release-pr (Creates a development tag and runs the release suite) label Mar 6, 2026
@AztecBot AztecBot removed the ci-release-pr (Creates a development tag and runs the release suite) label Mar 6, 2026
@ludamad ludamad added this pull request to the merge queue Mar 7, 2026
Merged via the queue into next with commit f71c235 Mar 7, 2026
18 checks passed
@ludamad ludamad deleted the rq/ssm-migration branch March 7, 2026 00:57
PhilWindle pushed a commit that referenced this pull request Mar 10, 2026
## Summary

Fixes a race condition in the `duplicate_attestation_slash` and
`duplicate_proposal_slash` e2e tests that caused intermittent timeouts
waiting for slashing offenses to be detected.

## Root Cause

Investigated CI failure
[2abee794200ae6a7](http://ci.aztec-labs.com/2abee794200ae6a7) (commit
`f71c235a670d`, merge queue for PR #20555). This run **did include** the
fix from #20990, yet the `duplicate_attestation_slash` test still failed
with a timeout after ~478 seconds.

The failure sequence from the logs:
1. `awaitEpochWithProposer` found the malicious proposer at **slot 14**
(epoch 7)
2. The function returned, having warped L1 time to the start of epoch 7
3. Sequencers were then started (`await Promise.all(nodes.map(n =>
n.getSequencer()!.start()))`)
4. By the time sequencers were ready to build blocks, L1 time had
advanced past epoch 7 into **epoch 8** (first block built at slot 16)
5. The malicious proposer was never selected in epoch 8+, so no
duplicate proposals/attestations were created
6. The slasher had nothing to detect, and the test timed out

The core issue: `awaitEpochWithProposer` warped to the target epoch and
returned, but starting sequencers takes real time. With only 2 slots per
epoch (48s total), the epoch passed before sequencers could act.

## Fix

Renamed `awaitEpochWithProposer` to `advanceToEpochBeforeProposer` and
changed the approach to a two-phase pattern:

1. **Find phase**: The function now checks the **next** epoch's slots
(N+1) instead of the current epoch's (N). When the target proposer is
found, it returns `{ targetEpoch }` while staying at epoch N -- one full
epoch before the target.

2. **Start + warp phase** (in the test): After the function returns,
sequencers are started while still one epoch before the target. Only
then does the test warp to the target epoch via
`advanceToEpoch(targetEpoch)`.

This eliminates the race because sequencers are already running when the
target epoch begins.

The function can safely query future epoch slots because
`epochCache.getProposerAttesterAddressInSlot` works for any slot within
the `lagInEpochsForValidatorSet` window (typically 2 epochs ahead), and
we only look 1 epoch ahead.
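
In test code, the two-phase pattern looks roughly like this (a sketch, not the literal diff; `advanceToEpochBeforeProposer`, `advanceToEpoch`, and the sequencer-start line come from this PR's description, while `maliciousProposer` and the surrounding setup are illustrative):

```ts
// Phase 1: find an upcoming epoch whose slots select the malicious
// proposer, without warping past it. On return we are still one full
// epoch before targetEpoch.
const { targetEpoch } = await advanceToEpochBeforeProposer(maliciousProposer);

// Phase 2: start the sequencers while there is still a whole epoch of
// headroom, so they are ready the moment the target epoch begins.
await Promise.all(nodes.map(n => n.getSequencer()!.start()));

// Only now warp L1 time to the epoch where the duplicate proposals and
// attestations should be produced.
await advanceToEpoch(targetEpoch);
```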

## Changes

- **`shared.ts`**: Renamed `awaitEpochWithProposer` ->
`advanceToEpochBeforeProposer`. Now checks next epoch's slots and
returns `{ targetEpoch: EpochNumber }` instead of `void`.
- **`duplicate_attestation_slash.test.ts`**: Updated to start sequencers
before warping to target epoch.
- **`duplicate_proposal_slash.test.ts`**: Same change. Also filtered
offense assertions to only check `DUPLICATE_PROPOSAL` offenses, since
the two malicious nodes sharing the same key each self-attest to their
own (different) checkpoint proposals, causing honest nodes to also
detect a `DUPLICATE_ATTESTATION` offense.

Fixes A-632
AztecBot pushed a commit that referenced this pull request Mar 19, 2026