test: add E2E chaos and networking resilience tests #1545
Open
AlexCheema wants to merge 24 commits into main
Add a Python/asyncio E2E test framework that spins up 2-node exo clusters in Docker Compose and verifies cluster formation, discovery, election, and API health. Includes a no-internet chaos test using DNS blocking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The runner was running out of disk space during the Docker image build (Rust compilation + Python deps). Remove unused toolchains first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clean up Rust target/ and cargo registry after uv sync in the same RUN command so build artifacts aren't committed to the layer (~1-2 GB saved). Also remove more unused toolchains from the CI runner. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use iptables to block all outbound traffic except private subnets and multicast (for mDNS discovery). Verify internet is blocked by curling huggingface.co from inside each container and checking exo logs for "Internet connectivity: False". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
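A rough way to reason about which destinations such a rule set lets through, sketched with Python's standard `ipaddress` module. The function name and the exact ranges are assumptions for illustration; the iptables rules themselves are not shown in this commit message.

```python
import ipaddress

# Assumed allow-list: RFC 1918 private ranges plus IPv4 multicast
# (mDNS uses 224.0.0.251). The real rule set may differ.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("224.0.0.0/4"),
]


def egress_allowed(destination: str) -> bool:
    """Return True if an IPv4 destination falls inside the allow-list."""
    addr = ipaddress.ip_address(destination)
    return any(addr in net for net in ALLOWED_NETWORKS)
```

Under these assumptions, a Docker bridge address such as 172.18.0.2 and the mDNS group 224.0.0.251 pass, while a public address such as 8.8.8.8 is dropped.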
Launch mlx-community/Qwen3-0.6B-4bit on the cluster, send a chat completion with seed=42 and temperature=0, and verify the output matches a committed snapshot. Tests inference determinism end-to-end. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MLX CPU inference on x86_64 is too slow for CI runners (~10min+ for a single request). Mark the inference snapshot test as slow so it's skipped by default. Run with --slow or E2E_SLOW=1 on Apple Silicon. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
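A minimal sketch of how the opt-in could be checked, assuming the runner looks at both its arguments and the environment. `slow_tests_enabled` is a hypothetical helper, not a function from the PR.

```python
import os
from typing import Mapping, Optional, Sequence


def slow_tests_enabled(
    argv: Sequence[str], env: Optional[Mapping[str, str]] = None
) -> bool:
    """Slow tests run only when --slow is passed or E2E_SLOW=1 is set."""
    env = os.environ if env is None else env
    return "--slow" in argv or env.get("E2E_SLOW") == "1"
```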
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…st collection The tests/start_distributed_test.py script calls sys.exit() at module level, which crashes pytest collection. Exclude it via collect_ignore. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
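For reference, pytest reads a module-level `collect_ignore` list from `conftest.py`, with paths relative to that file's directory. A sketch, assuming the conftest.py sits at the repository root:

```python
# conftest.py (sketch; assumes this file lives at the repository root)
# pytest skips these paths during collection, so the module-level
# sys.exit() in the script is never executed by the test run.
collect_ignore = ["tests/start_distributed_test.py"]
```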
Add e2e/snapshot.py with assert_snapshot() for deterministic regression testing. On first run, saves inference output as the expected snapshot. On subsequent runs, compares against it with unified diff on mismatch. Set UPDATE_SNAPSHOTS=1 or pass --update-snapshots to regenerate. Refactor test_inference_snapshot.py to use the shared infrastructure and drop temperature=0 in favor of seed-only determinism. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
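The behaviour described above can be sketched as follows. This is an illustration, not the contents of e2e/snapshot.py: the signature, the `snapshot_dir` and `update` parameters, and the JSON layout are assumptions.

```python
import difflib
import json
import os
from pathlib import Path


def assert_snapshot(name, actual, snapshot_dir, update=None):
    """Compare `actual` text against a stored snapshot, saving it on first run."""
    if update is None:
        update = os.environ.get("UPDATE_SNAPSHOTS") == "1"
    path = Path(snapshot_dir) / f"{name}.json"

    # First run, or explicit regeneration: write the baseline and return.
    if update or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"output": actual}, indent=2))
        return

    expected = json.loads(path.read_text())["output"]
    if expected != actual:
        diff = "\n".join(
            difflib.unified_diff(
                expected.splitlines(),
                actual.splitlines(),
                fromfile="expected",
                tofile="actual",
                lineterm="",
            )
        )
        raise AssertionError(f"Snapshot mismatch for {name!r}:\n{diff}")
```

A later commit in this PR makes a missing baseline a failure rather than silently creating one; in this sketch that would mean raising in the `not path.exists()` case unless `update` is set.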
…nd edge cases
Expand e2e snapshot coverage beyond the single 'What is 2+2?' test:
- test_snapshot_code_gen.py: code generation prompt (max_tokens=64)
- test_snapshot_reasoning.py: step-by-step math reasoning (max_tokens=64)
- test_snapshot_long_output.py: longer response with max_tokens=128
- test_snapshot_edge.py: single word, special chars, and unicode prompts
All use seed=42 and the shared assert_snapshot() infrastructure.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MLX already supports x86 CPU via mlx[cpu] and the Dockerfile has the
GCC workaround for CPU JIT. The only barriers were the 'slow' markers
causing tests to be skipped in CI.
Changes:
- Remove 'slow' marker from all snapshot tests so they run by default
- Make snapshots architecture-aware (snapshots/{arch}/{name}.json) since
floating-point results differ between x86_64 and arm64
- Store architecture in snapshot metadata
- Increase CI timeout from 30 to 45 minutes for model download + CPU inference
- Update docstrings to remove Apple Silicon requirement
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
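A sketch of the snapshots/{arch}/{name}.json layout, assuming the architecture string comes from `platform.machine()`. `snapshot_path` is a hypothetical helper; the PR may derive the name differently.

```python
import platform
from pathlib import Path


def snapshot_path(name, root="snapshots", arch=None):
    """Build snapshots/{arch}/{name}.json for the given or current architecture."""
    # platform.machine() typically reports "x86_64" on Intel/AMD Linux,
    # "arm64" on Apple Silicon macOS, and "aarch64" on arm64 Linux.
    arch = arch or platform.machine()
    return Path(root) / arch / f"{name}.json"
```

Because arm64 Linux reports "aarch64" rather than "arm64", a real implementation might need to normalise the two names if snapshots are meant to be shared between them.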
Pre-build the Docker image using docker/build-push-action with GitHub Actions cache (type=gha). On cache hit, the image loads from cache instead of rebuilding (~12min → seconds).
Changes:
- CI: set up buildx, build image with --cache-from/--cache-to type=gha
- docker-compose.yml: add image tag (exo-e2e:latest) so compose uses the pre-built image instead of rebuilding
- conftest.py: Cluster.build() skips if exo-e2e:latest already exists (pre-built in CI), falls back to docker compose build for local dev
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
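The skip-if-present logic could look roughly like this. The helper names and the injectable `run` parameter are assumptions made so the sketch can be exercised without Docker; `docker image inspect` exits non-zero when the image is absent.

```python
import subprocess


def image_exists(tag, run=subprocess.run):
    """Return True if `docker image inspect` finds the tag locally."""
    result = run(
        ["docker", "image", "inspect", tag],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def build_if_missing(tag="exo-e2e:latest", run=subprocess.run):
    """Skip the build when the image is already present; return True if built."""
    if image_exists(tag, run):
        return False
    # Local-dev fallback: let compose build the image from the Dockerfile.
    run(["docker", "compose", "build"], check=True)
    return True
```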
Add e2e snapshot test that exercises 3 different model architectures
to catch model-specific regressions:
- SmolLM2-135M-Instruct (tiny llama, bf16, ~269MB)
- Llama-3.2-1B-Instruct-4bit (small llama, 4bit, ~730MB)
- gemma-2-2b-it-4bit (gemma2 architecture, 4bit, ~1.5GB)
Each model gets its own snapshot file. All use the same prompt
("What is the capital of France?"), seed=42, max_tokens=32.
Also adds model cards for SmolLM2-135M-Instruct and gemma-2-2b-it-4bit
(Llama-3.2-1B-Instruct-4bit already had one).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two issues prevented MLX CPU from working on x86_64 in Docker:
1. Missing BLAS/LAPACK libraries: MLX CPU backend requires libblas-dev, liblapack-dev, and liblapacke-dev on Linux. Added to apt-get install.
2. g++ wrapper ordering: The -fpermissive wrapper for GCC 14 was installed AFTER uv sync, but MLX may compile extensions during install. Moved the wrapper BEFORE uv sync so both build-time and runtime JIT compilation benefit from the fix.
MLX publishes manylinux_2_35_x86_64 wheels, so this uses the native CPU backend; no alternative inference framework is needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add proactive monitoring to detect runner process death and unresponsiveness:
- Health check loop polls is_alive() every 1s, detects unexpected exits
- Counter-based heartbeat detects frozen/unresponsive processes
- Emits RunnerFailed event and releases pending task waiters on failure
- Add EXO_RUNNER_MUST_DIE debug trigger for testing abrupt process death
- Add chaos E2E test that kills runner mid-inference
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
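The two detection mechanisms above can be sketched as a single asyncio loop. This is an illustration under assumptions, not the supervisor code from the PR: `monitor_runner`, `read_heartbeat`, and `max_stalled_polls` are hypothetical names, and the real code emits a RunnerFailed event where this sketch returns a reason string.

```python
import asyncio
from typing import Callable


async def monitor_runner(
    is_alive: Callable[[], bool],
    read_heartbeat: Callable[[], int],
    poll_interval: float = 1.0,
    max_stalled_polls: int = 5,
) -> str:
    """Return a failure reason once the runner exits or stops making progress."""
    last_count = read_heartbeat()
    stalled = 0
    while True:
        await asyncio.sleep(poll_interval)
        # Process death: the OS-level liveness check fails.
        if not is_alive():
            return "exited"
        # Frozen process: still alive, but its heartbeat counter has not
        # advanced for several consecutive polls.
        count = read_heartbeat()
        if count == last_count:
            stalled += 1
            if stalled >= max_stalled_polls:
                return "unresponsive"
        else:
            last_count, stalled = count, 0
```

Comparing a counter rather than a timestamp avoids clock-skew questions between processes: the monitor only needs to see that the value changed.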
…lection Add root conftest.py to exclude tests/start_distributed_test.py from pytest collection (it calls sys.exit at module level). Fix ruff lint issues (import sorting, f-string without placeholders, lambda loop variable capture) and apply nix fmt formatting to e2e files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Snapshot tests do MLX inference on x86 CPU in Docker which takes >600s per test, causing the 45-minute CI job to timeout. Only cluster_formation and no_internet (non-inference tests) should run in CI. Inference snapshot tests can be run locally with --slow or E2E_SLOW=1. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Scope e2e workflow to only trigger on pushes to e2e-tests branch (not every branch push)
- Add temperature=0 to remaining snapshot test chat calls for deterministic output
- Make assert_snapshot fail when no baseline exists instead of silently creating one; baselines must be explicitly generated with UPDATE_SNAPSHOTS=1
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Docker mDNS discovery can be slow on first boot in CI, causing cluster_formation to timeout on "Nodes discovered each other" while subsequent tests pass fine. Retry failed tests once before counting them as real failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
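A minimal sketch of the retry-once policy, assuming each test is an async callable that raises on failure. `run_with_retry` is a hypothetical name, not the runner's actual API.

```python
import asyncio


async def run_with_retry(test, retries=1):
    """Run an async test, retrying up to `retries` times; return attempts used."""
    attempts = 0
    while True:
        attempts += 1
        try:
            await test()
            return attempts
        except Exception:
            # Only the final attempt's failure is reported as a real failure.
            if attempts > retries:
                raise
```

A retry hides genuinely flaky behaviour as well as slow startup, so logging the first failure before retrying is worth considering.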
After merging main (api cancellation #1276), the RunnerSupervisor dataclass requires a _cancel_sender field. Update the test helper to create and pass this channel. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 17 E2E chaos tests across 6 test modules exercising the coordination layer without Docker, networking, or GPU dependencies:
- Networking resilience: disconnect/reconnect, node timeout, concurrent writers
- Failure recovery: master crash/re-election, runner failure, rapid node joins
- Client disconnect: task cancellation, rapid cancel/no stuck tasks
- Node join/leave: dynamic registration, removal cleanup, join/leave churn
- Distributed model loading: multi-node sharding, single-node, 3-node sharding
- Concurrent requests: no corruption, multi-model routing, monotonic indexing
Uses MiniCluster harness wiring Master + Workers via in-process channels.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
891166a to 23d5b33
17 new tests covering networking resilience, failure recovery, client disconnect handling, node join/leave, distributed loading, and concurrent requests.