chore: fix kv-store browser tests hangs#22721
Open
mverzilli wants to merge 12 commits intomerge-train/fairiesfrom
Open
chore: fix kv-store browser tests hangs#22721mverzilli wants to merge 12 commits intomerge-train/fairiesfrom
mverzilli wants to merge 12 commits intomerge-train/fairiesfrom
Conversation
Temporary diagnostic wrapper around `yarn test:browser` that runs with a 90s timeout and captures system state (cgroup cpu.stat, memory, /tmp usage, process states) every 1s to /tmp/diag/probe.log, then dumps the tail to stderr on exit. Goal: diagnose the CI hang documented in PR #22693 where the kv-store browser tests hang silently around test file 5 only in the CI ISOLATE container, not reproducible from the same docker args locally. Revert once root-caused.
Upgrade the temporary CI diagnostic wrapper so the next hang captures
enough ground truth to identify the stall site.
- Dump full probe + stacks log on exit (v1 tail -400 truncated the
transition moment).
- Hybrid cadence: 1s steady, 0.1s burst when `yarn test:browser` stdout
has been silent for >3s, capped at ~20s dense sampling.
- Per-snapshot: /proc/net/{tcp,tcp6,unix} state counts.
- Per-burst stacks: lsof -i TCP -U, /proc/\$pid/syscall, per-thread wchan
histogram, socket/pipe/anon_inode FD symlinks.
- Widen pids_of_interest to node|vitest|esbuild|yarn so we see the full
vitest process tree, not just chromium.
Two small tweaks after observing a passing CI run (6e65957a4521a43d): - Capture stacks_snapshot every 10 steady iterations (~10s of wall time) so we have "healthy waiting" reference data to diff against burst-triggered snapshots during a future hang. Without this, a passing run produced zero in-flight stacks because the burst trigger never armed. - Silence the benign "No such file or directory" race when /proc/\$pid/ disappears mid-snapshot by moving the redirect around the read, not around the pipe (the pipe's output-side didn't inherit it).
Force a fixed 100k-iteration awk arithmetic loop under bash's `time`
keyword each snapshot; emit real/user/sys as a "microbench:" line.
The (user+sys)/real ratio disambiguates the two scenarios consistent
with the v1 hang data (flat usage_usec, threads on futex_wait_queue):
- scenario A (vitest/chromium IPC deadlock): no process in our cgroup
is demanding CPU, so microbench gets all it asks for -> ratio ~1.0
- scenario B (host-level preemption / noisy neighbor): microbench
demands CPU but host scheduler doesn't serve it -> ratio <<1.0
Fires every 5th steady iteration (~0.4% overhead during tests) and
every burst iteration (overhead irrelevant during a hang since tests
aren't running anyway).
…eadlock Vitest+chromium hit a CDP CLOSE_WAIT deadlock at test-file transitions under CPU-constrained environments (CI3 ISOLATE: --cpus=2). Vitest closes 6 of 10 CDP TCP connections in a teardown batch; chromium's network service can't drain them under contention; vitest's teardown blocks indefinitely on the close-handshake. Both processes end up with zero on-CPU threads (vitest on futex_wait_queue, chromium on ep_poll). Outer ci3 timeout fires after ~90s of silence, CI fails. Fix: run each test file in a separate vitest invocation, so CDP teardown only happens at process exit (where the OS reaps everything). Cost is ~5-10s of vite startup per file, ~60-100s extra per browser test run. Also adds: - scripts/repro-browser-hang.sh: reproduces the hang reliably under docker_isolate (--cpus=2 --memory=8g --tmpfs /tmp:exec,size=1g) for future regression checks. - scripts/probe-test-browser.sh: deep diagnostic wrapper with hybrid steady/burst sampling of /proc state, cgroup counters, and socket state. Used to root-cause this bug; kept for future debugging. - .test_patterns.yml: transfer kv-store TIMEOUT flake-entry ownership to martin (the original owner alex held it from a prior incident). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… martin Catch-all entry on yarn-project/kv-store with error_regex "vitest" sends any browser-shaped failure (anything emitting "vitest" in its output) to martin, so colleagues aren't blocked by residual flakiness while the new run-browser-tests.sh approach soaks in. Existing narrow entries still apply — owners are unioned across matching entries — so grego still gets pinged for his historical patterns. Intended to be removed after a few weeks of stable CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The merge brought in next's package.json with "test": "yarn test:node" (the historical disable workaround), overwriting the chained test set in the fix commit. package.local.json still has the correct override; this regenerates package.json from it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vitest.config.ts only includes src/sqlite-opfs/**/*.test.ts when VITE_SQLITE_OPFS=1 (gated since #22693 due to a separate hang). The wrapper script enumerated both directories unconditionally, so vitest received a positional path that the include filter excluded, returned "no test files found", and exited 1 — killing the loop via set -e. Mirror the gate at file-enumeration time so the script only feeds vitest paths that the config will accept.
…olation The VITE_SQLITE_OPFS gate was added in #22693 as a workaround for the 2-CPU CI hang. That hang was a vitest+chromium CDP teardown deadlock at test-file transitions, now fixed by per-file vitest invocation (see run-browser-tests.sh). The gating was always meant to be removed once the hang was root-caused. Drop the env-flag gate in vitest.config.ts and stop mirroring it in the wrapper script — vitest now discovers sqlite-opfs tests by default and the wrapper enumerates them in the same loop as the indexeddb files. The existing wide-net catch-all in .test_patterns.yml continues to quarantine any residual flakiness to martin so colleagues aren't blocked.
CI's bootstrap test_cmds machinery distributes each emitted command to a separate parallel slot. kv-store was previously emitted as a single line (`cd yarn-project/kv-store && yarn test`), which collapsed all 28 of its test files into one slot — leaving the rest of the executor idle while the package serialised through every file. Add kv-store/bootstrap.sh with a test_cmds function that emits one line per test file, mirroring the aztec/bootstrap.sh pattern. New kv-store/scripts/run_test.sh dispatches a single file to vitest (browser tests under indexeddb, sqlite-opfs) or mocha (everything else). Browser files keep ISOLATE=1; node files run unisolated like other packages. The local-dev `yarn test:browser` path (run-browser-tests.sh) now also goes through run_test.sh, so per-file invocation has a single source of truth for both CI and local reproduction.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
yarn-project/kv-store/yarn test:browserruns ~11 vitest browser-mode test files in a single vitest process, all sharing one chromium browser. In CI3'sISOLATEenvironment (docker run --cpus=2 --memory=8g --tmpfs /tmp:exec,size=1g), the run would intermittently hang at a test-file boundary, sit silent for >80s, then get killed by the outer CI timeout. Roughly 25% hit rate on CI; zero hit rate on developer machines.The bug only manifests under CPU pressure. On unconstrained hardware (developer boxes with many cores), every chromium thread always gets scheduled in time and the bug never surfaces.
Root caused to a CDP teardown deadlock between vitest and chromium at test-file transitions:
headless_shell --type=utility --utility-sub-type=network.mojom.NetworkService(the chromium "network service" process) at startup.CLOSE_WAIT(peer closed, application hasn't calledclose()yet).--cpus=2, never gets scheduled to drain those closed FDs and callclose().Promise.alloverCDPSessiondetach), and never gets the response.vitestparked onfutex_wait_queue, chromium parked onep_poll-> deadlockThe hang is at upstream code (@vitest/browser-playwright ↔ playwright ↔ chromium CDP). The 6:4 close/open ratio reproduces exactly across runs.
The fix is to run each test file in a separate vitest invocation, instead of one vitest process iterating over all files.
yarn-project/kv-store/scripts/run-browser-tests.sh: new shell loop that finds each *.test.ts under src/indexeddb/ and src/sqlite-opfs/ and runs yarn vitest run "$file" per file, sequentially.yarn-project/kv-store/package.local.json: test:browser overridden to bash scripts/run-browser-tests.shThis avoids the bug entirely because the cross-file teardown path is never exercised: each vitest process only has to tear down at its own end-of-process, where chromium gets killed outright by the OS rather than asked to close gracefully via CDP.
Cost: ~5–10s of vite/yarn startup per file. With 11 files, that's ~60–100s of extra wall time per yarn test:browser run. Not great, but if we can stabilize the suite and keep it running, we can iteratively look for better ways (eg: reducing the amount of files to reduce overhead where sensible).
Verification
A reliable repro for future regressions is saved at yarn-project/kv-store/scripts/repro-browser-hang.sh. It runs the previous (single-process) test:browser shape under docker_isolate constraints; reproduces consistently (without this fix) in ~3 minutes.
Closes F-589