LTX-2 Phase 2: subprocess pool for per-shape persistent generators #4
Open
Johan-de-R wants to merge 9 commits into
Conversation
…, benchmark tooling

- Adds LTX-2 video generation served via Dynamo's diffusers worker.
- Pins FastVideo upstream SHA in Dockerfile for reproducible builds.
- Re-adds enable_nats=False to DistributedRuntime ctor (lost in upstream rebase).
- Adds --served-model-name flag so the worker can advertise under a customer-facing name decoupled from the weights path.
- Adds benchmark.py flags --shuffle-seed and --num-prompts for validation runs.

Includes two superseded torch._dynamo.reset() attempts (between-shape and in-preflight) that proved insufficient for the cache order-dependence problem; the working fix is the subprocess isolation approach landed in the follow-up Phase 1 commit. See deepinfra/backend/claude_plans/2026-05-13-ltx2-cache-order-dependence.md for the journey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
torch.compile / inductor / Triton cache keys fold in accumulated in-process state, not just shape parameters. Compiling shapes A → B → C in different orders produces different on-disk keys, so a single-process warmup builds a cache that production can't reliably hit.

Fix: each shape is warmed in a fresh Python subprocess with dedicated per-shape TORCHINDUCTOR_CACHE_DIR and TRITON_CACHE_DIR, so cache keys are deterministic. Also switches the warmup model from the HF repo id to the local /data/default weights path to avoid network fetches.

Validated end-to-end: a fresh subprocess pointed at /cache/per-shape/<shape>/ hits cleanly with zero new writes; 3.5-6x speedup vs cold compile. See deepinfra/backend/claude_plans/2026-05-14-ltx2-per-shape-cache.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
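For illustration, a minimal sketch of the per-shape warmup dispatch described above. The warmup CLI flags and the shape list shown here are assumptions; only the per-shape TORCHINDUCTOR_CACHE_DIR / TRITON_CACHE_DIR isolation mirrors what the commit describes.

```python
import os
import subprocess
import sys

CACHE_ROOT = "/cache/per-shape"  # per-shape cache layout produced by Phase 1

def warm_shape(shape_key: str) -> None:
    """Compile one shape in a fresh interpreter with dedicated cache dirs."""
    shape_dir = os.path.join(CACHE_ROOT, shape_key)
    env = dict(os.environ)
    env["TORCHINDUCTOR_CACHE_DIR"] = os.path.join(shape_dir, "torchinductor")
    env["TRITON_CACHE_DIR"] = os.path.join(shape_dir, "triton")
    # Fresh process per shape: no accumulated dynamo/inductor/Triton state from
    # previously compiled shapes can leak into this shape's cache keys.
    subprocess.run(
        [sys.executable, "-m", "warmup",       # hypothetical CLI entrypoint
         "--shape", shape_key,
         "--model-path", "/data/default"],     # local weights, no HF fetch
        env=env,
        check=True,
    )

if __name__ == "__main__":
    for shape_key in ["768x512@121f"]:         # warm each shape on the menu
        warm_shape(shape_key)
```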
…Phase 2)

Routes each customer request to a persistent shape-pinned subprocess that consumes the Phase 1 per-shape compile cache. K=2 LRU pool on B200 (sized to fit two FastVideo Worker children at ~71 GiB each; not the ~25-35 GiB originally estimated from the model card). Opt-in via LTX2_POOL_MODE=1; default-off keeps the legacy in-process generator unchanged for the soft-launch period.

IPC uses a multiprocessing.Connection on a dedicated fd not derived from stdout/stderr. An earlier text-line protocol on stdout produced two whack-a-mole desync bugs in one day because FastVideo's vllm-style logger and its multiproc_executor Worker child both write log lines to stdout post-READY. Pickled-dict messages on a private fd eliminate that failure class by construction.

Validation: 7-step end-to-end pass on di-slc-39 (B200, K=2); 6 PASS + 1 documented anomaly (FastVideo's multiproc_executor terminates the outer pool subprocess as a side effect of Worker-child crashes on its own validation errors; the system recovers via respawn at ~3 min cold-compile cost, a soft-DoS surface for which followup mitigations are listed in the plan doc).

See deepinfra/backend/claude_plans/2026-05-14-ltx2-phase2-subprocess-pool.md and examples/diffusers/ARCHITECTURE.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
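A rough sketch of the Connection-on-a-private-fd spawn described above, parent side only. The worker entrypoint, flag names, and handshake shape are illustrative rather than the actual worker.py code, and the real pool also drains stdout/stderr into diagnostics instead of discarding them.

```python
import multiprocessing as mp
import subprocess
import sys

def spawn_pool_subprocess(shape_key: str):
    """Spawn a shape-pinned child and hand it a private Connection fd."""
    parent_conn, child_conn = mp.Pipe(duplex=True)
    proc = subprocess.Popen(
        [sys.executable, "-m", "worker",               # hypothetical entrypoint
         "--pool-child", "--shape", shape_key,
         "--protocol-fd", str(child_conn.fileno())],
        pass_fds=(child_conn.fileno(),),               # keep the fd open in the child
        stdout=subprocess.DEVNULL,                     # log noise never touches the protocol
        stderr=subprocess.PIPE,                        # drained separately in the real pool
    )
    child_conn.close()                                 # parent keeps only its own end
    msg = parent_conn.recv()                           # pickled dict, e.g. {"type": "READY"}
    if msg.get("type") != "READY":
        raise RuntimeError(f"unexpected handshake: {msg!r}")
    return proc, parent_conn
```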
…n decisions
RUNBOOK.md is operational ("how do I add a shape, build an image,
roll back"). ARCHITECTURE.md is the architectural companion --
"why is the code shaped this way." A future maintainer (human or
AI) opening worker.py and wondering why the IPC plumbing exists,
why subprocess isolation is needed, or why K=2 on B200 instead of
something larger will find the answers here.
Organized by topic (subprocess isolation, persistent pool, K sizing,
LRU eviction, Connection-based IPC, opt-in via LTX2_POOL_MODE,
operational costs), with a synthesized "what we tried that was
wrong" section pulling lessons from Phase 1 and Phase 2.
Evergreen (no dates in body, won't go stale from continued work),
scannable (headings, short sections), and cross-references the
dated plan documents at deepinfra/backend/claude_plans/ for
historical narrative.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```python
if drainer is not None:
    drainer.cancel()
    # Await the cancelled drainer so it finishes unwinding; suppress both the
    # CancelledError the cancellation raises and any error the drainer itself
    # was already failing with.
    with suppress(asyncio.CancelledError, Exception):
        await drainer
```
…t disconnect
Two robustness gaps caught in code review of the Phase 2 worker:
1. SubprocessPool._spawn's post-Popen block only handled the three
"expected" recv exceptions ({asyncio.TimeoutError, EOFError, OSError,
ConnectionError}), then a separate isinstance/kind check raised
RuntimeError. Anything else from _recv_msg (e.g., pickle.UnpicklingError
on corrupt wire data, or any unanticipated error from asyncio.to_thread
or task creation) would propagate without tearing down the subprocess,
the drainer tasks, or the parent end of the protocol Pipe -- an fd leak
plus an orphaned subprocess.
Restructured so all post-Popen work runs inside a single outer try
with `except BaseException: await self._kill(handle); raise`. Each
inner branch still raises RuntimeError with diag tail; outer guard
makes cleanup unconditional. BaseException catches asyncio.CancelledError
and KeyboardInterrupt too, then re-raises the original error type so
callers see what they expected.
2. _pool_worker_main exited with an uncaught BrokenPipeError /
ConnectionResetError if the parent died mid-protocol -- e.g.,
between READY and the first REQUEST, or during a DONE/ERROR send
after generate_video completed but before the response landed. Now
wrapped in an outer try that catches BrokenPipeError /
ConnectionResetError / OSError, logs to stderr, closes conn, and
exits 0. The CUDA-fault FATAL send is also wrapped in
suppress(Exception) since the parent may have already gone away
between the fault and our attempt to report it.
The parent already detects subprocess death via the stderr drainer
and recv EOF; the subprocess exit code (0 vs 1) is incidental but
"clean exit on parent disconnect" reads better in logs than "Python
traceback nobody reads".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
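A hedged sketch of the two patterns this commit describes. pool.popen / pool.kill / pool.recv_msg and handle_request are stand-ins for the real SubprocessPool internals, and the timeout value is illustrative.

```python
import asyncio
import sys

SPAWN_TIMEOUT_S = 120.0  # illustrative, not the worker.py default

async def spawn_with_unconditional_cleanup(pool, shape_key):
    """Parent side: every post-Popen step runs under one outer try."""
    handle = pool.popen(shape_key)      # stand-in for Popen + pipe + drainer setup
    try:
        msg = await asyncio.wait_for(pool.recv_msg(handle), SPAWN_TIMEOUT_S)
        if msg.get("type") != "READY":
            raise RuntimeError(f"unexpected first message: {msg!r}")
        return handle
    except BaseException:
        # Catches CancelledError/KeyboardInterrupt as well as RuntimeError and
        # surprises like pickle.UnpicklingError: tear down the subprocess, the
        # drainer tasks, and the parent pipe end, then re-raise the original error.
        await pool.kill(handle)
        raise

def pool_worker_main(conn, handle_request):
    """Child side: exit 0 quietly if the parent disappears mid-protocol."""
    try:
        while True:
            request = conn.recv()       # raises EOFError once the parent closes its end
            conn.send({"type": "DONE", "result": handle_request(request)})
    except (EOFError, BrokenPipeError, ConnectionResetError, OSError) as exc:
        print(f"parent disconnected: {exc}", file=sys.stderr)
    finally:
        conn.close()
    sys.exit(0)
```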
Four small surgical fixes flagged by the CodeQL scan on PR #4. None change behavior; each just narrows what the static analyzer has to prove.

- worker.py (_kill): swap the bare `try: proc.terminate() except ProcessLookupError: pass` for `with suppress(ProcessLookupError): proc.terminate()`. Matches the style we already use for the SIGKILL fallback in the same function.
- warmup.py (_du_bytes): same swap; the file may legitimately vanish between os.walk listing it and stat reading it, so suppress(OSError) is exactly the intent. A comment makes the why explicit. Adds `import contextlib`.
- benchmark.py: wrap the csv-writer block in a `with open(...) as csv_file:` so the file handle is guaranteed closed even if the generation loop raises. Removes the explicit `csv_file.close()` at the bottom and indents the loop body.
- test_warmup_shapes.py: pre-declare `_compute_menu_hash = None` before the try/import. pytest.skip() raises, so the call site is unreachable on the ImportError path, but CodeQL doesn't model pytest's behavior. A one-line shim silences the warning without changing test semantics.

The two false-positive findings (worker.py _kill drainer-await flagged as "no effect") will be dismissed in the GitHub UI after this push triggers the re-scan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
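For reference, a minimal sketch of the suppress() swaps described in the first two bullets; the surrounding function bodies are simplified stand-ins, not the actual worker.py / warmup.py code.

```python
import contextlib
import os

def _kill_like(proc) -> None:
    # Same behavior as `try: proc.terminate() except ProcessLookupError: pass`,
    # but the intent is explicit and easy for a static analyzer to follow.
    with contextlib.suppress(ProcessLookupError):
        proc.terminate()

def _du_bytes_like(root: str) -> int:
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # A cache file may legitimately vanish between the os.walk listing
            # and the stat call; skipping it is the intended behavior.
            with contextlib.suppress(OSError):
                total += os.stat(os.path.join(dirpath, name)).st_size
    return total
```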
…s, requests, latencies, failures)
Exposes seven ltx2_pool_* series via a dedicated CollectorRegistry that
is wired into Dynamo's existing /metrics endpoint through
register_engine_metrics_callback. The callback registers
unconditionally so series exist (zero-valued) even when
LTX2_POOL_MODE=0, which keeps dashboards stable across the rollout
toggle. Dynamo auto-injects dynamo_namespace, dynamo_component,
dynamo_endpoint, worker_id, and model labels so these series carry
the same hierarchy labels as vLLM/SGLang worker metrics.
Metric → operational question:

ltx2_pool_size — current live subprocesses (gauge for capacity tracking)
ltx2_pool_spawn_total{shape_key} — spawn rate per shape (churn signal)
ltx2_pool_eviction_total{shape_key} — LRU eviction rate (pool-too-small signal)
ltx2_pool_request_total{shape_key,status} — DONE/ERROR/FATAL counts
ltx2_pool_request_latency_seconds{shape_key} — end-to-end latency, DONE path
ltx2_pool_cold_spawn_seconds{shape_key} — fork → READY (cold-start budget)
ltx2_pool_subprocess_failure_total{reason} — failure causes (cuda_fault, spawn_timeout, spawn_eof, spawn_parse_error, gen_timeout, gen_eof, desync, parse_error, send_failed)
Pool subprocesses do not have a Dynamo endpoint and therefore do not
register a callback; all accounting happens in the parent. The
FATAL-response branch in route() is the single point that records
cuda_fault, so subprocess-side state changes are observed exactly
once.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
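A sketch of the dedicated-registry wiring using prometheus_client, showing four of the seven series. The register_engine_metrics_callback hookup into Dynamo's /metrics endpoint is omitted, and the inc() call at the end is purely an illustrative usage example.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# Dedicated registry for the pool series; Dynamo wiring happens elsewhere.
POOL_REGISTRY = CollectorRegistry()

POOL_SIZE = Gauge(
    "ltx2_pool_size", "Current live pool subprocesses", registry=POOL_REGISTRY)
SPAWN_TOTAL = Counter(
    "ltx2_pool_spawn_total", "Pool subprocess spawns", ["shape_key"],
    registry=POOL_REGISTRY)
REQUEST_LATENCY = Histogram(
    "ltx2_pool_request_latency_seconds", "End-to-end request latency (DONE path)",
    ["shape_key"], registry=POOL_REGISTRY)
SUBPROCESS_FAILURE = Counter(
    "ltx2_pool_subprocess_failure_total", "Subprocess failure causes", ["reason"],
    registry=POOL_REGISTRY)

# Creating the metric objects unconditionally keeps the series present
# (zero-valued) even when LTX2_POOL_MODE=0, so dashboards stay stable.
SPAWN_TOTAL.labels(shape_key="768x512@121f").inc()  # example: record one spawn
```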
…sh orphan window

A pool subprocess loads FastVideo's multiproc_executor, which spawns a Worker child that pins ~70 GiB of GPU memory for the model. If the parent worker process dies mid-spawn (or mid-generation), the subprocess and its descendants are reparented to PID 1 and continue holding the GPU until container teardown -- on k8s, that's the liveness-probe-driven pod restart window, during which the node is effectively wedged.

prctl(PR_SET_PDEATHSIG, SIGTERM) asks the kernel to deliver SIGTERM to the subprocess at the moment its parent thread dies. The existing SIGTERM handler closes the protocol pipe and calls sys.exit(0), which propagates SIGTERM to the FastVideo Worker child via normal process-group teardown, freeing the GPU before the kernel can decide the node needs intervention.

Linux-only; failure (e.g. running on macOS in CI, or libc.so.6 absent) is logged to stderr and the subprocess keeps running -- the guard is defense-in-depth, not load-bearing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
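A minimal sketch of the parent-death-signal guard described above, assuming the standard ctypes route to prctl; the log wording is illustrative.

```python
import ctypes
import signal
import sys

PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>

def install_pdeathsig(sig: int = signal.SIGTERM) -> None:
    """Ask the kernel to SIGTERM this process when its parent thread dies."""
    try:
        libc = ctypes.CDLL("libc.so.6", use_errno=True)
        if libc.prctl(PR_SET_PDEATHSIG, sig, 0, 0, 0) != 0:
            raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")
    except Exception as exc:  # e.g. macOS in CI, or libc.so.6 absent
        # Defense-in-depth, not load-bearing: log and keep running.
        print(f"pdeathsig guard unavailable: {exc}", file=sys.stderr)
```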
… RUNBOOK

Pool metrics flow through Dynamo's worker-side system_status_server, which is disabled by default (DYN_SYSTEM_PORT=-1). Without explicit deployment config, our registered callback is silently dormant and operators see zero pool visibility -- the frontend's /metrics on the HTTP port does not aggregate worker hierarchies. Validated this gap in the Phase 2 e2e smoke; the misconfiguration would otherwise have shipped to production unnoticed.

A two-line guard in backend_worker emits a WARNING at startup when DYN_SYSTEM_PORT is unset or non-numeric (defensive int parse), and an INFO line confirming the port when it's enabled. The misconfiguration now surfaces in pod startup logs before any traffic flows.

RUNBOOK gets a new ## Metrics section after ## Known limits naming the seven ltx2_pool_* series, the auto-injected Dynamo hierarchy labels, the required deployment env (DYN_SYSTEM_PORT, DYN_SYSTEM_HOST, k8s containerPort, scrape config), and four recommended Prometheus alert expressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
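A sketch of the startup guard described above; the exact placement inside backend_worker and the log wording will differ.

```python
import logging
import os

logger = logging.getLogger(__name__)

def check_system_port() -> None:
    """Warn at startup if the metrics port is disabled; confirm it when set."""
    raw = os.environ.get("DYN_SYSTEM_PORT", "")
    try:
        port = int(raw)
    except ValueError:
        port = -1  # unset or non-numeric is treated as disabled
    if port <= 0:
        logger.warning(
            "DYN_SYSTEM_PORT is disabled or unset (%r); ltx2_pool_* metrics "
            "will not be scraped", raw)
    else:
        logger.info("system_status_server metrics enabled on port %d", port)
```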
Summary

Ships an opt-in subprocess pool that routes each customer request to a persistent shape-pinned Python subprocess. The pool consumes the per-shape compile-cache layout that Phase 1 produced under /cache/per-shape/<shape_key>/{torchinductor,triton}, amortizing the ~3 minute cold-start cost (Python boot + torch import + model load + cache hydrate + first-generate compile retrace) across all subsequent requests for the same shape. Pool mode is opt-in via LTX2_POOL_MODE=1; default-off keeps the legacy in-process generator unchanged for the soft-launch period.

The architectural decisions and lessons learned across the implementation are captured in examples/diffusers/ARCHITECTURE.md.

What's in this PR
Four commits, squashed from the original 15-commit development history to keep review focused. Each commit is a coherent reviewable unit; the full journey (including superseded attempts) is preserved in the companion deepinfra/backend PR's plan documents.
bb8f89ff796cbf11ac7a202c596ca1eab016
Total: 13 files changed, +3361 / −90. All under examples/diffusers/.

Commit 1 (Setup) adds the LTX-2 pipeline scaffolding: Dynamo diffusers worker, FastVideo pin, --served-model-name flag, benchmark tooling, RUNBOOK. Also folds in two superseded torch._dynamo.reset() attempts that proved insufficient for the cache order-dependence problem; full journey in deepinfra/backend/claude_plans/2026-05-13-ltx2-cache-order-dependence.md.

Commit 2 (Phase 1) introduces subprocess-per-shape warmup with per-shape TORCHINDUCTOR_CACHE_DIR and TRITON_CACHE_DIR. The fix for the order-dependent compile cache identified in Setup. Full narrative in deepinfra/backend/claude_plans/2026-05-14-ltx2-per-shape-cache.md.

Commit 3 (Phase 2) is the headline change: opt-in K=2 LRU subprocess pool with multiprocessing.Connection IPC. Includes the four rounds of implementation evolution (initial pool, fd-isolation attempt, Connection redesign, K=5→K=2 correction) folded together. Full narrative, including what was tried and rejected, in deepinfra/backend/claude_plans/2026-05-14-ltx2-phase2-subprocess-pool.md.

Commit 4 (ARCHITECTURE.md) is the evergreen design-rationale document that lives next to the code. Explains the why (subprocess isolation, persistent pool, K=2, LRU, Connection IPC, opt-in flag, operational costs) so a future maintainer reading worker.py doesn't have to dig through plan docs to understand the constraints.
Why subprocess isolation

torch.compile / inductor / Triton cache keys fold in accumulated in-process state, not just shape parameters. A single Python interpreter compiling shapes A → B → C produces different on-disk keys than B → A → C, and torch._dynamo.reset() does not clear the Triton autotuner / CUDA module / other kernel state that the key incorporates. Subprocess-per-shape with dedicated per-shape TORCHINDUCTOR_CACHE_DIR and TRITON_CACHE_DIR makes keys deterministic and lets a fresh production worker hit the cache baked by warmup. Full explanation in ARCHITECTURE.md § "Why subprocess isolation per shape".
Opt-in via LTX2_POOL_MODE

This PR ships pool code dormant. The legacy in-process path is default-on; flipping the flag is a separate model-config change. The two paths share _load_generator(model_name, num_gpus, enable_optimizations) so cache keys match byte-for-byte. After soft launch stabilizes, a follow-up commit flips the default and deletes the legacy path.
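A minimal sketch of the opt-in routing. pool.route() is named elsewhere in this PR; the create_video signature and the legacy_generator name here are assumptions.

```python
import os

POOL_MODE = os.environ.get("LTX2_POOL_MODE", "0") == "1"

async def create_video(request, pool, legacy_generator):
    """Route to the subprocess pool only when the flag is set."""
    if POOL_MODE:
        return await pool.route(request)              # shape-pinned subprocess path
    return await legacy_generator.generate(request)   # unchanged in-process path
```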
Test plan

Validated end-to-end on di-slc-39 (B200, GPU GPU-5f3391b7…d59d) through the full 7-step pool validation specified in backend/claude_plans/2026-05-14-ltx2-phase2-subprocess-pool.md. Post-squash Step 1 re-run produced a byte-identical MP4 (381,322 bytes) to the pre-squash round-4 validation, confirming zero behavioral drift.

Notable step outcomes:

- 768x512@121f cold spawn: multiproc_executor terminates the outer pool subprocess as a side effect of its Worker child crashing on StageVerificationError. System recovers via respawn at ~3 min cold-compile cost. See "Risk" below.
- kill -9 on pool subprocess: _get_or_spawn cleaned up + respawned cleanly, READY 25.9s, first-request 188s, HTTP 200 wall 214s. Valid MP4.
- LTX2_POOL_MODE unset: pool_mode=False, 0 spawns, server-side generation completed in 446s. The 446s vs 200s pool-mode delta on the same shape reconfirms the Phase 1 cache contract value.

Reproducibility: all per-step request payloads, response files, and worker.log slices preserved on di-slc-39 under /tmp/ltx2-phase2-validate/ (rounds 2-4 logs at logs.round{2,3,4}/, post-squash re-run at logs/).

Risk
Soft-DoS surface (from Step 5 finding). A customer (intentional or otherwise) sending malformed requests can keep one pool slot in permanent respawn at ~3 minutes per bad request. FastVideo's multiproc_executor does not survive its own StageVerificationError (or similar Worker-child crashes), so the pool subprocess dies even though our ERROR code path handles the protocol correctly. K=2 means there is little headroom. Mitigations are on the followup list (below). For the soft-launch period, where traffic is small and operators can be paged on slot churn, the risk is acceptable; harden before broad rollout.
Followups

Tracked in backend/claude_plans/2026-05-14-ltx2-phase2-subprocess-pool.md under "What we explicitly defer":

- Input validation in FastVideoBackend.create_video: reject num_inference_steps <= 0, num_frames <= 0, guidance_scale <= 0, etc. before the pool route. Primary mitigation for the soft-DoS surface.
- Can multiproc_executor be configured to respawn its Worker child rather than terminate its parent on Worker crash? If yes, drops cost-per-bad-request from ~3 min to milliseconds.
- … on real traffic distribution data).
- … stabilizes; deletes legacy in-process path).
- … per-shape memory).
- ai-dynamo 1.0.2 with Phase 2 worker.py + Connection IPC: Dockerfile inherits upstream's 1.0.0 → 1.0.2 bump between fork-point and merge. Phase 2 validation ran against the pre-bake image, which uses the older ai-dynamo; the next image bake will pull 1.0.2. Smoke-test for DistributedRuntime / register_llm API drift before tagging the new ship image.

Cross-ref

deepinfra/backend#2806
--served-model-name flag and adding the videogen startup probe: deepinfra/backend#2805

Reviewer checklist

- _pool_worker_main and SubprocessPool IPC contract matches ARCHITECTURE.md § "Why multiprocessing.Connection for IPC".
- LTX2_POOL_MAX_SIZE default in worker.py is 2 and the comment correctly reflects ~71 GiB per pool subprocess on B200.
- _load_generator is shared by FastVideoBackend.initialize_model (legacy path) and _pool_worker_main (pool path) so cache keys match across both.
- _pool_worker_dispatch_if_requested runs BEFORE the parent's logging/hash-check/Dynamo registration.
- child_conn.close() happens in the parent's _spawn immediately after Popen, and parent_conn is closed too on Popen failure (no fd leak).
- pass_fds=(child_conn.fileno(),) is set on the spawn invocation and the child reconstructs Connection(args.protocol_fd) (see the sketch below).
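For completeness, a sketch of the child-side reconstruction that the last checklist item refers to; the argument plumbing around it is assumed.

```python
from multiprocessing.connection import Connection

def reconstruct_protocol_conn(protocol_fd: int) -> Connection:
    """Rebuild the parent-provided Connection from the inherited fd."""
    # The fd arrives via pass_fds=(child_conn.fileno(),) on the Popen call,
    # so it is open in the child under the same number it had in the parent.
    return Connection(protocol_fd)
```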