LTX-2 Phase 2: subprocess pool for per-shape persistent generators #4
Open
Johan-de-R wants to merge 9 commits into
Conversation
…, benchmark tooling

- Adds LTX-2 video generation served via Dynamo's diffusers worker.
- Pins FastVideo upstream SHA in Dockerfile for reproducible builds.
- Re-adds enable_nats=False to DistributedRuntime ctor (lost in upstream rebase).
- Adds --served-model-name flag so the worker can advertise under a customer-facing name decoupled from the weights path.
- Adds benchmark.py flags --shuffle-seed and --num-prompts for validation runs.

Includes two superseded torch._dynamo.reset() attempts (between-shape and in-preflight) that proved insufficient for the cache order-dependence problem; the working fix is the subprocess isolation approach landed in the follow-up Phase 1 commit. See deepinfra/backend/claude_plans/2026-05-13-ltx2-cache-order-dependence.md for the journey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
torch.compile / inductor / Triton cache keys fold in accumulated in-process state, not just shape parameters. Compiling shapes A → B → C in different orders produces different on-disk keys, so a single-process warmup builds a cache that production can't reliably hit.

Fix: each shape is warmed in a fresh Python subprocess with dedicated per-shape TORCHINDUCTOR_CACHE_DIR and TRITON_CACHE_DIR, so cache keys are deterministic. Also switches the warmup model from the HF repo id to the local /data/default weights path to avoid network fetches.

Validated end-to-end: a fresh subprocess pointed at /cache/per-shape/<shape>/ hits cleanly with zero new writes; 3.5-6x speedup vs cold compile. See deepinfra/backend/claude_plans/2026-05-14-ltx2-per-shape-cache.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
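For illustration, a minimal sketch of the per-shape warmup dispatch described above. The warmup CLI flags and the shape list shown here are assumptions; only the per-shape TORCHINDUCTOR_CACHE_DIR / TRITON_CACHE_DIR isolation mirrors what the commit describes.

```python
import os
import subprocess
import sys

CACHE_ROOT = "/cache/per-shape"  # per-shape cache layout produced by Phase 1

def warm_shape(shape_key: str) -> None:
    """Compile one shape in a fresh interpreter with dedicated cache dirs."""
    shape_dir = os.path.join(CACHE_ROOT, shape_key)
    env = dict(os.environ)
    env["TORCHINDUCTOR_CACHE_DIR"] = os.path.join(shape_dir, "torchinductor")
    env["TRITON_CACHE_DIR"] = os.path.join(shape_dir, "triton")
    # Fresh process per shape: no accumulated dynamo/inductor/Triton state from
    # previously compiled shapes can leak into this shape's cache keys.
    subprocess.run(
        [sys.executable, "-m", "warmup",       # hypothetical CLI entrypoint
         "--shape", shape_key,
         "--model-path", "/data/default"],     # local weights, no HF fetch
        env=env,
        check=True,
    )

if __name__ == "__main__":
    for shape_key in ["768x512@121f"]:         # warm each shape on the menu
        warm_shape(shape_key)
```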
…Phase 2)

Routes each customer request to a persistent shape-pinned subprocess that consumes the Phase 1 per-shape compile cache. K=2 LRU pool on B200 (sized to fit two FastVideo Worker children at ~71 GiB each; not the ~25-35 GiB originally estimated from the model card). Opt-in via LTX2_POOL_MODE=1; default-off keeps the legacy in-process generator unchanged for the soft-launch period.

IPC uses a multiprocessing.Connection on a dedicated fd not derived from stdout/stderr. An earlier text-line protocol on stdout produced two whack-a-mole desync bugs in one day because FastVideo's vllm-style logger and its multiproc_executor Worker child both write log lines to stdout post-READY. Pickled-dict messages on a private fd eliminate that failure class by construction.

Validation: 7-step end-to-end pass on di-slc-39 (B200, K=2); 6 PASS + 1 documented anomaly (FastVideo's multiproc_executor terminates the outer pool subprocess as a side effect of Worker-child crashes on its own validation errors; the system recovers via respawn at ~3 min cold-compile cost, a soft-DoS surface for which followup mitigations are listed in the plan doc).

See deepinfra/backend/claude_plans/2026-05-14-ltx2-phase2-subprocess-pool.md and examples/diffusers/ARCHITECTURE.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
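A rough sketch of the Connection-on-a-private-fd spawn described above, parent side only. The worker entrypoint, flag names, and handshake shape are illustrative rather than the actual worker.py code, and the real pool also drains stdout/stderr into diagnostics instead of discarding them.

```python
import multiprocessing as mp
import subprocess
import sys

def spawn_pool_subprocess(shape_key: str):
    """Spawn a shape-pinned child and hand it a private Connection fd."""
    parent_conn, child_conn = mp.Pipe(duplex=True)
    proc = subprocess.Popen(
        [sys.executable, "-m", "worker",               # hypothetical entrypoint
         "--pool-child", "--shape", shape_key,
         "--protocol-fd", str(child_conn.fileno())],
        pass_fds=(child_conn.fileno(),),               # keep the fd open in the child
        stdout=subprocess.DEVNULL,                     # log noise never touches the protocol
        stderr=subprocess.PIPE,                        # drained separately in the real pool
    )
    child_conn.close()                                 # parent keeps only its own end
    msg = parent_conn.recv()                           # pickled dict, e.g. {"type": "READY"}
    if msg.get("type") != "READY":
        raise RuntimeError(f"unexpected handshake: {msg!r}")
    return proc, parent_conn
```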
…n decisions
RUNBOOK.md is operational ("how do I add a shape, build an image,
roll back"). ARCHITECTURE.md is the architectural companion --
"why is the code shaped this way." A future maintainer (human or
AI) opening worker.py and wondering why the IPC plumbing exists,
why subprocess isolation is needed, or why K=2 on B200 instead of
something larger will find the answers here.
Organized by topic (subprocess isolation, persistent pool, K sizing,
LRU eviction, Connection-based IPC, opt-in via LTX2_POOL_MODE,
operational costs), with a synthesized "what we tried that was
wrong" section pulling lessons from Phase 1 and Phase 2.
Evergreen (no dates in body, won't go stale from continued work),
scannable (headings, short sections), and cross-references the
dated plan documents at deepinfra/backend/claude_plans/ for
historical narrative.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```python
if drainer is not None:
    drainer.cancel()
    # Await the cancelled drainer so it finishes unwinding; suppress both the
    # CancelledError the cancellation raises and any error the drainer itself
    # was already failing with.
    with suppress(asyncio.CancelledError, Exception):
        await drainer
```
…t disconnect
Two robustness gaps caught in code review of the Phase 2 worker:
1. SubprocessPool._spawn's post-Popen block only handled the three
"expected" recv exceptions ({asyncio.TimeoutError, EOFError, OSError,
ConnectionError}), then a separate isinstance/kind check raised
RuntimeError. Anything else from _recv_msg (e.g., pickle.UnpicklingError
on corrupt wire data, or any unanticipated error from asyncio.to_thread
or task creation) would propagate without tearing down the subprocess,
the drainer tasks, or the parent end of the protocol Pipe -- an fd leak
plus an orphaned subprocess.
Restructured so all post-Popen work runs inside a single outer try
with `except BaseException: await self._kill(handle); raise`. Each
inner branch still raises RuntimeError with diag tail; outer guard
makes cleanup unconditional. BaseException catches asyncio.CancelledError
and KeyboardInterrupt too, then re-raises the original error type so
callers see what they expected.
2. _pool_worker_main exited with an uncaught BrokenPipeError /
ConnectionResetError if the parent died mid-protocol -- e.g.,
between READY and the first REQUEST, or during a DONE/ERROR send
after generate_video completed but before the response landed. Now
wrapped in an outer try that catches BrokenPipeError /
ConnectionResetError / OSError, logs to stderr, closes conn, and
exits 0. The CUDA-fault FATAL send is also wrapped in
suppress(Exception) since the parent may have already gone away
between the fault and our attempt to report it.
The parent already detects subprocess death via the stderr drainer
and recv EOF; the subprocess exit code (0 vs 1) is incidental but
"clean exit on parent disconnect" reads better in logs than "Python
traceback nobody reads".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
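A hedged sketch of the two patterns this commit describes. pool.popen / pool.kill / pool.recv_msg and handle_request are stand-ins for the real SubprocessPool internals, and the timeout value is illustrative.

```python
import asyncio
import sys

SPAWN_TIMEOUT_S = 120.0  # illustrative, not the worker.py default

async def spawn_with_unconditional_cleanup(pool, shape_key):
    """Parent side: every post-Popen step runs under one outer try."""
    handle = pool.popen(shape_key)      # stand-in for Popen + pipe + drainer setup
    try:
        msg = await asyncio.wait_for(pool.recv_msg(handle), SPAWN_TIMEOUT_S)
        if msg.get("type") != "READY":
            raise RuntimeError(f"unexpected first message: {msg!r}")
        return handle
    except BaseException:
        # Catches CancelledError/KeyboardInterrupt as well as RuntimeError and
        # surprises like pickle.UnpicklingError: tear down the subprocess, the
        # drainer tasks, and the parent pipe end, then re-raise the original error.
        await pool.kill(handle)
        raise

def pool_worker_main(conn, handle_request):
    """Child side: exit 0 quietly if the parent disappears mid-protocol."""
    try:
        while True:
            request = conn.recv()       # raises EOFError once the parent closes its end
            conn.send({"type": "DONE", "result": handle_request(request)})
    except (EOFError, BrokenPipeError, ConnectionResetError, OSError) as exc:
        print(f"parent disconnected: {exc}", file=sys.stderr)
    finally:
        conn.close()
    sys.exit(0)
```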
Four small surgical fixes flagged by the CodeQL scan on PR #4. None change behavior; each just narrows what the static analyzer has to prove.

- worker.py (_kill): swap the bare `try: proc.terminate() except ProcessLookupError: pass` for `with suppress(ProcessLookupError): proc.terminate()`. Matches the style we already use for the SIGKILL fallback in the same function.
- warmup.py (_du_bytes): same swap; the file may legitimately vanish between os.walk listing it and stat reading it, so suppress(OSError) is exactly the intent. A comment makes the why explicit. Adds `import contextlib`.
- benchmark.py: wrap the csv-writer block in a `with open(...) as csv_file:` so the file handle is guaranteed closed even if the generation loop raises. Removes the explicit `csv_file.close()` at the bottom and indents the loop body.
- test_warmup_shapes.py: pre-declare `_compute_menu_hash = None` before the try/import. pytest.skip() raises, so the call site is unreachable on the ImportError path, but CodeQL doesn't model pytest's behavior. A one-line shim silences the warning without changing test semantics.

The two false-positive findings (worker.py _kill drainer-await flagged as "no effect") will be dismissed in the GitHub UI after this push triggers the re-scan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
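For reference, a minimal sketch of the suppress() swaps described in the first two bullets; the surrounding function bodies are simplified stand-ins, not the actual worker.py / warmup.py code.

```python
import contextlib
import os

def _kill_like(proc) -> None:
    # Same behavior as `try: proc.terminate() except ProcessLookupError: pass`,
    # but the intent is explicit and easy for a static analyzer to follow.
    with contextlib.suppress(ProcessLookupError):
        proc.terminate()

def _du_bytes_like(root: str) -> int:
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # A cache file may legitimately vanish between the os.walk listing
            # and the stat call; skipping it is the intended behavior.
            with contextlib.suppress(OSError):
                total += os.stat(os.path.join(dirpath, name)).st_size
    return total
```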
…s, requests, latencies, failures)
Exposes seven ltx2_pool_* series via a dedicated CollectorRegistry that
is wired into Dynamo's existing /metrics endpoint through
register_engine_metrics_callback. The callback registers
unconditionally so series exist (zero-valued) even when
LTX2_POOL_MODE=0, which keeps dashboards stable across the rollout
toggle. Dynamo auto-injects dynamo_namespace, dynamo_component,
dynamo_endpoint, worker_id, and model labels so these series carry
the same hierarchy labels as vLLM/SGLang worker metrics.
Metric → operational question:

ltx2_pool_size — current live subprocesses (gauge for capacity tracking)
ltx2_pool_spawn_total{shape_key} — spawn rate per shape (churn signal)
ltx2_pool_eviction_total{shape_key} — LRU eviction rate (pool-too-small signal)
ltx2_pool_request_total{shape_key,status} — DONE/ERROR/FATAL counts
ltx2_pool_request_latency_seconds{shape_key} — end-to-end latency, DONE path
ltx2_pool_cold_spawn_seconds{shape_key} — fork → READY (cold-start budget)
ltx2_pool_subprocess_failure_total{reason} — failure causes (cuda_fault, spawn_timeout, spawn_eof, spawn_parse_error, gen_timeout, gen_eof, desync, parse_error, send_failed)
Pool subprocesses do not have a Dynamo endpoint and therefore do not
register a callback; all accounting happens in the parent. The
FATAL-response branch in route() is the single point that records
cuda_fault, so subprocess-side state changes are observed exactly
once.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
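A sketch of the dedicated-registry wiring using prometheus_client, showing four of the seven series. The register_engine_metrics_callback hookup into Dynamo's /metrics endpoint is omitted, and the inc() call at the end is purely an illustrative usage example.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# Dedicated registry for the pool series; Dynamo wiring happens elsewhere.
POOL_REGISTRY = CollectorRegistry()

POOL_SIZE = Gauge(
    "ltx2_pool_size", "Current live pool subprocesses", registry=POOL_REGISTRY)
SPAWN_TOTAL = Counter(
    "ltx2_pool_spawn_total", "Pool subprocess spawns", ["shape_key"],
    registry=POOL_REGISTRY)
REQUEST_LATENCY = Histogram(
    "ltx2_pool_request_latency_seconds", "End-to-end request latency (DONE path)",
    ["shape_key"], registry=POOL_REGISTRY)
SUBPROCESS_FAILURE = Counter(
    "ltx2_pool_subprocess_failure_total", "Subprocess failure causes", ["reason"],
    registry=POOL_REGISTRY)

# Creating the metric objects unconditionally keeps the series present
# (zero-valued) even when LTX2_POOL_MODE=0, so dashboards stay stable.
SPAWN_TOTAL.labels(shape_key="768x512@121f").inc()  # example: record one spawn
```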
…sh orphan window

A pool subprocess loads FastVideo's multiproc_executor, which spawns a Worker child that pins ~70 GiB of GPU memory for the model. If the parent worker process dies mid-spawn (or mid-generation), the subprocess and its descendants are reparented to PID 1 and continue holding the GPU until container teardown -- on k8s, that's the liveness-probe-driven pod restart window, during which the node is effectively wedged.

prctl(PR_SET_PDEATHSIG, SIGTERM) asks the kernel to deliver SIGTERM to the subprocess at the moment its parent thread dies. The existing SIGTERM handler closes the protocol pipe and calls sys.exit(0), which propagates SIGTERM to the FastVideo Worker child via normal process-group teardown, freeing the GPU before the kernel can decide the node needs intervention.

Linux-only; failure (e.g. running on macOS in CI, or libc.so.6 absent) is logged to stderr and the subprocess keeps running -- the guard is defense-in-depth, not load-bearing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
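A minimal sketch of the parent-death-signal guard described above, assuming the standard ctypes route to prctl; the log wording is illustrative.

```python
import ctypes
import signal
import sys

PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>

def install_pdeathsig(sig: int = signal.SIGTERM) -> None:
    """Ask the kernel to SIGTERM this process when its parent thread dies."""
    try:
        libc = ctypes.CDLL("libc.so.6", use_errno=True)
        if libc.prctl(PR_SET_PDEATHSIG, sig, 0, 0, 0) != 0:
            raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")
    except Exception as exc:  # e.g. macOS in CI, or libc.so.6 absent
        # Defense-in-depth, not load-bearing: log and keep running.
        print(f"pdeathsig guard unavailable: {exc}", file=sys.stderr)
```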
… RUNBOOK

Pool metrics flow through Dynamo's worker-side system_status_server, which is disabled by default (DYN_SYSTEM_PORT=-1). Without explicit deployment config, our registered callback is silently dormant and operators see zero pool visibility -- the frontend's /metrics on the HTTP port does not aggregate worker hierarchies. Validated this gap in the Phase 2 e2e smoke; the misconfiguration would otherwise have shipped to production unnoticed.

A two-line guard in backend_worker emits a WARNING at startup when DYN_SYSTEM_PORT is unset or non-numeric (defensive int parse), and an INFO line confirming the port when it's enabled. The misconfiguration now surfaces in pod startup logs before any traffic flows.

RUNBOOK gets a new ## Metrics section after ## Known limits naming the seven ltx2_pool_* series, the auto-injected Dynamo hierarchy labels, the required deployment env (DYN_SYSTEM_PORT, DYN_SYSTEM_HOST, k8s containerPort, scrape config), and four recommended Prometheus alert expressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
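A sketch of the startup guard described above; the exact placement inside backend_worker and the log wording will differ.

```python
import logging
import os

logger = logging.getLogger(__name__)

def check_system_port() -> None:
    """Warn at startup if the metrics port is disabled; confirm it when set."""
    raw = os.environ.get("DYN_SYSTEM_PORT", "")
    try:
        port = int(raw)
    except ValueError:
        port = -1  # unset or non-numeric is treated as disabled
    if port <= 0:
        logger.warning(
            "DYN_SYSTEM_PORT is disabled or unset (%r); ltx2_pool_* metrics "
            "will not be scraped", raw)
    else:
        logger.info("system_status_server metrics enabled on port %d", port)
```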
Summary

Ships an opt-in subprocess pool that routes each customer request to a persistent shape-pinned Python subprocess. The pool consumes the per-shape compile-cache layout that Phase 1 produced under /cache/per-shape/<shape_key>/{torchinductor,triton}, amortizing the ~3 minute cold-start cost (Python boot + torch import + model load + cache hydrate + first-generate compile retrace) across all subsequent requests for the same shape. Pool mode is opt-in via LTX2_POOL_MODE=1; default-off keeps the legacy in-process generator unchanged for the soft-launch period.

The architectural decisions and lessons learned across the implementation are captured in examples/diffusers/ARCHITECTURE.md.

What's in this PR
Four commits, squashed from the original 15-commit development history to keep review focused. Each commit is a coherent reviewable unit; the full journey (including superseded attempts) is preserved in the companion deepinfra/backend PR's plan documents.
bb8f89ff796cbf11ac7a202c596ca1eab016
Total: 13 files changed, +3361 / −90. All under examples/diffusers/.

Commit 1 (Setup) adds the LTX-2 pipeline scaffolding: Dynamo diffusers worker, FastVideo pin, --served-model-name flag, benchmark tooling, RUNBOOK. Also folds in two superseded torch._dynamo.reset() attempts that proved insufficient for the cache order-dependence problem; full journey in deepinfra/backend/claude_plans/2026-05-13-ltx2-cache-order-dependence.md.

Commit 2 (Phase 1) introduces subprocess-per-shape warmup with per-shape TORCHINDUCTOR_CACHE_DIR and TRITON_CACHE_DIR. The fix for the order-dependent compile cache identified in Setup. Full narrative in deepinfra/backend/claude_plans/2026-05-14-ltx2-per-shape-cache.md.

Commit 3 (Phase 2) is the headline change: opt-in K=2 LRU subprocess pool with multiprocessing.Connection IPC. Includes the four rounds of implementation evolution (initial pool, fd-isolation attempt, Connection redesign, K=5→K=2 correction) folded together. Full narrative, including what was tried and rejected, in deepinfra/backend/claude_plans/2026-05-14-ltx2-phase2-subprocess-pool.md.

Commit 4 (ARCHITECTURE.md) is the evergreen design-rationale document that lives next to the code. Explains the why (subprocess isolation, persistent pool, K=2, LRU, Connection IPC, opt-in flag, operational costs) so a future maintainer reading worker.py doesn't have to dig through plan docs to understand the constraints.
Why subprocess isolation

torch.compile / inductor / Triton cache keys fold in accumulated in-process state, not just shape parameters. A single Python interpreter compiling shapes A → B → C produces different on-disk keys than B → A → C, and torch._dynamo.reset() does not clear the Triton autotuner / CUDA module / other kernel state that the key incorporates. Subprocess-per-shape with dedicated per-shape TORCHINDUCTOR_CACHE_DIR and TRITON_CACHE_DIR makes keys deterministic and lets a fresh production worker hit the cache baked by warmup. Full explanation in ARCHITECTURE.md § "Why subprocess isolation per shape".
Opt-in via LTX2_POOL_MODE

This PR ships pool code dormant. The legacy in-process path is default-on; flipping the flag is a separate model-config change. The two paths share _load_generator(model_name, num_gpus, enable_optimizations) so cache keys match byte-for-byte. After soft launch stabilizes, a follow-up commit flips the default and deletes the legacy path.
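A minimal sketch of the opt-in routing. pool.route() is named elsewhere in this PR; the create_video signature and the legacy_generator name here are assumptions.

```python
import os

POOL_MODE = os.environ.get("LTX2_POOL_MODE", "0") == "1"

async def create_video(request, pool, legacy_generator):
    """Route to the subprocess pool only when the flag is set."""
    if POOL_MODE:
        return await pool.route(request)              # shape-pinned subprocess path
    return await legacy_generator.generate(request)   # unchanged in-process path
```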
Test plan

Validated end-to-end on di-slc-39 (B200, GPU GPU-5f3391b7…d59d) through the full 7-step pool validation specified in backend/claude_plans/2026-05-14-ltx2-phase2-subprocess-pool.md. Post-squash Step 1 re-run produced a byte-identical MP4 (381,322 bytes) to the pre-squash round-4 validation, confirming zero behavioral drift.

Notable step outcomes:

- 768x512@121f cold spawn: multiproc_executor terminates the outer pool subprocess as a side effect of its Worker child crashing on StageVerificationError. System recovers via respawn at ~3 min cold-compile cost. See "Risk" below.
- kill -9 on pool subprocess: _get_or_spawn cleaned up + respawned cleanly, READY 25.9s, first-request 188s, HTTP 200 wall 214s. Valid MP4.
- LTX2_POOL_MODE unset: pool_mode=False, 0 spawns, server-side generation completed in 446s. The 446s vs 200s pool-mode delta on the same shape reconfirms the Phase 1 cache contract value.

Reproducibility: all per-step request payloads, response files, and worker.log slices preserved on di-slc-39 under /tmp/ltx2-phase2-validate/ (rounds 2-4 logs at logs.round{2,3,4}/, post-squash re-run at logs/).

Risk
Soft-DoS surface (from Step 5 finding). A customer (intentional or otherwise) sending malformed requests can keep one pool slot in permanent respawn at ~3 minutes per bad request. FastVideo's multiproc_executor does not survive its own StageVerificationError (or similar Worker-child crashes), so the pool subprocess dies even though our ERROR code path handles the protocol correctly. K=2 means there is little headroom. Mitigations are on the followup list (below). For the soft-launch period, where traffic is small and operators can be paged on slot churn, the risk is acceptable; harden before broad rollout.
Followups

Tracked in backend/claude_plans/2026-05-14-ltx2-phase2-subprocess-pool.md under "What we explicitly defer":

- Input validation in FastVideoBackend.create_video: reject num_inference_steps <= 0, num_frames <= 0, guidance_scale <= 0, etc. before the pool route. Primary mitigation for the soft-DoS surface.
- Can multiproc_executor be configured to respawn its Worker child rather than terminate its parent on Worker crash? If yes, drops cost-per-bad-request from ~3 min to milliseconds.
- … on real traffic distribution data).
- … stabilizes; deletes legacy in-process path).
- … per-shape memory).
- ai-dynamo 1.0.2 with Phase 2 worker.py + Connection IPC: Dockerfile inherits upstream's 1.0.0 → 1.0.2 bump between fork-point and merge. Phase 2 validation ran against the pre-bake image, which uses the older ai-dynamo; the next image bake will pull 1.0.2. Smoke-test for DistributedRuntime / register_llm API drift before tagging the new ship image.

Cross-ref

deepinfra/backend#2806
--served-model-name flag and adding the videogen startup probe: deepinfra/backend#2805

Reviewer checklist

- _pool_worker_main and SubprocessPool IPC contract matches ARCHITECTURE.md § "Why multiprocessing.Connection for IPC".
- LTX2_POOL_MAX_SIZE default in worker.py is 2 and the comment correctly reflects ~71 GiB per pool subprocess on B200.
- _load_generator is shared by FastVideoBackend.initialize_model (legacy path) and _pool_worker_main (pool path) so cache keys match across both.
- _pool_worker_dispatch_if_requested runs BEFORE the parent's logging/hash-check/Dynamo registration.
- child_conn.close() happens in the parent's _spawn immediately after Popen, and parent_conn is closed too on Popen failure (no fd leak).
- pass_fds=(child_conn.fileno(),) is set on the spawn invocation and the child reconstructs Connection(args.protocol_fd) (see the sketch below).
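For completeness, a sketch of the child-side reconstruction that the last checklist item refers to; the argument plumbing around it is assumed.

```python
from multiprocessing.connection import Connection

def reconstruct_protocol_conn(protocol_fd: int) -> Connection:
    """Rebuild the parent-provided Connection from the inherited fd."""
    # The fd arrives via pass_fds=(child_conn.fileno(),) on the Popen call,
    # so it is open in the child under the same number it had in the parent.
    return Connection(protocol_fd)
```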