LTX-2 Phase 2: subprocess pool for per-shape persistent generators#4

Open
Johan-de-R wants to merge 9 commits into main from claude/continue-work-tTuL8

Conversation

Johan-de-R commented on May 15, 2026

Summary

Ships an opt-in subprocess pool that routes each customer request to a
persistent shape-pinned Python subprocess. The pool consumes the per-shape
compile-cache layout that Phase 1 produced under
/cache/per-shape/<shape_key>/{torchinductor,triton}, amortizing the
~3-minute cold-start cost (Python boot + torch import + model load +
cache hydrate + first-generate compile retrace) across all subsequent
requests for the same shape. Pool mode is opt-in via LTX2_POOL_MODE=1;
default-off keeps the legacy in-process generator unchanged for the
soft-launch period.

The architectural decisions and lessons learned across the implementation
are captured in examples/diffusers/ARCHITECTURE.md.

What's in this PR

Four commits, squashed from the original 15-commit development history
to keep review focused. Each commit is a coherent reviewable unit; the
full journey (including superseded attempts) is preserved in the
companion deepinfra/backend PR's plan documents.

| # | SHA | Subject | Files | +/− |
|---|-----------|---------|-------|-------------|
| 1 | bb8f89ff7 | LTX-2 video pipeline setup: initial Dynamo integration, FastVideo pin, benchmark tooling | 12 | +2180 / −39 |
| 2 | 96cbf11ac | warmup: per-shape isolated cache dirs via subprocess driver (Phase 1) | 2 | +123 / −58 |
| 3 | 7a202c596 | worker: opt-in subprocess pool with per-shape persistent generators (Phase 2) | 1 | +842 / −75 |
| 4 | ca1eab016 | examples/diffusers: add ARCHITECTURE.md explaining LTX-2 worker design decisions | 1 | +298 |

Total: 13 files changed, +3361 / −90. All under examples/diffusers/.

Commit 1 (Setup) adds the LTX-2 pipeline scaffolding: Dynamo
diffusers worker, FastVideo pin, --served-model-name flag, benchmark
tooling, RUNBOOK. Also folds in two superseded torch._dynamo.reset()
attempts that proved insufficient for the cache order-dependence
problem; full journey in
deepinfra/backend/claude_plans/2026-05-13-ltx2-cache-order-dependence.md.

Commit 2 (Phase 1) introduces subprocess-per-shape warmup with
per-shape TORCHINDUCTOR_CACHE_DIR and TRITON_CACHE_DIR. The fix
for the order-dependent compile cache identified in Setup. Full
narrative in
deepinfra/backend/claude_plans/2026-05-14-ltx2-per-shape-cache.md.

Commit 3 (Phase 2) is the headline change: opt-in K=2 LRU subprocess
pool with multiprocessing.Connection IPC. Includes the four rounds of
implementation evolution (initial pool, fd-isolation attempt,
Connection redesign, K=5→K=2 correction) folded together. Full narrative
including what was tried and rejected in
deepinfra/backend/claude_plans/2026-05-14-ltx2-phase2-subprocess-pool.md.

Commit 4 (ARCHITECTURE.md) is the evergreen design-rationale
document that lives next to the code. Explains why (subprocess
isolation, persistent pool, K=2, LRU, Connection IPC, opt-in flag,
operational costs) so a future maintainer reading worker.py doesn't
have to dig through plan docs to understand the constraints.

Why subprocess isolation

torch.compile / inductor / Triton cache keys incorporate accumulated
in-process state, not just shape parameters. A single Python interpreter
compiling shapes A → B → C produces different on-disk keys than B → A → C,
and torch._dynamo.reset() does not clear the Triton autotuner / CUDA
module / other kernel state that the key incorporates. Subprocess-per-shape
with dedicated per-shape TORCHINDUCTOR_CACHE_DIR and TRITON_CACHE_DIR
makes keys deterministic and lets a fresh production worker hit the cache
baked by warmup. Full explanation in ARCHITECTURE.md § "Why subprocess
isolation per shape".
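
A minimal sketch of the isolation pattern, assuming a hypothetical warm_one_shape driver and a --shape flag on warmup.py (the real driver in this PR may be shaped differently):

```python
import os
import subprocess
import sys

def warm_one_shape(shape_key: str, cache_root: str = "/cache/per-shape") -> None:
    """Warm one shape in a fresh interpreter so its compile-cache keys depend
    only on the shape, never on what was compiled earlier in the process."""
    shape_dir = os.path.join(cache_root, shape_key)
    env = dict(os.environ)
    env["TORCHINDUCTOR_CACHE_DIR"] = os.path.join(shape_dir, "torchinductor")
    env["TRITON_CACHE_DIR"] = os.path.join(shape_dir, "triton")
    # Fresh Python process per shape: no accumulated Dynamo/Triton/CUDA state
    # can leak into the keys written for this shape or the next one.
    subprocess.run(
        [sys.executable, "warmup.py", "--shape", shape_key],  # hypothetical flag
        env=env,
        check=True,
    )
```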

Opt-in via LTX2_POOL_MODE

This PR ships pool code dormant. The legacy in-process path is
default-on; flipping the flag is a separate model-config change. The two
paths share _load_generator(model_name, num_gpus, enable_optimizations)
so cache keys match byte-for-byte. After soft launch stabilizes, a
follow-up commit flips the default and deletes the legacy path.
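
A minimal sketch of the toggle; LTX2_POOL_MODE, _load_generator, and route() are named in this PR, while the backend attributes and dispatch shape here are illustrative:

```python
import os

def pool_mode_enabled() -> bool:
    # Dormant by default: only an explicit LTX2_POOL_MODE=1 enables the pool path.
    return os.environ.get("LTX2_POOL_MODE") == "1"

async def create_video(backend, request):
    if pool_mode_enabled():
        # Pool path: hand the request to a persistent shape-pinned subprocess.
        # The subprocess loads its model through the same _load_generator(...),
        # so per-shape compile-cache keys match the legacy path byte-for-byte.
        return await backend.pool.route(request)
    # Legacy path (default-on): in-process generator, unchanged for soft launch.
    return backend.generator.generate_video(request)
```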

Test plan

Validated end-to-end on di-slc-39 (B200, GPU GPU-5f3391b7…d59d) through
the full 7-step pool validation specified in
backend/claude_plans/2026-05-14-ltx2-phase2-subprocess-pool.md.
Post-squash Step 1 re-run produced a byte-identical MP4 (381,322 bytes)
to the pre-squash round-4 validation, confirming zero behavioral drift.

| Step | Description | Verdict | Notes |
|------|-------------|---------|-------|
| 1 | Smoke test, single shape 768x512@121f cold spawn | PASS | HTTP 200, 206s wall, MP4 381KB; per-shape cache hit confirmed (READY 24.4s). |
| 2 | Persistence (same shape re-request) | PASS | 7s wall, 0 respawns. Subprocess reused. |
| 3 | Multi-shape, fill pool to K=2 | PASS | New shape cold spawn 201s, pool 1→2, 0 evictions. |
| 4 | LRU eviction at K=2 | PASS | Third distinct shape triggered eviction of LRU (idle 246s, correct choice). 1 spawn, 1 eviction, 0 SIGKILL. Pool stays at 2. |
| 5 | Soft-error stays alive | ANOMALY (documented) | Pool's ERROR code path fires correctly; FastVideo's multiproc_executor terminates the outer pool subprocess as a side effect of its Worker child crashing on StageVerificationError. System recovers via respawn at ~3 min cold-compile cost. See "Risk" below. |
| 6 | Hard-error respawn (kill -9 on pool subprocess) | PASS | Parent detected dead handle via SIGCHLD, _get_or_spawn cleaned up + respawned cleanly, READY 25.9s, first-request 188s, HTTP 200 wall 214s. Valid MP4. |
| 7 | Legacy path (LTX2_POOL_MODE unset) | PASS | pool_mode=False, 0 spawns, server-side generation completed in 446s. The 446s vs 200s pool-mode delta on the same shape reconfirms the Phase 1 cache contract value. |

Reproducibility: all per-step request payloads, response files, and
worker.log slices preserved on di-slc-39 under
/tmp/ltx2-phase2-validate/ (rounds 2-4 logs at logs.round{2,3,4}/,
post-squash re-run at logs/).

Risk

Soft-DoS surface (from the Step 5 finding). A customer (intentionally or
otherwise) sending malformed requests can keep one pool slot in permanent
respawn at ~3 minutes per bad request. FastVideo's multiproc_executor
does not survive its own StageVerificationError (or similar Worker-child
crashes), so the pool subprocess dies even though our ERROR code path
handles the protocol correctly. K=2 means there is little headroom.

Mitigations are on the followup list (below). For the soft-launch period
where traffic is small and operators can be paged on slot churn, the
risk is acceptable; harden before broad rollout.

Followups

Tracked in backend/claude_plans/2026-05-14-ltx2-phase2-subprocess-pool.md
under "What we explicitly defer":

  • Parent-side input parameter validation in
    FastVideoBackend.create_video.
    Reject num_inference_steps <= 0,
    num_frames <= 0, guidance_scale <= 0, etc. before the pool route.
    Primary mitigation for the soft-DoS surface (sketched after this list).
  • FastVideo upstream investigation. Can multiproc_executor be
    configured to respawn its Worker child rather than terminate its
    parent on Worker crash? If yes, drops cost-per-bad-request from ~3 min
    to milliseconds.
  • Eager-spawn top-N shapes by observed traffic (post-launch, gated
    on real traffic distribution data).
  • Traffic-data-informed menu pruning (post-launch).
  • Default-on pool mode (follow-up commit after soft launch
    stabilizes; deletes legacy in-process path).
  • Prometheus metrics (pool_size, spawn_count, eviction_count,
    per-shape memory).
  • Pre-bake compatibility check: validate ai-dynamo 1.0.2 with
    Phase 2 worker.py + Connection IPC.
    Dockerfile inherits upstream's
    1.0.0 → 1.0.2 bump between fork-point and merge. Phase 2 validation
    ran against the pre-bake image which uses the older ai-dynamo;
    the next image bake will pull 1.0.2. Smoke-test for
    DistributedRuntime / register_llm API drift before tagging the
    new ship image.
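
A minimal sketch of the first follow-up (parent-side validation before the pool route); the request field names are assumptions, not the actual schema:

```python
def validate_video_request(req) -> None:
    """Reject malformed parameters in the parent, before the pool route,
    so one bad request can't cost a ~3 min subprocess respawn."""
    if req.num_inference_steps <= 0:
        raise ValueError("num_inference_steps must be positive")
    if req.num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if req.guidance_scale <= 0:
        raise ValueError("guidance_scale must be positive")
```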

Cross-ref

  • Companion PR with the historical plan-doc narrative: deepinfra/backend#2806
  • Companion PR with the model-manager plumbing the dynamo --served-model-name
    flag and adding the videogen startup probe: deepinfra/backend#2805

Reviewer checklist

  • Confirm _pool_worker_main and SubprocessPool IPC contract
    matches ARCHITECTURE.md § "Why multiprocessing.Connection for IPC".
  • Confirm LTX2_POOL_MAX_SIZE default in worker.py is 2 and
    the comment correctly reflects ~71 GiB per pool subprocess on B200.
  • Confirm _load_generator is shared by FastVideoBackend.initialize_model
    (legacy path) and _pool_worker_main (pool path) so cache keys match
    across both.
  • Confirm _pool_worker_dispatch_if_requested runs BEFORE the
    parent's logging/hash-check/Dynamo registration.
  • Confirm child_conn.close() happens in the parent's _spawn
    immediately after Popen, and that parent_conn is closed too on
    Popen failure (no fd leak).
  • Confirm pass_fds=(child_conn.fileno(),) is set on the spawn
    invocation and the child reconstructs Connection(args.protocol_fd); both sides are sketched below.
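
For reference while reviewing the last two items, a minimal sketch of the fd-passing pattern; the CLI plumbing and helper names here are illustrative, not the exact worker.py code:

```python
import argparse
import subprocess
import sys
from multiprocessing import Pipe
from multiprocessing.connection import Connection

def spawn_pool_child(worker_script: str):
    """Parent side: hand the child end of the pipe to the subprocess on a
    dedicated inherited fd, then drop our copy so EOF can propagate."""
    parent_conn, child_conn = Pipe()
    try:
        proc = subprocess.Popen(
            [sys.executable, worker_script,
             "--protocol-fd", str(child_conn.fileno())],
            pass_fds=(child_conn.fileno(),),  # keep this fd open, same number, in the child
        )
    except Exception:
        parent_conn.close()   # no fd leak if Popen itself fails
        child_conn.close()
        raise
    child_conn.close()        # parent must not hold the child end
    return proc, parent_conn

def child_main() -> None:
    """Child side: rebuild a Connection around the inherited fd and speak
    pickled dicts on it, never stdout/stderr."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--protocol-fd", type=int, required=True)
    args = parser.parse_args()
    conn = Connection(args.protocol_fd)
    conn.send({"type": "READY"})
```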

claude and others added 4 commits May 15, 2026 00:13
LTX-2 video pipeline setup: initial Dynamo integration, FastVideo pin, benchmark tooling

- Adds LTX-2 video generation served via Dynamo's diffusers worker.
- Pins FastVideo upstream SHA in Dockerfile for reproducible builds.
- Re-adds enable_nats=False to DistributedRuntime ctor (lost in upstream rebase).
- Adds --served-model-name flag so the worker can advertise under a customer-facing name decoupled from the weights path.
- Adds benchmark.py flags --shuffle-seed and --num-prompts for validation runs.

Includes two superseded torch._dynamo.reset() attempts (between-shape and in-preflight) that proved insufficient for the cache order-dependence problem; the working fix is the subprocess isolation approach landed in the follow-up Phase 1 commit. See deepinfra/backend/claude_plans/2026-05-13-ltx2-cache-order-dependence.md for the journey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
warmup: per-shape isolated cache dirs via subprocess driver (Phase 1)

torch.compile / inductor / Triton cache keys fold in accumulated in-process state, not just shape parameters. Compiling shapes A → B → C in different orders produces different on-disk keys, so a single-process warmup builds a cache that production can't reliably hit.

Fix: each shape is warmed in a fresh Python subprocess with dedicated per-shape TORCHINDUCTOR_CACHE_DIR and TRITON_CACHE_DIR, so cache keys are deterministic. Also switches the warmup model from the HF repo id to the local /data/default weights path to avoid network fetches.

Validated end-to-end: a fresh subprocess pointed at /cache/per-shape/<shape>/ hits cleanly with zero new writes; 3.5-6x speedup vs cold compile.

See deepinfra/backend/claude_plans/2026-05-14-ltx2-per-shape-cache.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
worker: opt-in subprocess pool with per-shape persistent generators (Phase 2)

Routes each customer request to a persistent shape-pinned subprocess that consumes the Phase 1 per-shape compile cache. K=2 LRU pool on B200 (sized to fit two FastVideo Worker children at ~71 GiB each; not the ~25-35 GiB originally estimated from the model card). Opt-in via LTX2_POOL_MODE=1; default-off keeps the legacy in-process generator unchanged for the soft-launch period.

IPC uses multiprocessing.Connection on a dedicated fd not derived from stdout/stderr. An earlier text-line protocol on stdout produced two whack-a-mole desync bugs in one day due to FastVideo's vllm-style logger and its multiproc_executor Worker child both writing log lines to stdout post-READY. Pickled-dict messages on a private fd eliminate that failure class by construction.
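
A minimal sketch of what the child side of such a pickled-dict protocol can look like; the message fields are illustrative, not the exact wire format in worker.py:

```python
def serve(conn, generator) -> None:
    """Child-side loop: every protocol message is a pickled dict on the private
    Connection, so stray log lines on stdout/stderr can never desync it."""
    conn.send({"type": "READY"})
    while True:
        msg = conn.recv()            # blocks; raises EOFError if the parent closes
        if msg["type"] != "REQUEST":
            break
        try:
            out = generator.generate_video(**msg["params"])   # illustrative call
            conn.send({"type": "DONE", "output": out})
        except Exception as exc:     # soft error: report it and stay alive
            conn.send({"type": "ERROR", "message": str(exc)})
```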

Validation: 7-step end-to-end pass on di-slc-39 (B200, K=2). 6 PASS + 1 documented anomaly (FastVideo's multiproc_executor terminates the outer pool subprocess as a side effect of Worker-child crashes on its own validation errors; system recovers via respawn at ~3 min cold-compile cost — soft-DoS surface for which followup mitigations are listed in the plan doc).

See deepinfra/backend/claude_plans/2026-05-14-ltx2-phase2-subprocess-pool.md and examples/diffusers/ARCHITECTURE.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
examples/diffusers: add ARCHITECTURE.md explaining LTX-2 worker design decisions

RUNBOOK.md is operational ("how do I add a shape, build an image,
roll back"). ARCHITECTURE.md is the architectural companion --
"why is the code shaped this way." A future maintainer (human or
AI) opening worker.py and wondering why the IPC plumbing exists,
why subprocess isolation is needed, or why K=2 on B200 instead of
something larger will find the answers here.

Organized by topic (subprocess isolation, persistent pool, K sizing,
LRU eviction, Connection-based IPC, opt-in via LTX2_POOL_MODE,
operational costs), with a synthesized "what we tried that was
wrong" section pulling lessons from Phase 1 and Phase 2.

Evergreen (no dates in body, won't go stale from continued work),
scannable (headings, short sections), and cross-references the
dated plan documents at deepinfra/backend/claude_plans/ for
historical narrative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the documentation label on May 15, 2026
Comment thread examples/diffusers/test_warmup_shapes.py Fixed
Comment thread examples/diffusers/benchmark.py Fixed
Comment thread examples/diffusers/warmup.py Fixed
Comment thread examples/diffusers/worker.py Fixed
    if drainer is not None:
        drainer.cancel()
        with suppress(asyncio.CancelledError, Exception):
            await drainer
Johan-de-R and others added 5 commits May 15, 2026 01:56
…t disconnect

Two robustness gaps caught in code review of the Phase 2 worker:

1. SubprocessPool._spawn's post-Popen block only handled the
   "expected" recv exceptions ({asyncio.TimeoutError, EOFError, OSError,
   ConnectionError}), then a separate isinstance/kind check raised
   RuntimeError. Anything else from _recv_msg (e.g., pickle.UnpicklingError
   on corrupt wire data, or any unanticipated error from asyncio.to_thread
   or task creation) would propagate without tearing down the subprocess,
   the drainer tasks, or the parent end of the protocol Pipe. fd leak +
   orphaned subprocess.

   Restructured so all post-Popen work runs inside a single outer try
   with `except BaseException: await self._kill(handle); raise`; the
   shape is sketched after this list. Each
   inner branch still raises RuntimeError with diag tail; outer guard
   makes cleanup unconditional. BaseException catches asyncio.CancelledError
   and KeyboardInterrupt too, then re-raises the original error type so
   callers see what they expected.

2. _pool_worker_main exited with an uncaught BrokenPipeError /
   ConnectionResetError if the parent died mid-protocol -- e.g.,
   between READY and the first REQUEST, or during a DONE/ERROR send
   after generate_video completed but before the response landed. Now
   wrapped in an outer try that catches BrokenPipeError /
   ConnectionResetError / OSError, logs to stderr, closes conn, and
   exits 0. The CUDA-fault FATAL send is also wrapped in
   suppress(Exception) since the parent may have already gone away
   between the fault and our attempt to report it.

   The parent already detects subprocess death via the stderr drainer
   and recv EOF; the subprocess exit code (0 vs 1) is incidental but
   "clean exit on parent disconnect" reads better in logs than "Python
   traceback nobody reads".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four small surgical fixes flagged by the CodeQL scan on PR #4. None
change behavior; each just narrows what the static analyzer has to
prove.

- worker.py (_kill): swap the bare `try: proc.terminate()
  except ProcessLookupError: pass` for `with suppress(ProcessLookupError):
  proc.terminate()`. Matches the style we already use for the SIGKILL
  fallback in the same function (the idiom is sketched after this list).
- warmup.py (_du_bytes): same swap; the file may legitimately vanish
  between os.walk listing it and stat reading it, so suppress(OSError)
  is exactly the intent. Comment makes the why explicit. Adds
  `import contextlib`.
- benchmark.py: wrap the csv-writer block in a `with open(...) as
  csv_file:` so the file handle is guaranteed closed even if the
  generation loop raises. Removes the explicit `csv_file.close()` at
  the bottom and indents the loop body.
- test_warmup_shapes.py: pre-declare `_compute_menu_hash = None`
  before the try/import. pytest.skip() raises so the call site is
  unreachable on the ImportError path, but CodeQL doesn't model
  pytest's behavior. One-line shim shuts the warning up without
  changing test semantics.
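
A minimal sketch of the contextlib.suppress idiom behind the first two items, with simplified stand-ins for the real _kill and _du_bytes code:

```python
import os
from contextlib import suppress

def safe_terminate(proc) -> None:
    # Same intent as `try: proc.terminate() except ProcessLookupError: pass`,
    # but explicit and easier for a static analyzer to reason about.
    with suppress(ProcessLookupError):   # the process may already have exited
        proc.terminate()

def du_bytes(root: str) -> int:
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            # A file may legitimately vanish between os.walk listing it
            # and os.stat reading it.
            with suppress(OSError):
                total += os.stat(os.path.join(dirpath, name)).st_size
    return total
```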

The two false-positive findings (worker.py _kill drainer-await flagged
as "no effect") will be dismissed in the GitHub UI after this push
triggers the re-scan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s, requests, latencies, failures)

Exposes seven ltx2_pool_* series via a dedicated CollectorRegistry that
is wired into Dynamo's existing /metrics endpoint through
register_engine_metrics_callback. The callback registers
unconditionally so series exist (zero-valued) even when
LTX2_POOL_MODE=0, which keeps dashboards stable across the rollout
toggle. Dynamo auto-injects dynamo_namespace, dynamo_component,
dynamo_endpoint, worker_id, and model labels so these series carry
the same hierarchy labels as vLLM/SGLang worker metrics.

Metric → operational question:
  ltx2_pool_size                       — current live subprocesses (gauge for capacity tracking)
  ltx2_pool_spawn_total{shape_key}     — spawn rate per shape (churn signal)
  ltx2_pool_eviction_total{shape_key}  — LRU eviction rate (pool-too-small signal)
  ltx2_pool_request_total{shape_key,status}        — DONE/ERROR/FATAL counts
  ltx2_pool_request_latency_seconds{shape_key}     — end-to-end latency, DONE path
  ltx2_pool_cold_spawn_seconds{shape_key}          — fork → READY (cold-start budget)
  ltx2_pool_subprocess_failure_total{reason}       — failure causes
                                                     (cuda_fault, spawn_timeout,
                                                      spawn_eof, spawn_parse_error,
                                                      gen_timeout, gen_eof, desync,
                                                      parse_error, send_failed)

Pool subprocesses do not have a Dynamo endpoint and therefore do not
register a callback; all accounting happens in the parent. The
FATAL-response branch in route() is the single point that records
cuda_fault, so subprocess-side state changes are observed exactly
once.
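
A minimal sketch of the registry wiring for three of the seven series, using prometheus_client; hooking the registry into Dynamo's /metrics via register_engine_metrics_callback is Dynamo-specific and not shown:

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# Dedicated registry so pool series stay separate from whatever Dynamo
# already exports; registered unconditionally so the series exist
# (zero-valued) even when LTX2_POOL_MODE=0 and dashboards stay stable.
POOL_REGISTRY = CollectorRegistry()

POOL_SIZE = Gauge(
    "ltx2_pool_size", "Current number of live pool subprocesses",
    registry=POOL_REGISTRY,
)
SPAWNS = Counter(
    "ltx2_pool_spawn_total", "Pool subprocess spawns",
    ["shape_key"], registry=POOL_REGISTRY,
)
REQUEST_LATENCY = Histogram(
    "ltx2_pool_request_latency_seconds", "End-to-end request latency (DONE path)",
    ["shape_key"], registry=POOL_REGISTRY,
)
```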

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sh orphan window

A pool subprocess loads FastVideo's multiproc_executor, which spawns a
Worker child that pins ~70 GiB of GPU memory for the model. If the
parent worker process dies mid-spawn (or mid-generation), the
subprocess and its descendants are reparented to PID 1 and continue
holding the GPU until container teardown -- on k8s, that's the
liveness-probe-driven pod restart window, during which the node is
effectively wedged.

prctl(PR_SET_PDEATHSIG, SIGTERM) asks the kernel to deliver SIGTERM
to the subprocess at the moment its parent thread dies. The existing
SIGTERM handler closes the protocol pipe and calls sys.exit(0),
which propagates SIGTERM to the FastVideo Worker child via normal
process-group teardown, freeing the GPU before the kernel can decide
the node needs intervention.

Linux-only; failure (e.g. running on macOS in CI, or libc.so.6
absent) is logged to stderr and the subprocess keeps running -- the
guard is defense-in-depth, not load-bearing.
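
A minimal sketch of the guard via ctypes, using the standard PR_SET_PDEATHSIG constant; the actual helper in this commit may differ in detail:

```python
import ctypes
import signal
import sys

PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>

def arm_parent_death_signal() -> None:
    """Ask the kernel to deliver SIGTERM to this process when its parent
    thread dies. Linux-only; failure is logged and otherwise ignored."""
    try:
        libc = ctypes.CDLL("libc.so.6", use_errno=True)
        if libc.prctl(PR_SET_PDEATHSIG, signal.SIGTERM) != 0:
            raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")
    except Exception as exc:  # e.g. macOS in CI: defense-in-depth, not load-bearing
        print(f"PR_SET_PDEATHSIG unavailable: {exc}", file=sys.stderr)
```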

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… RUNBOOK

Pool metrics flow through Dynamo's worker-side system_status_server,
which is disabled by default (DYN_SYSTEM_PORT=-1). Without explicit
deployment config, our registered callback is silently dormant and
operators see zero pool visibility -- the frontend's /metrics on the
HTTP port does not aggregate worker hierarchies. Validated this gap
in the Phase 2 e2e smoke; the misconfiguration would otherwise have
shipped to production unnoticed.

Two-line guard in backend_worker emits a WARNING at startup when
DYN_SYSTEM_PORT is unset or non-numeric (defensive int parse), and an
INFO line confirming the port when it's enabled. The misconfiguration
now surfaces in pod startup logs before any traffic flows.
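
A minimal sketch of what such a startup guard can look like; DYN_SYSTEM_PORT is from this commit, the logger wiring is illustrative:

```python
import logging
import os

logger = logging.getLogger(__name__)

def check_metrics_port() -> None:
    """Surface a disabled system status server in startup logs, before traffic."""
    raw = os.environ.get("DYN_SYSTEM_PORT", "")
    try:
        port = int(raw)
    except ValueError:
        port = -1  # unset or non-numeric: treat as disabled
    if port <= 0:
        logger.warning(
            "DYN_SYSTEM_PORT is %r: ltx2_pool_* metrics will not be exported", raw
        )
    else:
        logger.info("DYN_SYSTEM_PORT=%d: pool metrics available via /metrics", port)
```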

RUNBOOK gets a new ## Metrics section after ## Known limits naming
the seven ltx2_pool_* series, the auto-injected Dynamo hierarchy
labels, the required deployment env (DYN_SYSTEM_PORT,
DYN_SYSTEM_HOST, k8s containerPort, scrape config), and four
recommended Prometheus alert expressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>