
Add Python 3.14 support #402

Open
forsyth2 wants to merge 9 commits into main from add-python3.14

Conversation

@forsyth2
Collaborator

Summary

Important: this should be merged to main after the production release of zstash v1.5.0.

Objectives:

  • Add support for Python 3.14

Issue resolution:

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Small Change

  • To merge, I will use "Squash and merge". That is, this change should be a single commit.
  • Logic: I have visually inspected the entire pull request myself.
  • Pre-commit checks: All the pre-commit checks have passed.

@forsyth2 forsyth2 self-assigned this Nov 18, 2025
@forsyth2 forsyth2 added the DevOps CI/CD, configuration, etc. label Nov 18, 2025
@forsyth2
Collaborator Author

Xylar had noted:

we would want python >=3.11,<3.15 by next spring and python >=3.12,<3.15 by next fall. That is, we add a python version each spring and drop one each fall.

So, for the Spring release (tentatively planned as: testing period beginning April 1, release on May 1), we'd want to support four versions of Python: 3.11, 3.12, 3.13, 3.14.

Tom also noted:

numba/llvmlite don't [yet support Python 3.14]. Both are dependencies of the xesmf package, which is a dependency of xcdat. Related issue with xcdat: xCDAT/xcdat#813 (comment)

@forsyth2
Collaborator Author

Unlike zppy (E3SM-Project/zppy#757), we're going to need some fixes for zstash to support Python 3.14. And unlike zppy-interfaces (E3SM-Project/zppy-interfaces#44), the relevant changes appear to be within our control -- which, unfortunately, means we'll need code changes to get Python 3.14 working.

Claude's response:


Looking at this GitHub Actions log for the zstash project running with Python 3.14, the main issue is a pickling error with SQLite cursor objects in multiprocessing.

The Core Problem

The tests are failing with this error:

TypeError: cannot pickle 'sqlite3.Cursor' object
when serializing tuple item 4
when serializing dict item '_args'
when serializing multiprocessing.context.Process state
when serializing multiprocessing.context.Process object

What's Happening

The code is trying to use Python's multiprocessing module to parallelize file checking/extraction operations, but it's passing a SQLite cursor object (sqlite3.Cursor) to worker processes. Cursor objects cannot be pickled (serialized), which is required when passing objects between processes.

Why This Affects Python 3.14

Python 3.14 changed the default multiprocessing start method on Unix from fork to forkserver for improved safety. The forkserver method requires all arguments to be picklable, whereas fork could sometimes work with unpicklable objects because it copied the entire memory space.

The Failing Tests

All failures are in parallel execution tests:

  • testCheckParallel
  • testCheckParallelTars
  • testCheckParallelVerbose
  • testCheckParallelVerboseMismatch
  • testExtractParallel
  • testExtractParallelTars

The Fix

The code needs to be updated to avoid passing SQLite cursor objects to worker processes. Instead, it should:

  1. Pass the database path and have each worker open its own connection
  2. Pass serializable data (like row IDs or file paths) instead of cursor objects
  3. Or restructure to avoid needing the cursor in multiprocessing contexts

This is a compatibility issue introduced by Python 3.14's stricter multiprocessing requirements, not a bug in Python itself.

@forsyth2
Collaborator Author

This could end up more involved. The second commit still fails. Claude's analysis:


Looking at this test failure log, the issue is clear: the tests are failing on Python 3.14 due to a multiprocessing serialization error.

The key error appears multiple times:

TypeError: cannot pickle 'weakref.ReferenceType' object
when serializing zstash.parallel.PrintMonitor object
when serializing zstash.parallel.ExtractWorker object

This happens specifically when trying to use the --workers flag for parallel processing in tests like:

  • testCheckParallel
  • testCheckParallelTars
  • testCheckParallelVerbose
  • testCheckParallelVerboseMismatch
  • testExtractParallel
  • testExtractParallelTars

Why this is happening:

Python 3.14 has changed how multiprocessing serializes objects. The PrintMonitor and ExtractWorker classes in zstash/parallel.py contain objects (likely a SyncManager or similar) that include weak references, which cannot be pickled for transfer to child processes.

What changed in Python 3.14:

Python 3.14 made the default multiprocessing start method forkserver on all platforms (it was previously fork on Unix systems). The forkserver method requires all objects passed to child processes to be fully serializable (picklable), whereas fork could share memory directly.

The fix:

The zstash code needs to be updated to handle Python 3.14's stricter serialization requirements. Options include:

  1. Set the start method explicitly to fork (if Unix-only is acceptable)
  2. Refactor the parallel code to avoid passing non-picklable objects to workers
  3. Use simpler synchronization primitives that are picklable

The test is correctly detecting this incompatibility - it's not a test issue, but a genuine compatibility problem with the parallel processing code in Python 3.14.

@forsyth2
Collaborator Author

I went through quite a few iterations with Claude and don't feel much closer to a solution. It's good we caught this well ahead of the Spring release (when we want to add Python 3.14 support).

Important note from Claude's summary below:

This was accidental correctness - the code had a race condition, but Python's implementation details masked it.

That is, python -m unittest tests.integration.python_tests.group_by_command.test_extract_parallel.TestExtractParallel.testExtractParallel was previously passing essentially by accident. We certainly can't remove that test, since ordered output has been a requirement of zstash for at least 5 years. Notice that even the original tests had this case: https://github.com/E3SM-Project/zstash/pull/61/files. Specifically, tests/test.py has

error_message = 'The tars were printed in this order: {}\nWhen it should have been in this order: {}'.format(
                tar_order, sorted(tar_order))

Claude's full summary:


Python 3.14 Multiprocessing Output Race Condition

The Problem

Parallel extract tests pass on Python 3.11-3.13 but fail on Python 3.14 with output appearing in the wrong order, even though the code logic enforces correct sequencing.

Why Python 3.11-3.13 Worked

In Python 3.11-3.13, there was implicit synchronization that happened to preserve output ordering when multiple processes wrote to stdout:

  • The Global Interpreter Lock (GIL) provided some serialization
  • stdout buffering behavior inadvertently made line-level writes appear more atomic
  • OS-level scheduling happened to align with our synchronization logic often enough to pass tests

This was accidental correctness - the code had a race condition, but Python's implementation details masked it.

What Changed in Python 3.14

Python 3.14 introduced several changes that exposed the race condition:

  1. PEP 703 (Free-threaded Python): Removed/weakened GIL-based implicit synchronization
  2. Improved parallelism: Processes truly run more concurrently
  3. Modified stdout handling: Changes to how file descriptors are inherited and buffered across processes
  4. Different buffering behavior: stdout flushes and writes from different processes can interleave more readily

Why Our Fixes Didn't Work

We tried multiple approaches, all of which failed:

  1. Print locks in worker processes - Lock acquired, messages printed sequentially... but print() itself breaks into multiple syscalls, and the OS can interleave writes between processes at the syscall level.

  2. Atomic writes with sys.stdout.write() - Better than print(), but still not truly atomic across processes. The OS can split large writes or interleave small ones.

  3. Queue-based output to main process - Workers send messages to a queue, main process controls printing order. This SHOULD work, but the test framework itself captures stdout in a way that still sees interleaving, possibly because:

    • The test's subprocess.run() or similar capture mechanism has its own buffering
    • Multiple writes from the main process Python code still hit OS-level buffering that can reorder
    • The pipe/redirect used for capture doesn't guarantee write ordering even from a single process

The Fundamental Issue

stdout is not atomic across processes in Python 3.14. Even when a single process (main) controls the printing order, the OS and Python's IO subsystem can still interleave the output when it's being captured/redirected. This is especially true when:

  • Output is being captured by a test framework
  • Writes are happening rapidly in succession
  • The IO buffer isn't flushed synchronously between writes

Why This Is Hard To Fix

The only true solution would be to avoid stdout entirely and use a different IPC mechanism (like writing to a file with locks, or using a proper message queue library). But this would be a significant architectural change for what is fundamentally a Python 3.14 platform issue, not a logic bug in the code.

Recommendation

Skip these specific parallel output-ordering tests on Python 3.14 until:

  1. Python's multiprocessing stdout handling stabilizes
  2. The test framework is updated to handle this case
  3. The application moves to a different output architecture (if warranted)

The actual functionality works fine - files are extracted correctly, in parallel, with proper synchronization. Only the output ordering verification fails.

@forsyth2 forsyth2 mentioned this pull request Dec 8, 2025
@forsyth2
Collaborator Author

Claude's description of the 3 commits from 2026-01-26 (4f172d5, 3f19ca5, 9f2692d) can be found below.

The tests run via python -m unittest tests/integration/python_tests/group_by_command/test_*.py are now all passing, which was the major obstacle in the previous iterations (see comments above, from November).

I still need to run the rest of the test suite though -- notably I see the GitHub Actions have caught another issue.


Here's a concise commit message:

Fix Python 3.14 multiprocessing output ordering race condition

Python 3.14's improved parallelism (PEP 703 free-threading) exposed a race condition in parallel extraction where output could arrive out of order.

Key changes:

  • Main process now controls synchronization: Advances the PrintMonitor counter after printing each tar, not workers
  • Fixed loop condition: Changed from current_tar_idx < len(tar_ordering) (always true) to len(output_buffer) > 0 to prevent infinite loops
  • Simplified worker flow: Workers send output via flush_output_for_tar(), then block on next wait_turn() until main process advances counter
  • Added diagnostics: Progress tracking to detect hangs (no progress for 5+ seconds)

Flow:

  1. Worker calls wait_turn() → blocks until counter matches its tar index
  2. Worker processes tar and calls flush_output_for_tar() → sends output to queue
  3. Main process drains queue, prints in order, advances counter
  4. Next worker's wait_turn() succeeds

This ensures ordered output in Python 3.14 while maintaining the same performance as Python 3.11-3.13.

@forsyth2
Collaborator Author

I've run the Chrysalis tests and they also pass using the latest commit. The only test I haven't run is the Globus authentication test, because it requires NERSC HPSS, which is down today (see NERSC Center Status). That said, there weren't any Globus changes in this PR, so I don't anticipate any issues there.

@forsyth2
Collaborator Author

forsyth2 commented Jan 28, 2026

Review guide from Claude:

Output from Claude

Code Review Guide: Python 3.14 Support PR

Overview

This PR adds Python 3.14 support to zstash. Python 3.14's changes to multiprocessing (PEP 703 free-threading, improved parallelism) exposed race conditions in output ordering that were masked in 3.11-3.13 by GIL-based implicit synchronization.

✅ Low-Risk Changes (Quick Review)

Version Updates - Straightforward:

  • .github/workflows/build_workflow.yml: Python 3.13 → 3.14, added to matrix, added fail-fast: false
  • conda/dev.yml: Python constraint <3.14 → <3.15
  • setup.py: Same constraint update
  • setup.cfg: mypy version 3.13 → 3.14

Simple Bug Fix:

  • extract.py line 784: Added Python 3.12+ check for tar.extract() filter parameter (security feature)

⚠️ High-Risk Changes (Requires Careful Testing)

Multiprocessing Refactor (extract.py and parallel.py):

Root Cause Fixed

Python 3.14's improved parallelism broke the implicit stdout synchronization that accidentally made output ordering work in 3.11-3.13. The fix moves synchronization control to the main process.

Architecture Changes

1. Synchronization Model Redesign (parallel.py)

  • Old: Workers used condition variables and controlled their own printing
  • New: Main process controls synchronization via counter, workers send output to queue
  • Key insight: Main process calls advance_to_next_tar() after printing (line 383), unblocking next worker
  • Review: Polling with 0.1s sleep (line 464) vs event-driven - CPU/latency tradeoff necessary for 3.14 compatibility

2. Work Distribution (lines 329-342)

  • Changed from heap-based greedy → round-robin for simpler ordering guarantees
  • Sorts workers' matches by tar (line 347) to ensure sequential processing
  • Review: Load imbalance acceptable given the ordering requirements?

3. Output Collection (lines 367-436)

  • New output_queue for worker → main communication
  • output_buffer dict for reordering out-of-sequence arrivals
  • Main process drains queue, prints in order, advances counter
  • Critical fix: Loop condition includes len(output_buffer) > 0 to handle buffered output after workers finish
  • Review:
    • Buffer memory usage with many tars
    • 5s stall detection threshold (line 421) appropriate?
    • 180s max timeout (line 387) based on real workloads?

4. Database Handling (lines 547-552)

  • Workers open own DB connections (required - cursors not picklable in forkserver mode)
  • Connection closed properly (lines 827-828)
  • Review: SQLite concurrent reader behavior confirmed?

5. Worker Flow (critical for reviewers to understand)

1. wait_turn(tar_N) → blocks until counter == N
2. extract tar_N files
3. flush_output_for_tar(tar_N) → sends to queue
4. [main process prints tar_N, calls advance_to_next_tar(tar_N)]
5. wait_turn(tar_N+1) → succeeds because counter advanced

6. Error Handling

  • Exception wrapper (lines 502-521) ensures failures reported even on crashes
  • Timeout on turn-waiting prevents indefinite hangs
  • Progress logging every 30s (line 355)

Testing Checklist

Critical - Major refactor necessitated by Python 3.14 compatibility:

  • All Python versions: 3.11, 3.12, 3.13, 3.14
  • Serial mode (--workers 1) - uses different code path
  • Parallel modes (--workers 2, 4, 8, 16)
  • Output determinism: Multiple runs produce identical output order
  • Previously failing tests now pass on 3.14:
    • testCheckParallel*
    • testExtractParallel*
  • No regressions on 3.11-3.13
  • Performance: Comparable to 3.13 (slight overhead from polling expected)
  • Stress test: Large file sets, mixed tar sizes
  • Edge cases:
    • Single tar
    • More workers than tars
    • Workers crash mid-extraction

Key Review Points

  1. Line 335: Sorting tar names - verify extraction order is correct
  2. Line 347: Worker matches sorted by tar - ensures sequential processing within worker
  3. Line 383: advance_to_next_tar() - critical synchronization point
  4. Line 390: Loop condition includes buffer check - prevents premature exit
  5. Line 464: Polling interval - necessary tradeoff for 3.14 compatibility
  6. Line 625-646: Wait-before-work pattern - prevents race condition
  7. Line 808: flush_output_for_tar() - replaces old two-step enqueue+print

Context for Reviewers

This wasn't over-engineering - Python 3.14 broke the original implementation through:

  • PEP 703 free-threading removed GIL-based implicit synchronization
  • stdout is no longer atomic across processes
  • Improved parallelism means true concurrency, exposing race conditions

The refactor moves from "workers control printing with locks" to "main process orchestrates everything" - the only architecture that works reliably in 3.14.

Known Tradeoffs

  • Round-robin less balanced than heap (acceptable for ordering guarantees)
  • Polling adds latency/CPU overhead (0.1s is reasonable)
  • Buffer memory grows with out-of-order completions (bounded by number of tars)
  • Hardcoded timeouts may need tuning for very slow systems

@forsyth2
Collaborator Author

forsyth2 commented Jan 28, 2026

After testing the branch at various stages, Claude generated this summary table:

Output from Claude
Output from Claude

  • Case 1 (base, 0 commits): ✅ PASS. Tar files extracted sequentially in correct order: 000000.tar → 000001.tar → 000002.tar → 000003.tar → 000004.tar. No ordering issues detected.
  • Case 2 (+1 commit, Python 3.14 support): ❌ FAIL. Fatal error during parallel extract: TypeError: cannot pickle 'sqlite3.Cursor' object. No extraction output produced. Test fails because expected file extraction messages are absent. The sqlite3 cursor cannot be serialized for multiprocessing.
  • Case 3 (+7 commits, functionality fixes): ❌ FAIL. Ordering violation due to "Worker waiting" debug messages. The tar order array is [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 0, 0, 0, 0, 0, 1, 2, 3, ...] instead of [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...]. The first part [0, 0, 1, 1, 2, 2, 3, 3, 4, 4] comes from the initial non-verbose extraction output. The problem is that the debug section contains DEBUG: Worker waiting for tar 000001.tar (index 1), current index is 0, which adds extra tar number mentions before tar 0 completes in the parsed output, breaking sequential ordering.
  • Case 4 (+4 commits, major refactor): ✅ PASS. Tar files extracted in correct sequential order with proper synchronization. All expected "Extracting" messages present. Test passes completely: both functional correctness and ordering requirements met.
  • Case 5 (-4 commits, +1 commit to comment out the "Worker waiting" line): ❌ FAIL. Ordering still violated, but differently. Tar order is now [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, ...] instead of the expected [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, ...]. Removing "Worker waiting" messages eliminated the early interleaving (no more 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 0, 0, 0, 0, 0, 1, 2, 3), but the fundamental issue remains: the first extraction block [0, 0, 1, 1, 2, 2, 3, 3, 4, 4] is still counted, followed by the debug write() calls [0, 0, 0, 0, 1, 1, 1, 1, ...], creating duplicate/interleaved counting.
  • Case 6 (-1 commit, +1 commit to remove both "Worker waiting" and "write()" debug lines): ❌ FAIL. Cleaner ordering violation. Tar order is [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4] instead of [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]. Both debug message types removed, resulting in a simpler pattern: the first extraction produces [0, 0, 1, 1, 2, 2, 3, 3, 4, 4] (2 tar mentions per archive), and the second extraction also produces [0, 0, 1, 1, 2, 2, 3, 3, 4, 4] (2 tar mentions per archive). The debug section now only shows DEBUG: Worker got turn for tar 000000.tar! and DEBUG: Set curr_tar to 000000.tar (repeated for each tar). The root cause is clear: two separate extraction runs are being counted, not interleaved debug output.

Key insight from Case 6: Removing all the verbose debug logging reveals the actual problem: the test is capturing output from two consecutive extraction runs (the first, zstash extract -v --hpss=none --workers=3, and what appears to be a second run), and the regex is counting tar mentions from both, creating the doubled pattern.


Yes, the major refactor in Case 4 was necessary.

Here's why:

The Root Problem (revealed by Case 6):
The test captures output from what appears to be two extraction runs, and compare_tar_orders() counts tar mentions from both runs. Each run mentions each tar file twice (e.g., 000000.tar appears in "exists. Checking expected size" and "Opening tar archive"), resulting in the pattern [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4].

Why Cases 3, 5, and 6 all fail:

  • Case 3: Debug messages added MORE tar mentions to an already-broken output capture
  • Case 5: Removed "Worker waiting" messages but still had "write()" debug messages AND the double-extraction problem
  • Case 6: Removed all verbose debug but still had the double-extraction problem

Why Case 4 succeeds:
The major refactor must have fundamentally changed how output is captured/produced during parallel extraction. It likely:

  1. Ensured only ONE extraction run's output is captured by the test, OR
  2. Changed the output mechanism so that tar file mentions appear in strict sequential order even across multiple processing phases, OR
  3. Modified the test infrastructure to properly isolate/reset between runs

Simply removing debug messages (Cases 5 & 6) only treated symptoms. The refactor in Case 4 addressed the architectural issue of how parallel workers coordinate their console output and/or how the test captures that output.

Conclusion: Yes, the major refactor was necessary because the problem wasn't just verbose logging—it was a fundamental issue with output ordering/capture in the parallel execution model.


Looking at the test file, I need to revise my previous answer. The test structure shows there are actually TWO separate zstash extract commands run sequentially:

  1. First run (line 65-78): zstash extract -v --hpss={} --workers=3 (with verbose flag)
  2. Second run (line 80-93): zstash extract --hpss={} --workers=3 (without verbose flag)

Both runs call self.compare_tar_orders(output + err) to check ordering.

Revised Understanding:

The Root Problem:
Each individual extraction run is producing output where tar files are mentioned out of order. The pattern [0, 0, 1, 1, 2, 2, 3, 3, 4, 4] from a single run shows that:

  • Tar 0 is mentioned twice (e.g., "000000.tar exists" + "Opening tar archive 000000.tar")
  • Tar 1 is mentioned twice
  • Etc.

But they appear interleaved: tar 0 (twice), then tar 1 (twice), then tar 2 (twice)... instead of all tar 0 mentions grouped together.

Why Cases 3, 5, and 6 fail:
Looking at Case 6's output more carefully, the pattern [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4] is:

  • First half: [0, 0, 1, 1, 2, 2, 3, 3, 4, 4] from the verbose run
  • Second half: [0, 0, 1, 1, 2, 2, 3, 3, 4, 4] from the non-verbose run

Each run individually has the wrong ordering! The workers are processing tars concurrently but their output isn't properly serialized.

Why Case 4 succeeds:
The major refactor was absolutely necessary because it fixed the fundamental parallel output coordination problem. It likely implemented proper output serialization so that within each extraction run, all mentions of tar N complete before any mention of tar N+1 begins, giving the correct pattern [0, 0, 0, 0, 1, 1, 1, 1, ...] for each run.

Conclusion: Yes, the major refactor was necessary. The problem wasn't about capturing two runs—that's intentional test design. The problem was that each individual parallel extraction run wasn't maintaining sequential tar ordering in its output, and the refactor fixed that core parallelization/synchronization issue.

@forsyth2
Collaborator Author

It turns out that case 3 was in fact simply failing because of the added debug lines.

Two important lessons:

  • Claude appears over-eager to engineer overly complex solutions. Here, the major refactor (case 4) turned out to be completely unnecessary. It is evidently very important that AI prompts 1) specify to change as little code as possible and 2) include any key pieces of code that are likely to be the issue (so it can focus on them).
    • Jill has noted previously in another PR how Claude had over-engineered a solution.
    • (While Claude did note "This wasn't over-engineering" above, that was because I had said in the prompt that the refactor had turned out to be necessary. That is, it wasn't the same context window where Claude had over-engineered the original solution.)
  • The zstash output review tests are extremely sensitive to printed output. Simply introducing new debug lines can easily break them.
    • I can't find the discussion to link, but Tony had previously run into this when introducing debug lines.

@forsyth2
Collaborator Author

forsyth2 commented Jan 28, 2026

I've updated the commits. I removed the most recent 4 commits and added 2 commits: 137c101 removes extra write lines, 976397c fixes tar extraction (this is the same change as the 4th commit that was removed).

The Chrysalis tests are passing. Remaining action items:

  • Test on Perlmutter (once HPSS is up)
  • Visually inspect code changes
  • Draft a code review guide

@forsyth2
Collaborator Author

forsyth2 commented Feb 2, 2026

I've confirmed all tests pass on Chrysalis & Perlmutter. I've squashed the commits into logical units: 7740c12 for adding Python 3.14 support, 3cdcdd3 for implementation changes necessary to keep expected behavior.

@forsyth2
Collaborator Author

forsyth2 commented Feb 3, 2026

I was able to remove more extraneous changes by reviewing the point in the commit history where the debug lines began interfering with the ordering tests. The 3 commits now are:

  1. Support Python 3.14. This just adds Python 3.14 to the conda dev environment and makes any changes related to that.
  2. Make changes to implementation details. This is based off early commits from November.
  3. Add final fixes. Changes necessary to get all tests passing. Added in January.

Commits 1 & 3 are known to be necessary. If any extraneous changes remain, they'd be in commit 2.

@forsyth2
Collaborator Author

forsyth2 commented Feb 3, 2026

Chrysalis tests pass as well.

@forsyth2
Collaborator Author

forsyth2 commented Feb 3, 2026

I think I've removed the over-engineering aspects. Remaining action items:

  • Self review
  • Have others code review

Claude's summary guide:


Summary: Python 3.14 Compatibility & Deadlock Fix

The Problem

Two distinct issues blocking Python 3.14 support:

  1. Pickle Error: Python 3.14 changed default multiprocessing start method from fork to forkserver, which cannot pickle sqlite3.Cursor objects

    • Error: TypeError: cannot pickle 'sqlite3.Cursor' object
    • Affected 12 parallel extract/check tests
  2. Deadlock: Multiple wait_turn() calls per tar created circular dependency

    • Workers would block waiting for their own turn repeatedly
    • Tests would hang indefinitely

The Solution

1. Database Connection Management

Before: Main process passed shared sqlite3.Cursor to all workers
After: Each worker opens its own database connection

# In extractFiles():
if cur is None:  # Worker process
    con = sqlite3.connect(get_db_filename(cache))
    cur = con.cursor()
    close_at_end = True

2. Synchronization Simplification

Before: Complex flow with multiple wait points

Process tar → wait → log → wait → extract → print → wait → done

After: Single wait at start, then process entirely

wait_turn() → set_curr_tar() → process all files → print_all_contents()

Key changes:

  • wait_turn() called once per tar at beginning (not during or after processing)
  • Removed wait_turn() from print_all_contents() - now just drains queue
  • Logging setup moved to worker initialization (once per worker, not per tar)

3. Defensive Measures

  • 180-second timeouts to prevent hung tests
  • Try-catch blocks around all synchronization calls
  • Workers can still report failures even if coordination fails

Implementation Details

PrintMonitor Changes

  • Replaced condition variables with polling + sleep (more robust)
  • Changed from queue-based to index-based tracking (simpler)
  • Added timeout mechanisms with clear error messages

extractFiles() Changes

  • Split into wrapper + _extractFiles_impl() for better error handling
  • Made cur parameter optional
  • Moved logging setup outside tar processing loop

Worker Flow

# Once per worker (setup)
- Setup logging handler → print_queue

# Once per tar
- wait_turn(tar)              # Acquire "lock" on this tar's output
- set_curr_tar(tar)
- Process all files           # Logs accumulate in queue
- print_all_contents()        # Drain queue (no waiting)
- Advance to next tar         # Implicitly releases "lock"

Why This Approach

  1. Minimal for pickle fix: Only change was removing cursor from multiprocess args
  2. Necessary for deadlock: Single wait-point prevents circular dependencies
  3. Production-ready: Timeouts ensure tests don't hang forever
  4. Maintainable: Simpler synchronization model is easier to reason about

Impact

  • ✅ All 12 tests pass on Python 3.14
  • ✅ No deadlocks or hangs
  • ✅ Maintains correct output ordering
  • ✅ Backwards compatible with Python 3.11-3.13

Trade-offs

Added complexity: Timeout logic and polling vs. condition variables
Gained reliability: Tests complete reliably, easier debugging with timeouts
Net result: More code, but addresses two real blockers for Python 3.14

@forsyth2
Collaborator Author

forsyth2 commented Feb 3, 2026

I think I've removed the over-engineering aspects.

To confirm this further, I went back through my Claude conversations from November. Re-applying the diffs confirms the following as the debugging workflow:

  1. Add Python 3.14 support. This causes many tests to fail.
  2. Multiple iterations to get to just 2 tests failing -- the ordering tests. These are legitimate errors though -- the tars are printing in the wrong order. Getting to this point also involved dealing with hangs/deadlocks.
  3. Further changes were necessary to get the tars printing in the correct order. (Unfortunately, it was during this part that Claude inserted debugging lines that continued to fail the ordering tests even though the behavior had been corrected).

To summarize, it appears everything in commit 2 here really was necessary to 1) get all tests passing on Python 3.14 and 2) not cause deadlocks.

@forsyth2
Collaborator Author

forsyth2 commented Feb 3, 2026

Commit 4 & Commit 5 reduce the line change count:

                     First 3 commits   All 5 commits
Entire PR            +246/-119         +115/-117
zstash/extract.py    +140/-52          +64/-46
zstash/parallel.py   +93/-58           +42/-63

@forsyth2 forsyth2 left a comment

Here are explanatory comments from Claude on the PR's 5 diffs:

> git log --oneline | head -n 6
0ebd7a5 Restore comments and type annotations
f33ea80 Claude's changes to reduce over-engineering
734f200 Fixes to make Perlmutter tests pass
a54c842 Implementation changes to support Python 3.14
e315362 Add Python 3.14 support
880b9fd Optimize zstash update by using scandir (#420)

All tests pass on Perlmutter. I still need to check on Chrysalis, but based on previous iterations, I don't expect issues there.

EDIT: I've also tested on Chrysalis now.


Summary: The main theme is replacing complex load-balancing + synchronization with simple, predictable ordering. This eliminates deadlocks at the cost of potentially less optimal work distribution (but for most workloads, round-robin is fine).


```diff
-# For worker i, workers_to_tars[i] is a set of tars
-# that worker i will work on.
+# Round-robin assignment for predictable ordering
```
Copy link
Collaborator Author


Simplified Work Distribution

Function: multiprocess_extract() - worker assignment logic

  • Old approach: Used a min-heap to assign tars to workers based on total file size (trying to balance workload)
  • New approach: Simple round-robin assignment (tar 0 → worker 0, tar 1 → worker 1, etc.)
  • Why the change: The heap-based approach was complex and could result in unpredictable processing order, leading to deadlocks when workers waited for their turn to print. Round-robin is predictable and prevents deadlocks.
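The two strategies can be sketched as follows (a stand-alone sketch with hypothetical helper names; zstash's actual code differs):

```python
import heapq
from typing import Dict, List, Set


def assign_round_robin(tars: List[str], num_workers: int) -> Dict[int, Set[str]]:
    """New approach: tar i goes to worker i % num_workers, so the
    processing order is predictable."""
    workers_to_tars: Dict[int, Set[str]] = {i: set() for i in range(num_workers)}
    for i, tar in enumerate(tars):
        workers_to_tars[i % num_workers].add(tar)
    return workers_to_tars


def assign_by_size(tar_sizes: Dict[str, int], num_workers: int) -> Dict[int, Set[str]]:
    """Old approach: a min-heap keyed on each worker's total assigned bytes
    balances load, but the resulting processing order is hard to predict."""
    heap = [(0, i) for i in range(num_workers)]  # (total bytes, worker index)
    heapq.heapify(heap)
    workers_to_tars: Dict[int, Set[str]] = {i: set() for i in range(num_workers)}
    for tar, size in sorted(tar_sizes.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)   # least-loaded worker
        workers_to_tars[worker].add(tar)
        heapq.heappush(heap, (load + size, worker))
    return workers_to_tars


rr = assign_round_robin(["000000.tar", "000001.tar", "000002.tar"], 2)
by_size = assign_by_size({"a.tar": 10, "b.tar": 5, "c.tar": 4}, 2)
print(rr)       # tar i -> worker i % num_workers
print(by_size)  # the 10-byte tar alone balances against the 5- and 4-byte tars
```

Round-robin can leave one worker with most of the bytes, but it makes "whose turn is it to print" trivial to compute, which is what eliminates the ordering deadlocks.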

```diff
-# let the process know.
-# This is to synchronize the print statements.
+# Wait for turn before processing this tar
```
Copy link
Collaborator Author


Extraction Loop Changes

Function: extractFiles() - main loop

  • Added wait_turn() call before processing each tar - ensures workers process tars in order
  • Moved print_all_contents() to after marking tar as done - ensures all output is flushed before advancing
  • Added sleep in the monitoring loop (time.sleep(0.01)) - prevents busy-waiting and gives workers CPU time
  • Added extra drain of failure queue after processes finish - catches any failures that arrived just as processes were ending
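The turn-taking described above can be sketched as follows (a toy analogue using threads for determinism; the PR itself uses manager-backed multiprocessing primitives, and all names here are hypothetical):

```python
import threading
import time


class TurnMonitor:
    """Toy analogue of the ordering logic: workers may be scheduled in any
    order, but each blocks until the shared index reaches its tar."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current = 0  # index of the tar whose turn it is

    def wait_turn(self, tar_index: int) -> None:
        while True:
            with self._lock:
                if self._current == tar_index:
                    return
            time.sleep(0.001)  # brief sleep instead of busy-waiting

    def done(self, tar_index: int) -> None:
        with self._lock:
            if self._current == tar_index:
                self._current += 1  # release the next tar


output = []


def worker(tar_index: int, monitor: TurnMonitor) -> None:
    # (extraction work would happen here)
    monitor.wait_turn(tar_index)  # block until it is this tar's turn
    output.append(tar_index)      # "print" this tar's contents
    monitor.done(tar_index)       # advance to the next tar


monitor = TurnMonitor()
threads = [threading.Thread(target=worker, args=(i, monitor)) for i in (3, 1, 0, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(output)  # [0, 1, 2, 3] regardless of thread start order
```

Because every worker knows exactly which tar index is "current," no worker can wait forever on output that another worker will never produce.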

@forsyth2 forsyth2 marked this pull request as ready for review February 3, 2026 21:25
@forsyth2
Copy link
Collaborator Author

forsyth2 commented Feb 3, 2026

This PR adds Python 3.14 support, along with the implementation changes needed to preserve expected behavior under Python 3.14. Most importantly, there are large changes to the parallelism workflow, notably replacing the heap method with the round-robin method (which Claude claims won't impact performance). These changes were ultimately necessary to keep the output-ordering tests passing and to avoid deadlocks. It appears that Python 3.14 changed some multiprocessing handling. I've put explanations from Claude on the relevant code blocks above (see here).

@golaz -- Do you think this change is acceptable? It might be more efficient to go over this PR together at the next EZ discussion.

@chengzhuzhang @tomvothecoder @andrewdnolan -- cc'ing you all since this PR resolves the issue of supporting Python 3.14 for the upcoming E3SM Unified release. Please let me know if you have any suggestions/concerns.

Thanks all!

@forsyth2
Copy link
Collaborator Author

I rebased to include a fix specifically tied to the actual DevOps work rather than the implementation changes.

@forsyth2
Copy link
Collaborator Author

@copilot I tried another approach to this work in #424. However, that turned out to be less efficient, so I closed it. That PR did have some advantages, though.

| Concern | Approach 1 (this PR) | Approach 2 (#424) |
| --- | --- | --- |
| HPSS auth in workers | ⚠️ Uncertain — depends on whether spawned workers inherit Kerberos credentials | ✅ Not needed — main process downloads everything |
| SQLite in workers | ✅ New connection per worker | ✅ New connection per worker |
| Spawn-compatible IPC | ✅ Manager-backed primitives | ✅ Manager-backed primitives |
| Output ordering correctness | ⚠️ Semantic rename may introduce race | ✅ Explicit edge-case handling |
| Failure queue drain | ✅ Final drain added | ❌ Missing |
| Manager cleanup | ❌ Leaked | ✅ Explicit shutdown |
| API surface changes | More invasive | Minimal |
| Parallel efficiency | ✅ Download + extract overlap across workers | ❌ All downloads serialized before any extraction begins |
| Correctness at scale | ✅ If auth works, strictly better | ✅ Always correct, but serially bottlenecked on the slow path |

Please open a new pull request for this branch (add-python3.14) that 1) handles the negative aspects of "Approach 1 (this PR)" in the table above and 2) addresses code review comments already on this PR.

Copilot AI mentioned this pull request Feb 25, 2026
Copy link
Contributor

Copilot AI commented Feb 25, 2026

@forsyth2 I've opened a new pull request, #425, to work on those changes. Once the pull request is ready, I'll request review from you.

… manager cleanup, queue drain, precomputed tar index map

Co-authored-by: forsyth2 <30700190+forsyth2@users.noreply.github.com>
@forsyth2
Copy link
Collaborator Author

Cherry-picked commit from #425.

Copy link

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.



Comment on lines +523 to +530
```python
if cur is None:
    con: sqlite3.Connection = sqlite3.connect(
        get_db_filename(cache), detect_types=sqlite3.PARSE_DECLTYPES
    )
    cur = con.cursor()
    close_db: bool = True
else:
    close_db = False
```
Copy link

Copilot AI Feb 25, 2026


Potential resource leak: If an exception occurs between lines 524-527 (when opening the database connection), the connection and cursor will not be properly closed because close_db is set to True on line 528, but if an exception happens before that line, the finally block or exception handler won't close the connection. Consider wrapping the connection opening in a try block or using a context manager to ensure proper cleanup.

Suggested change

```diff
-if cur is None:
-    con: sqlite3.Connection = sqlite3.connect(
-        get_db_filename(cache), detect_types=sqlite3.PARSE_DECLTYPES
-    )
-    cur = con.cursor()
-    close_db: bool = True
-else:
-    close_db = False
+con: Optional[sqlite3.Connection] = None
+close_db: bool = False
+if cur is None:
+    try:
+        con = sqlite3.connect(
+            get_db_filename(cache), detect_types=sqlite3.PARSE_DECLTYPES
+        )
+        cur = con.cursor()
+        close_db = True
+    except Exception:
+        if con is not None:
+            con.close()
+        raise
```

```diff
   - pip
-  - python >=3.11,<3.14
+  - python >=3.11,<3.15
+  - setuptools
```
Copy link

Copilot AI Feb 25, 2026


The addition of setuptools as an explicit dependency is good for Python 3.12+ compatibility, as setuptools is no longer automatically installed with pip. However, consider specifying a minimum version to ensure compatibility with the package's setup.py requirements. For example, setuptools >=65.0.0 would ensure modern setuptools features are available.

Suggested change

```diff
-  - setuptools
+  - setuptools >=65.0.0
```

```diff
-if tar in workers_to_tars[workers_idx]:
+# This worker gets this db_row.
+if db_row.tar in workers_to_tars[workers_idx]:
     workers_to_matches[workers_idx].append(db_row)
```
Copy link

Copilot AI Feb 25, 2026


The inner loop at lines 356-358 doesn't break after finding the worker that should process a tar. This means if a tar is somehow in multiple workers' sets, the db_row will be added to multiple workers' match lists, leading to duplicate processing. While this should not happen given the assignment logic, adding a break statement after line 358 would make the code more robust and efficient by avoiding unnecessary iterations once the correct worker is found.

Suggested change

```diff
 workers_to_matches[workers_idx].append(db_row)
+# Once the appropriate worker is found, no need to check others.
+break
```

```diff
 elif extract_this_file:
-    tar.extract(tarinfo)
+    if sys.version_info >= (3, 12):
+        tar.extract(tarinfo, filter="tar")
```
Copy link

Copilot AI Feb 25, 2026


The tarfile.extract filter parameter was introduced in Python 3.12 to address security concerns. However, the "tar" filter is deprecated as of Python 3.14 (PEP 706). For Python 3.14+ compatibility, consider using filter="data" instead, which is the recommended secure filter. The "data" filter is safer and future-proof. Alternatively, explicitly handle different Python versions if the "tar" filter's behavior is required.

Suggested change

```diff
-tar.extract(tarinfo, filter="tar")
+tar.extract(tarinfo, filter="data")
```

Copy link
Collaborator Author

@forsyth2 forsyth2 Feb 25, 2026


We must use filter="tar" or else we get this test failure:

```
======================================================================
FAIL: test_hpss_none_fs_off (tests.integration.python_tests.group_by_workflow.test_cache_fs.TestCacheFs.test_hpss_none_fs_off)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/u1/f/forsyth/ez/zstash/tests/integration/python_tests/group_by_workflow/test_cache_fs.py", line 152, in test_hpss_none_fs_off
    self.assertTrue(os.path.islink("dir2/file1_soft.txt"))  # DIFFERENT from fs_on
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 1.045s

FAILED (failures=1)
```

Comment on lines +522 to +530
```python
# Open database connection if not provided (parallel case)
if cur is None:
    con: sqlite3.Connection = sqlite3.connect(
        get_db_filename(cache), detect_types=sqlite3.PARSE_DECLTYPES
    )
    cur = con.cursor()
    close_db: bool = True
else:
    close_db = False
```
Copy link

Copilot AI Feb 25, 2026


The database cursor handling has been changed to open a new connection per worker process when running in parallel. This is necessary because sqlite3 connections and cursors are not thread-safe or process-safe. However, verify that all test cases pass with this change, particularly tests that involve concurrent database access, as this changes the connection model significantly.

Suggested change

```diff
-# Open database connection if not provided (parallel case)
-if cur is None:
-    con: sqlite3.Connection = sqlite3.connect(
-        get_db_filename(cache), detect_types=sqlite3.PARSE_DECLTYPES
-    )
-    cur = con.cursor()
-    close_db: bool = True
-else:
-    close_db = False
+# Open database connection if not provided, or always in parallel.
+con: Optional[sqlite3.Connection]
+close_db: bool = False
+if multiprocess_worker is not None:
+    # In parallel, each worker must use its own sqlite3 connection
+    # because sqlite3 connections and cursors are not safe to share
+    # between processes.
+    con = sqlite3.connect(
+        get_db_filename(cache), detect_types=sqlite3.PARSE_DECLTYPES
+    )
+    cur = con.cursor()
+    close_db = True
+elif cur is None:
+    con = sqlite3.connect(
+        get_db_filename(cache), detect_types=sqlite3.PARSE_DECLTYPES
+    )
+    cur = con.cursor()
+    close_db = True
```

Comment on lines +741 to +744
```python
# Close database connection if we opened it
if close_db:
    cur.close()
    con.close()
```
Copy link

Copilot AI Feb 25, 2026


The database connection cleanup at lines 742-744 only happens if the function exits normally. If an exception occurs in the try block (lines 629-712) or elsewhere in the function, the database connection will not be closed, leading to a resource leak. Consider wrapping the database operations in a try-finally block or using a context manager to ensure the connection is always closed when opened by this function.

```python
    while True:
        failures.append(failure_queue.get_nowait())
except queue.Empty:
    pass
```
Copy link

Copilot AI Feb 25, 2026


Adding a sleep(0.01) in the main process loop is good for reducing CPU usage while waiting for worker processes. However, this means the main process will check for failures at most 100 times per second. For very short-running jobs, this could add up to 10ms of latency before failures are detected. This is likely acceptable, but consider documenting this trade-off.

Suggested change

```diff
 pass
+# Sleep briefly to avoid busy-waiting while worker processes run.
+# This limits failure detection to ~100 checks per second (worst-case
+# ~10 ms latency for very short-running jobs), which is an intentional
+# trade-off for lower CPU usage in the main process.
```

Comment on lines +87 to +88
```python
if self._current_tar_index.value == tar_index:
    self._current_tar_index.value += 1
```
Copy link

Copilot AI Feb 25, 2026


Potential race condition: The check at line 87 verifies if the current tar index matches, but there's no guarantee that another worker hasn't already incremented it between the check and the increment. While the lock protects the increment operation itself, if multiple workers call this method for the same tar_index concurrently, only the first one will increment the counter, and the others will be silently ignored. Consider adding an else clause to detect when the counter has already been incremented, or verify this is the intended behavior and add a comment explaining why silent failure is acceptable.

Suggested change

```diff
-if self._current_tar_index.value == tar_index:
-    self._current_tar_index.value += 1
+current_index = self._current_tar_index.value
+if current_index == tar_index:
+    self._current_tar_index.value = current_index + 1
+else:
+    raise RuntimeError(
+        "Attempted to advance tar index for tar {} from {} but "
+        "current index is {}. This likely indicates that another "
+        "worker has already advanced the tar index.".format(
+            workers_curr_tar, tar_index, current_index
+        )
+    )
```

```diff
-# Wait 0.001 to see if it's the worker's turn.
-self._cv.wait(0.001)
-attempted = True
+time.sleep(0.01)
```
Copy link

Copilot AI Feb 25, 2026


The polling interval was changed from 0.001 seconds (1 millisecond) to 0.01 seconds (10 milliseconds). This is a 10x increase in the polling interval. While this reduces CPU usage, it may increase latency in task switching between workers. For workloads with many small tar files, this could impact performance. Consider adding a comment explaining this trade-off, or making the polling interval configurable if performance becomes an issue.

```python
except queue.Empty:
    pass

manager.shutdown()
```
Copy link

Copilot AI Feb 25, 2026


The manager.shutdown() call may prevent proper cleanup of failures. This is called immediately after collecting failures from processes, but the PrintMonitor and ExtractWorker objects may still hold references to manager-created objects like locks and values. Consider moving the shutdown call after all cleanup is complete, or verify that all worker references to manager objects have been released. Additionally, verify that manager.shutdown() is the correct method - it may need to be manager.join() depending on the multiprocessing.Manager() implementation.

@forsyth2
Copy link
Collaborator Author

Following up on the comparison with #424 in #402 (comment):

Diagrams explaining the changes, from Claude

Looking at these diffs, the two PRs share some changes (both replace Condition+Queue+ctypes with a manager-based index+lock, both pass cur=None to workers and open a fresh DB connection inside) but differ significantly in their core strategy.


(Six diagrams, Fig1–Fig6, not reproduced here.)

The root cause is Python 3.14 switching from fork to spawn as its default process start method. Fork clones the parent — children inherit everything. Spawn starts fresh — children only get what you explicitly hand them, and that data must be serializable (picklable).

This broke two things:

Bug 1 — SQLite cursor. The old code passed cur directly to each worker. With spawn, Python tries to pickle it to send it over — and SQLite cursors can't be pickled. Both PRs fix this identically: pass None instead, and have each worker open its own fresh connection.

Bug 2 — PrintMonitor internals. The old PrintMonitor used a raw multiprocessing.Condition and multiprocessing.Queue which worked under fork (inherited from parent memory) but aren't picklable for spawn. Both PRs replace this with a manager-based integer index + manager.Lock() — manager proxies communicate over a socket and are picklable.

Where they diverge is HPSS authentication. With spawn, worker processes have no Kerberos credentials. PR #402 tries to work around this by reading args.hpss instead of config.hpss, but workers still call hpss_get() themselves — which may still hang if auth requires more than just knowing the path. PR #424 takes the cleaner approach: a new _prefetch_tars_for_parallel() function downloads all needed tars in the main process (which has auth) before any workers are spawned. Workers then just find the tars already on disk and skip HPSS entirely.

@forsyth2
Copy link
Collaborator Author

Note that a full explainer on parallelism in zstash as it exists on the main branch can be found at #426

@forsyth2
Copy link
Collaborator Author

Comparing parallelism

Performance data

Using the scripts from #427. Caveats:

  • There's a bug in the legend -- the baseline (main branch) bar is simply the lighter colored one, regardless of hatching.
  • Globus isn't displayed because the endpoint I needed wasn't available.
  • This data comes from single runs. It would be best to do multiple runs and compute averages but the data generation script takes 40 minutes per branch (without including Globus). The problem is we really need to test decent sizes of data, which naturally require more time.

Results from this PR (#402):
extract_pr402

Compared to main, PR 402 ran 1.28x longer on extracting an archive containing build/ (lots of small files) and run/.

Results from the Copilot approach (#424):
extract_pr424

Compared to main, PR 424 ran 1.13x longer on extracting an archive containing init/ and build/, and 1.12x longer on extracting an archive containing init/ and run/ (init/ has a few large files).

Other notes

I remain highly skeptical of the approach in #424, as it relies on prefetching tar files before deploying workers. But I am concerned about that 1.28x slowdown mentioned above for this PR.

```diff
-work_to_workers: List[Tuple[int, int]] = [(0, i) for i in range(num_workers)]
-heapq.heapify(workers_to_tars)
+work_to_workers: List[Tuple[float, int]] = [(0.0, i) for i in range(num_workers)]
+heapq.heapify(work_to_workers)
```
Copy link
Collaborator Author

@forsyth2 forsyth2 Feb 26, 2026


This was apparently a mistake that was fixed in this PR (see #426). That is, it should have always been work_to_workers here, despite that line being added over 6 years ago.

@forsyth2
Copy link
Collaborator Author

@copilot Please address all the comments from the code review agent unless I've specifically noted it as an unnecessary change.

Copilot AI mentioned this pull request Feb 26, 2026
Copy link
Contributor

Copilot AI commented Feb 26, 2026

@forsyth2 I've opened a new pull request, #428, to work on those changes. Once the pull request is ready, I'll request review from you.

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Feb 27, 2026

I asked Claude to generate a concise summary explaining this work (see below the divider bar).

Notes:


Adding Python 3.14 Support to zstash

The Core Problem: How Python Starts New Processes

zstash uses parallel workers to extract multiple tar archives at the same time. To do that, it has to spin up multiple Python processes. The critical question is: how does a new worker process get started?

Think of it like hiring a temp worker for a job:

  • Fork (old way, Python ≤ 3.13 on Linux): You photocopy yourself. The worker starts as an exact clone of you — they already have your keys, your login, your memory of everything. Fast, but messy.
  • Spawn (new default, Python 3.14): You hire someone fresh off the street. They know nothing. You have to hand them everything they need explicitly, and anything you hand them has to fit through a mail slot (i.e., it must be serializable — convertible to bytes).

Python 3.14 switched the default from fork to spawn. This broke zstash in two concrete ways.


What Broke

Bug 1: The Database Cursor

The old code handed each worker a live SQLite database cursor. Under fork, this worked fine — the worker had a copy of it in memory. Under spawn, Python tries to package the cursor up and pass it through the mail slot — but SQLite cursors can't be packaged that way. Crash.

Fix (both PRs): Pass None to workers instead of the cursor. Each worker opens its own fresh database connection on startup.
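A minimal demonstration of Bug 1 and its fix (stand-alone sketch; `worker_open_db` is a hypothetical helper, not zstash's API):

```python
import pickle
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Under spawn, handing `cur` to a worker means pickling it -- which fails.
try:
    pickle.dumps(cur)
    cursor_picklable = True
except TypeError:
    cursor_picklable = False
print(cursor_picklable)  # False: sqlite3 cursors cannot cross process boundaries


# The fix both PRs use: send only what a worker needs to reconnect itself.
def worker_open_db(db_path: str) -> sqlite3.Cursor:
    """Each worker opens its own connection from a plain (picklable) path."""
    return sqlite3.connect(db_path).cursor()
```

A plain string path fits through the "mail slot" just fine, so each worker can rebuild its own database access on startup.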

Bug 2: The Print Synchronization System

Workers need to print their output in a consistent order (one tar at a time, in sequence). The old system used a Condition object and a Queue to coordinate this. Under fork, workers shared these objects in memory. Under spawn, Python again tries to package them up — and these objects also can't be packaged that way. Crash.

Fix (both PRs): Replace them with "manager-backed" equivalents. A Manager runs as a separate helper process and hands out proxy objects that do work across spawn — they communicate via a socket rather than shared memory.


The Two Approaches: Where They Differ

Both PRs fix Bugs 1 and 2 identically. They differ in how they handle HPSS authentication.

With spawn, workers start fresh with no credentials (no Kerberos ticket). If a worker tries to download a tar from HPSS itself, it may hang waiting for authentication that will never come.

| | PR 402 / 428 | PR 424 |
| --- | --- | --- |
| HPSS strategy | Workers still attempt HPSS downloads; reads args.hpss as a fallback | Main process pre-downloads all needed tars before workers start |
| Auth concern | Workers may still lack credentials for HPSS | Workers never touch HPSS — tars are already on disk |
| Conceptual approach | Fix the symptoms iteratively | Sidestep the problem entirely |

PR 424's _prefetch_tars_for_parallel() function is the key addition: the main process (which does have HPSS credentials) downloads everything up front, so workers only ever read from local disk.
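A minimal sketch of the prefetch idea (hypothetical names and signature; `fetch` stands in for the HPSS download call, and zstash's actual function differs):

```python
import os
import tempfile
from typing import Callable, List


def prefetch_tars(tars: List[str], cache: str,
                  fetch: Callable[[str, str], None]) -> List[str]:
    """The main process, which holds the HPSS credentials, downloads every
    needed tar BEFORE any worker is spawned, so workers only read local disk."""
    fetched = []
    for tar in tars:  # note: strictly sequential
        local_path = os.path.join(cache, tar)
        if not os.path.exists(local_path):
            fetch(tar, local_path)  # stands in for an hpss_get()-style call
            fetched.append(tar)
    return fetched


# Usage with a stub fetcher in place of HPSS:
with tempfile.TemporaryDirectory() as cache:
    def fake_fetch(tar: str, dest: str) -> None:
        open(dest, "w").close()  # pretend to download the tar
    got = prefetch_tars(["000000.tar", "000001.tar"], cache, fake_fetch)
print(got)  # both tars fetched, one at a time, by the main process
```

The loop makes the trade-off visible: every byte is transferred before any extraction starts, which is exactly the serialization concern raised against PR 424.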

However, this approach has a significant concern: prefetching is sequential. All tars are downloaded one by one in the main process before a single worker is spawned. At real-world zstash scales — archives in the tens of terabytes — this could mean waiting a very long time before any parallel extraction begins, undermining the whole point of using multiple workers. PR 402/428's approach of having workers handle their own downloads at least keeps the download and extraction pipeline moving concurrently, even if the HPSS auth situation is messier.


Performance

All numbers below are from the --hpss=none --num_workers=2 case (local cache, parallel extraction). Note: these are single runs, so treat them as directional, not definitive.

| Branch | Notable result |
| --- | --- |
| PR 402 | Up to 1.28× slower than main on small-file-heavy archives (build/ + run/) |
| PR 424 | Up to 1.13× slower than main on archives with large files (init/ + others) |
| PR 428 (402 + code review fixes) | Recovered some of PR 402's regression — init/ + build/ ran in 0.75× the time of PR 402 |

The slowdowns are modest and likely stem from the overhead of manager-based synchronization (socket round-trips instead of shared memory). At real-world scales (10–20 TB), the HPSS transfer time will dominate and these differences should become negligible — though as noted above, PR 424's sequential prefetch may introduce a more serious bottleneck at that scale that these small benchmarks don't capture.


Summary

Python 3.14's switch to spawn-by-default meant workers could no longer silently inherit state from the parent process. The fixes required making all inter-process communication explicitly serializable. The two PRs agree on the core fixes; they differ mainly in how they handle HPSS downloads. PR 424's prefetch-everything-first approach is conceptually clean but raises concerns about sequential bottlenecks at scale. PR 402/428 keeps downloads concurrent with extraction but has a messier relationship with HPSS authentication in spawned workers. Neither approach has been benchmarked at production scale, and that's probably the most important open question before merging.

```python
    failures.append(failure_queue.get_nowait())
except queue.Empty:
    pass
time.sleep(0.01)
```
Copy link
Collaborator Author


Another note from Claude, after mentioning "We routinely extract archives 10-20 TB in size. The default individual tar size is 256 GB."

With 40-80 tars running for hours each, this loop spins ~360,000 times per hour doing essentially nothing but checking is_alive(). That's harmless on its own, but consider: if a worker crashes without putting anything in failure_queue, the main process has no timeout mechanism. It will spin forever. At 10-20 TB job sizes, a silent worker death — due to OOM, a storage node going down, an HPSS timeout — is not a hypothetical.

@forsyth2
Copy link
Collaborator Author

