Conversation
Xylar had noted:
So, for the Spring release (tentatively planned as: testing period beginning April 1, release on May 1), we'd want to support four versions of Python: 3.11, 3.12, 3.13, 3.14. Tom also noted:
Claude's response:

> Looking at this GitHub Actions log for the zstash project running with Python 3.14, the main issue is a pickling error with SQLite cursor objects in multiprocessing.
>
> The Core Problem: The tests are failing with this error: […]
>
> What's Happening: The code is trying to use Python's […]
>
> Why This Affects Python 3.14: Python 3.14 changed the default multiprocessing start method on Unix from […]
>
> The Failing Tests: All failures are in parallel execution tests: […]
>
> The Fix: The code needs to be updated to avoid passing SQLite cursor objects to worker processes. Instead, it should: […]
>
> This is a compatibility issue introduced by Python 3.14's stricter multiprocessing requirements, not a bug in Python itself.
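The fix pattern described above can be sketched in isolation. This is a hedged illustration, not zstash's actual code: the `files` table and `worker` helper are hypothetical. It shows that an sqlite3 cursor cannot cross a process boundary via pickling, and that passing the database *path* instead lets each worker open its own connection.

```python
import os
import pickle
import sqlite3
import tempfile

# 1) sqlite3 cursors are not picklable, so they cannot be sent to
#    spawned worker processes as arguments.
cursor = sqlite3.connect(":memory:").cursor()
try:
    pickle.dumps(cursor)
    cursor_picklable = True
except TypeError:
    cursor_picklable = False
print("cursor picklable:", cursor_picklable)  # False

# 2) The fix: pass the database path (a plain string, always picklable)
#    and let each worker open and close its own connection.
def worker(db_path: str) -> int:
    con = sqlite3.connect(db_path)
    try:
        (n,) = con.execute("SELECT COUNT(*) FROM files").fetchone()
        return n
    finally:
        con.close()

db_path = os.path.join(tempfile.mkdtemp(), "index.db")
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE files (tar TEXT)")
con.executemany("INSERT INTO files VALUES (?)", [("000000.tar",), ("000001.tar",)])
con.commit()
con.close()
print("rows seen by worker:", worker(db_path))  # 2
```

The same pattern applies to any unpicklable handle (file objects, locks created outside a multiprocessing context): send the recipe for recreating the resource, not the resource itself.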
|
This could end up more involved. The second commit still fails. Claude's analysis:

> Looking at this test failure log, the issue is clear: the tests are failing on Python 3.14 due to a multiprocessing serialization error. The key error appears multiple times:
>
> ```
> TypeError: cannot pickle 'weakref.ReferenceType' object
>   when serializing zstash.parallel.PrintMonitor object
>   when serializing zstash.parallel.ExtractWorker object
> ```
>
> This happens specifically when trying to use the […]
>
> Why this is happening: Python 3.14 has changed how multiprocessing serializes objects. The […]
>
> What changed in Python 3.14: Python 3.14 made the default multiprocessing start method […]
>
> The fix: The zstash code needs to be updated to handle Python 3.14's stricter serialization requirements. Options include: […]
>
> The test is correctly detecting this incompatibility - it's not a test issue, but a genuine compatibility problem with the parallel processing code in Python 3.14.
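The weakref error is easy to reproduce in isolation. The class below is a toy stand-in (not zstash's actual `PrintMonitor`) that holds a weak reference, which is exactly the kind of attribute that makes an object unserializable when multiprocessing must pickle worker arguments:

```python
import pickle
import weakref

class MonitorLike:
    """Toy stand-in for an object holding a weak reference internally."""
    def __init__(self, target):
        self._ref = weakref.ref(target)  # weakrefs cannot be pickled

class Target:
    pass

target = Target()
monitor = MonitorLike(target)
try:
    pickle.dumps(monitor)
    serializable = True
except TypeError as err:
    serializable = False
    print(err)  # e.g. cannot pickle 'weakref.ReferenceType' object
```

Under fork, such objects never needed to be pickled because child processes inherited them; under a non-fork start method, every `Process` argument goes through pickle, surfacing errors like this.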
Force-pushed from d619ff1 to 1f1e8e1.
|
I went through quite a few iterations with Claude and don't feel particularly closer to a solution. It's good we caught this well ahead of the Spring release (when we want to add Python 3.14 support). Important note from Claude's summary below:
That is, Claude's full summary:

> Python 3.14 Multiprocessing Output Race Condition
>
> The Problem: Parallel extract tests pass on Python 3.11-3.13 but fail on Python 3.14 with output appearing in the wrong order, even though the code logic enforces correct sequencing.
>
> Why Python 3.11-3.13 Worked: In Python 3.11-3.13, there was implicit synchronization that happened to preserve output ordering when multiple processes wrote to stdout: […]
>
> This was accidental correctness - the code had a race condition, but Python's implementation details masked it.
>
> What Changed in Python 3.14: Python 3.14 introduced several changes that exposed the race condition: […]
>
> Why Our Fixes Didn't Work: We tried multiple approaches, all of which failed: […]
>
> The Fundamental Issue: stdout is not atomic across processes in Python 3.14. Even when a single process (main) controls the printing order, the OS and Python's IO subsystem can still interleave the output when it's being captured/redirected. This is especially true when: […]
>
> Why This Is Hard To Fix: The only true solution would be to avoid stdout entirely and use a different IPC mechanism (like writing to a file with locks, or using a proper message queue library). But this would be a significant architectural change for what is fundamentally a Python 3.14 platform issue, not a logic bug in the code.
>
> Recommendation: Skip these specific parallel output-ordering tests on Python 3.14 until: […]
>
> The actual functionality works fine - files are extracted correctly, in parallel, with proper synchronization. Only the output ordering verification fails.
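The recommendation above can be sketched with a version-gated skip. This assumes a unittest-style suite (the test names here are hypothetical); only the ordering verification is guarded, so the extraction behavior itself stays covered on 3.14:

```python
import sys
import unittest

# Skip only the ordering verification on Python 3.14+; extraction
# itself remains tested on every version.
skip_ordering_on_314 = unittest.skipIf(
    sys.version_info >= (3, 14),
    "stdout interleaving across processes is not deterministic on Python 3.14",
)

class TestParallelExtract(unittest.TestCase):
    def test_files_extracted(self):
        # Runs on every Python version.
        self.assertTrue(True)

    @skip_ordering_on_314
    def test_output_ordering(self):
        # Only runs on Python < 3.14.
        self.assertTrue(True)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestParallelExtract)
result = unittest.TestResult()
suite.run(result)
print(f"ran={result.testsRun} skipped={len(result.skipped)} failures={len(result.failures)}")
```

The skip message then shows up in the test report, documenting why the check is disabled rather than silently dropping coverage.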
|
Claude's description of the 3 commits from 2026-01-26 (4f172d5, 3f19ca5, 9f2692d) can be found below. The tests run via […]. I still need to run the rest of the test suite though -- notably, I see the GitHub Actions have caught another issue.

Here's a concise commit message:

> Fix Python 3.14 multiprocessing output ordering race condition
>
> Python 3.14's improved parallelism (PEP 703 free-threading) exposed a race condition in parallel extraction where output could arrive out of order. Key changes: […]
>
> Flow: […]
>
> This ensures ordered output in Python 3.14 while maintaining the same performance as Python 3.11-3.13.
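The ordered-output flow can be sketched with a turn counter and a condition variable. This illustration uses threads and a queue for brevity and determinism (the real code coordinates processes and writes to stdout, and the names here are illustrative, not zstash's), but the turn-taking idea is the same: work may finish in any order, yet output is emitted strictly in tar order.

```python
import threading
from queue import Queue

def worker(tar_index: int, state: dict, cond: threading.Condition, out: Queue) -> None:
    # ... extraction work for this tar could happen here, concurrently ...
    # Then wait for this tar's turn before emitting output, so messages
    # always appear in tar order no matter which worker finishes first.
    with cond:
        while state["turn"] != tar_index:
            cond.wait()
        out.put(f"extracted tar {tar_index}")
        state["turn"] += 1   # advance the turn to the next tar
        cond.notify_all()    # wake workers waiting for their turn

def run(num_tars: int = 3) -> list:
    state, cond, out = {"turn": 0}, threading.Condition(), Queue()
    threads = [
        threading.Thread(target=worker, args=(i, state, cond, out))
        for i in range(num_tars)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [out.get() for _ in range(num_tars)]

print(run())  # ['extracted tar 0', 'extracted tar 1', 'extracted tar 2']
```

Because each worker refuses to emit until the shared counter reaches its index, the output order is deterministic even though the workers run concurrently.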
|
I've run the Chrysalis tests and they also pass using the latest commit. The only test I haven't run is the Globus authentication test, because it requires NERSC HPSS, which is down today (see NERSC Center Status). That said, there weren't any Globus changes in this PR, so I don't anticipate any issues there.
|
Review guide from Claude:

> Code Review Guide: Python 3.14 Support PR
>
> Overview: This PR adds Python 3.14 support to zstash. Python 3.14's changes to multiprocessing (PEP 703 free-threading, improved parallelism) exposed race conditions in output ordering that were masked in 3.11-3.13 by GIL-based implicit synchronization.
>
> ✅ Low-Risk Changes (Quick Review)
>
> Version Updates - Straightforward: […]
>
> Simple Bug Fix: […]
|
|
After some testing of the branch at various stages, Claude was able to generate this summary table:

Output from Claude:

> Key insight from Case 6: Removing all the verbose debug logging reveals the actual problem—the test is capturing output from two consecutive extraction runs (both the first […]
>
> Yes, the major refactor in Case 4 was necessary. Here's why:
>
> The Root Problem (revealed by Case 6): […]
>
> Why Cases 3, 5, and 6 all fail: […]
>
> Why Case 4 succeeds: […]
>
> Simply removing debug messages (Cases 5 & 6) only treated symptoms. The refactor in Case 4 addressed the architectural issue of how parallel workers coordinate their console output and/or how the test captures that output.
>
> Conclusion: Yes, the major refactor was necessary because the problem wasn't just verbose logging—it was a fundamental issue with output ordering/capture in the parallel execution model.
>
> Looking at the test file, I need to revise my previous answer. The test structure shows there are actually TWO separate […]. Both runs call […]
>
> Revised Understanding:
>
> The Root Problem: […] But they appear interleaved: tar 0 (twice), then tar 1 (twice), then tar 2 (twice)... instead of all tar 0 mentions grouped together.
>
> Why Cases 3, 5, and 6 fail: Each run individually has the wrong ordering! The workers are processing tars concurrently but their output isn't properly serialized.
>
> Why Case 4 succeeds: […]
>
> Conclusion: Yes, the major refactor was necessary. The problem wasn't about capturing two runs—that's intentional test design. The problem was that each individual parallel extraction run wasn't maintaining sequential tar ordering in its output, and the refactor fixed that core parallelization/synchronization issue.
|
It turns out that case 3 was in fact simply failing because of the added debug lines. Two important lessons:
|
Force-pushed from 9ac2940 to 976397c.
|
I've updated the commits. I removed the most recent 4 commits and added 2 commits: 137c101 removes extra write lines, 976397c fixes tar extraction (this is the same change as the 4th commit that was removed). The Chrysalis tests are passing. Remaining action items:
|
Force-pushed from 976397c to 3cdcdd3.
Force-pushed from 3cdcdd3 to 734f200.
|
I was able to remove more extraneous changes by reviewing the point in the commit history where the debug lines began interfering with the ordering tests. The 3 commits now are:
Commits 1 & 3 are known to be necessary. If any extraneous changes remain, they'd be in commit 2.
|
Chrysalis tests pass as well.
|
I think I've removed the over-engineering aspects. Remaining action items:
Claude's summary guide:

> Summary: Python 3.14 Compatibility & Deadlock Fix
>
> The Problem: Two distinct issues blocking Python 3.14 support: […]
>
> The Solution:
>
> 1. Database Connection Management
>
> Before: Main process passed shared […]
>
> ```python
> # In extractFiles():
> if cur is None:  # Worker process
>     con = sqlite3.connect(get_db_filename(cache))
>     cur = con.cursor()
>     close_at_end = True
> ```
>
> 2. Synchronization Simplification
>
> Before: Complex flow with multiple wait points
> After: Single wait at start, then process entirely
>
> Key changes: […]
>
> 3. Defensive Measures: […]
>
> Implementation Details: […]
|
To confirm this further, I went back through my Claude conversations from November. Re-applying the diffs confirms the following as the debugging workflow:
To summarize, it appears everything in commit 2 here really was necessary to 1) get all tests passing on Python 3.14 and 2) not cause deadlocks.
Here are explanatory comments from Claude regarding the PR, which now consists of these 5 diffs:
```
> git log --oneline | head -n 6
0ebd7a5 Restore comments and type annotations
f33ea80 Claude's changes to reduce over-engineering
734f200 Fixes to make Perlmutter tests pass
a54c842 Implementation changes to support Python 3.14
e315362 Add Python 3.14 support
880b9fd Optimize zstash update by using scandir (#420)
```
All tests pass on Perlmutter. I still need to check on Chrysalis, but based on previous iterations, I don't expect issues there.
EDIT: I've also tested on Chrysalis now.
Summary: The main theme is replacing complex load-balancing + synchronization with simple, predictable ordering. This eliminates deadlocks at the cost of potentially less optimal work distribution (but for most workloads, round-robin is fine).
zstash/extract.py (outdated diff):

```python
# For worker i, workers_to_tars[i] is a set of tars
# that worker i will work on.
# Round-robin assignment for predictable ordering
```
Simplified Work Distribution
Function: multiprocess_extract() - worker assignment logic
- Old approach: Used a min-heap to assign tars to workers based on total file size (trying to balance workload)
- New approach: Simple round-robin assignment (tar 0 → worker 0, tar 1 → worker 1, etc.)
- Why the change: The heap-based approach was complex and could result in unpredictable processing order, leading to deadlocks when workers waited for their turn to print. Round-robin is predictable and prevents deadlocks.
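The round-robin assignment described above amounts to only a few lines. This is a hedged sketch using a dict of lists (the PR itself uses sets), not the PR's literal code:

```python
from typing import Dict, List

def assign_round_robin(num_tars: int, num_workers: int) -> Dict[int, List[int]]:
    """tar 0 -> worker 0, tar 1 -> worker 1, ..., wrapping around."""
    workers_to_tars: Dict[int, List[int]] = {w: [] for w in range(num_workers)}
    for tar in range(num_tars):
        workers_to_tars[tar % num_workers].append(tar)
    return workers_to_tars

print(assign_round_robin(5, 2))  # {0: [0, 2, 4], 1: [1, 3]}
```

Unlike the old size-balancing min-heap, the worker that owns tar N is a pure function of N, so the processing (and therefore printing) order is predictable, which is what prevents the print-turn deadlocks.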
zstash/extract.py (outdated diff):

```diff
-# let the process know.
-# This is to synchronize the print statements.
+# Wait for turn before processing this tar
```
Extraction Loop Changes
Function: extractFiles() - main loop
- Added wait_turn() call before processing each tar - ensures workers process tars in order
- Moved print_all_contents() to after marking tar as done - ensures all output is flushed before advancing
- Added sleep in the monitoring loop (time.sleep(0.01)) - prevents busy-waiting and gives workers CPU time
- Added extra drain of failure queue after processes finish - catches any failures that arrived just as processes were ending
|
This PR adds Python 3.14 support and makes implementation changes to preserve expected behavior under Python 3.14. Most importantly, there are large changes to the parallelism workflow, notably replacing the heap method with the round-robin method (which Claude claims won't impact performance). These changes were ultimately necessary to keep the output ordering tests passing and to avoid deadlocks. It seems that Python 3.14 has changed some multiprocessing handling. I've put explanations from Claude on relevant code blocks above (see here). @golaz -- Do you think this change is acceptable? It might be more efficient to go over this PR together at the next EZ discussion. @chengzhuzhang @tomvothecoder @andrewdnolan -- cc'ing you all since this PR resolves the issue of supporting Python 3.14 for the upcoming E3SM Unified release. Please let me know if you have any suggestions/concerns. Thanks all!
Co-authored-by: forsyth2 <30700190+forsyth2@users.noreply.github.com>
Force-pushed from c0498d3 to 72ef3ac.
|
I rebased to include a fix specifically linked to the actual DevOps work rather than the implementation changes.
|
@copilot I tried another approach to this work in #424. However that turned out to be less efficient, so I closed it. That PR did have some advantages though.
Please open a new pull request for this branch (
… manager cleanup, queue drain, precomputed tar index map Co-authored-by: forsyth2 <30700190+forsyth2@users.noreply.github.com>
|
Cherry-picked commit from #425. |
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.
```python
if cur is None:
    con: sqlite3.Connection = sqlite3.connect(
        get_db_filename(cache), detect_types=sqlite3.PARSE_DECLTYPES
    )
    cur = con.cursor()
    close_db: bool = True
else:
    close_db = False
```
Potential resource leak: If an exception occurs between lines 524-527 (when opening the database connection), the connection and cursor will not be properly closed because close_db is set to True on line 528, but if an exception happens before that line, the finally block or exception handler won't close the connection. Consider wrapping the connection opening in a try block or using a context manager to ensure proper cleanup.
Suggested change:

```python
con: Optional[sqlite3.Connection] = None
close_db: bool = False
if cur is None:
    try:
        con = sqlite3.connect(
            get_db_filename(cache), detect_types=sqlite3.PARSE_DECLTYPES
        )
        cur = con.cursor()
        close_db = True
    except Exception:
        if con is not None:
            con.close()
        raise
```
```diff
 - pip
-- python >=3.11,<3.14
+- python >=3.11,<3.15
 - setuptools
```
The addition of setuptools as an explicit dependency is good for Python 3.12+ compatibility, as setuptools is no longer automatically installed with pip. However, consider specifying a minimum version to ensure compatibility with the package's setup.py requirements. For example, setuptools >=65.0.0 would ensure modern setuptools features are available.
Suggested change:

```diff
-- setuptools
+- setuptools >=65.0.0
```
```diff
-if tar in workers_to_tars[workers_idx]:
 # This worker gets this db_row.
+if db_row.tar in workers_to_tars[workers_idx]:
     workers_to_matches[workers_idx].append(db_row)
```
The inner loop at lines 356-358 doesn't break after finding the worker that should process a tar. This means if a tar is somehow in multiple workers' sets, the db_row will be added to multiple workers' match lists, leading to duplicate processing. While this should not happen given the assignment logic, adding a break statement after line 358 would make the code more robust and efficient by avoiding unnecessary iterations once the correct worker is found.
Suggested change:

```python
workers_to_matches[workers_idx].append(db_row)
# Once the appropriate worker is found, no need to check others.
break
```
```diff
 elif extract_this_file:
-    tar.extract(tarinfo)
+    if sys.version_info >= (3, 12):
+        tar.extract(tarinfo, filter="tar")
```
The tarfile.extract filter parameter was introduced in Python 3.12 to address security concerns. However, the "tar" filter is deprecated as of Python 3.14 (PEP 706). For Python 3.14+ compatibility, consider using filter="data" instead, which is the recommended secure filter. The "data" filter is safer and future-proof. Alternatively, explicitly handle different Python versions if the "tar" filter's behavior is required.
Suggested change:

```diff
-tar.extract(tarinfo, filter="tar")
+tar.extract(tarinfo, filter="data")
```
We must use `filter="tar"` or else we get this test failure:

```
======================================================================
FAIL: test_hpss_none_fs_off (tests.integration.python_tests.group_by_workflow.test_cache_fs.TestCacheFs.test_hpss_none_fs_off)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/global/u1/f/forsyth/ez/zstash/tests/integration/python_tests/group_by_workflow/test_cache_fs.py", line 152, in test_hpss_none_fs_off
    self.assertTrue(os.path.islink("dir2/file1_soft.txt"))  # DIFFERENT from fs_on
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: False is not true
----------------------------------------------------------------------
Ran 1 test in 1.045s
FAILED (failures=1)
```
```python
# Open database connection if not provided (parallel case)
if cur is None:
    con: sqlite3.Connection = sqlite3.connect(
        get_db_filename(cache), detect_types=sqlite3.PARSE_DECLTYPES
    )
    cur = con.cursor()
    close_db: bool = True
else:
    close_db = False
```
The database cursor handling has been changed to open a new connection per worker process when running in parallel. This is necessary because sqlite3 connections and cursors are not thread-safe or process-safe. However, verify that all test cases pass with this change, particularly tests that involve concurrent database access, as this changes the connection model significantly.
Suggested change:

```python
# Open database connection if not provided, or always in parallel.
con: Optional[sqlite3.Connection]
close_db: bool = False
if multiprocess_worker is not None:
    # In parallel, each worker must use its own sqlite3 connection
    # because sqlite3 connections and cursors are not safe to share
    # between processes.
    con = sqlite3.connect(
        get_db_filename(cache), detect_types=sqlite3.PARSE_DECLTYPES
    )
    cur = con.cursor()
    close_db = True
elif cur is None:
    con = sqlite3.connect(
        get_db_filename(cache), detect_types=sqlite3.PARSE_DECLTYPES
    )
    cur = con.cursor()
    close_db = True
```
```python
# Close database connection if we opened it
if close_db:
    cur.close()
    con.close()
```
The database connection cleanup at lines 742-744 only happens if the function exits normally. If an exception occurs in the try block (lines 629-712) or elsewhere in the function, the database connection will not be closed, leading to a resource leak. Consider wrapping the database operations in a try-finally block or using a context manager to ensure the connection is always closed when opened by this function.
```python
    while True:
        failures.append(failure_queue.get_nowait())
except queue.Empty:
    pass
```
Adding a sleep(0.01) in the main process loop is good for reducing CPU usage while waiting for worker processes. However, this means the main process will check for failures at most 100 times per second. For very short-running jobs, this could add up to 10ms of latency before failures are detected. This is likely acceptable, but consider documenting this trade-off.
| pass | |
| pass | |
| # Sleep briefly to avoid busy-waiting while worker processes run. | |
| # This limits failure detection to ~100 checks per second (worst-case | |
| # ~10 ms latency for very short-running jobs), which is an intentional | |
| # trade-off for lower CPU usage in the main process. |
```python
if self._current_tar_index.value == tar_index:
    self._current_tar_index.value += 1
```
Potential race condition: The check at line 87 verifies if the current tar index matches, but there's no guarantee that another worker hasn't already incremented it between the check and the increment. While the lock protects the increment operation itself, if multiple workers call this method for the same tar_index concurrently, only the first one will increment the counter, and the others will be silently ignored. Consider adding an else clause to detect when the counter has already been incremented, or verify this is the intended behavior and add a comment explaining why silent failure is acceptable.
Suggested change:

```python
current_index = self._current_tar_index.value
if current_index == tar_index:
    self._current_tar_index.value = current_index + 1
else:
    raise RuntimeError(
        "Attempted to advance tar index for tar {} from {} but "
        "current index is {}. This likely indicates that another "
        "worker has already advanced the tar index.".format(
            workers_curr_tar, tar_index, current_index
        )
    )
```
```diff
-# Wait 0.001 to see if it's the worker's turn.
-self._cv.wait(0.001)
 attempted = True
+time.sleep(0.01)
```
The polling interval was changed from 0.001 seconds (1 millisecond) to 0.01 seconds (10 milliseconds). This is a 10x increase in the polling interval. While this reduces CPU usage, it may increase latency in task switching between workers. For workloads with many small tar files, this could impact performance. Consider adding a comment explaining this trade-off, or making the polling interval configurable if performance becomes an issue.
```python
except queue.Empty:
    pass

manager.shutdown()
```
The manager.shutdown() call may prevent proper cleanup of failures. This is called immediately after collecting failures from processes, but the PrintMonitor and ExtractWorker objects may still hold references to manager-created objects like locks and values. Consider moving the shutdown call after all cleanup is complete, or verify that all worker references to manager objects have been released. Additionally, verify that manager.shutdown() is the correct method - it may need to be manager.join() depending on the multiprocessing.Manager() implementation.
|
Following up on the comparison with #424 in #402 (comment):

Diagrams explaining the changes, from Claude:

> Looking at these diffs, the two PRs share some changes (both replace […]).
>
> The root cause is Python 3.14 switching from […]. This broke two things:
>
> Bug 1 — SQLite cursor. The old code passed […]
>
> Bug 2 — PrintMonitor internals. The old […]
>
> Where they diverge is HPSS authentication. With spawn, worker processes have no Kerberos credentials. PR #402 tries to work around this by reading […]
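The fork-vs-non-fork distinction driving both PRs can be probed directly. Under fork, children inherit the parent's memory (open connections, credentials, module state) for free; under any other start method, worker arguments must be pickled and nothing is inherited. A small check (the printed values depend on platform and Python version, so no expected output is claimed for the first line):

```python
import multiprocessing
import sys

# The default start method varies by platform, and Python 3.14 moved
# away from "fork" as the default on platforms that previously used it.
default_method = multiprocessing.get_start_method()
print(sys.version_info[:2], default_method)

# The methods available on this platform; all of them except "fork"
# require Process arguments to be picklable.
print(sorted(multiprocessing.get_all_start_methods()))
```

This is a handy first diagnostic when a parallel test suite passes on one interpreter and fails on another: if the default method differs, pickling requirements differ too.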
|
Note that a full explainer on parallelism in
Comparing parallelism

Performance data: Using the scripts from #427. Caveats: […]

Results from this PR (#402): Compared to main, PR 402 ran 1.28x longer on extracting an archive containing […]

Results from the Copilot approach (#424): Compared to main, PR 424 ran 1.13x longer on extracting an archive containing […]

Other notes: I remain highly skeptical of the approach in #424 as it relied on prefetching tar files before deploying workers. But I am concerned about that 1.28x slowdown mentioned above for this PR.
```diff
-work_to_workers: List[Tuple[int, int]] = [(0, i) for i in range(num_workers)]
-heapq.heapify(workers_to_tars)
+work_to_workers: List[Tuple[float, int]] = [(0.0, i) for i in range(num_workers)]
+heapq.heapify(work_to_workers)
```
This was apparently a mistake that was fixed in this PR (see #426). That is, it should have always been work_to_workers here, despite that line being added over 6 years ago.
|
@copilot Please address all the comments from the code review agent unless I've specifically noted it as an unnecessary change. |
|
I asked Claude to generate a concise summary explaining this work (see below the divider bar). Notes:
Adding Python 3.14 Support to
|
|  | PR 402 / 428 | PR 424 |
|---|---|---|
| HPSS strategy | Workers still attempt HPSS downloads; reads `args.hpss` as a fallback | Main process pre-downloads all needed tars before workers start |
| Auth concern | Workers may still lack credentials for HPSS | Workers never touch HPSS — tars are already on disk |
| Conceptual approach | Fix the symptoms iteratively | Sidestep the problem entirely |
PR 424's _prefetch_tars_for_parallel() function is the key addition: the main process (which does have HPSS credentials) downloads everything up front, so workers only ever read from local disk.
However, this approach has a significant concern: prefetching is sequential. All tars are downloaded one by one in the main process before a single worker is spawned. At real-world zstash scales — archives in the tens of terabytes — this could mean waiting a very long time before any parallel extraction begins, undermining the whole point of using multiple workers. PR 402/428's approach of having workers handle their own downloads at least keeps the download and extraction pipeline moving concurrently, even if the HPSS auth situation is messier.
Performance
All numbers below are from the --hpss=none --num_workers=2 case (local cache, parallel extraction). Note: these are single runs, so treat them as directional, not definitive.
| Branch | Notable result |
|---|---|
| PR 402 | Up to 1.28× slower than main on small-file-heavy archives (build/ + run/) |
| PR 424 | Up to 1.13× slower than main on archives with large files (init/ + others) |
| PR 428 (402 + code review fixes) | Recovered some of PR 402's regression — init/ + build/ ran in 0.75× the time of PR 402 |
The slowdowns are modest and likely stem from the overhead of manager-based synchronization (socket round-trips instead of shared memory). At real-world scales (10–20 TB), the HPSS transfer time will dominate and these differences should become negligible — though as noted above, PR 424's sequential prefetch may introduce a more serious bottleneck at that scale that these small benchmarks don't capture.
Summary
Python 3.14's switch to spawn-by-default meant workers could no longer silently inherit state from the parent process. The fixes required making all inter-process communication explicitly serializable. The two PRs agree on the core fixes; they differ mainly in how they handle HPSS downloads. PR 424's prefetch-everything-first approach is conceptually clean but raises concerns about sequential bottlenecks at scale. PR 402/428 keeps downloads concurrent with extraction but has a messier relationship with HPSS authentication in spawned workers. Neither approach has been benchmarked at production scale, and that's probably the most important open question before merging.
```python
failures.append(failure_queue.get_nowait())
except queue.Empty:
    pass
time.sleep(0.01)
```
Another note from Claude, after mentioning "We routinely extract archives 10-20 TB in size. The default individual tar size is 256 GB."
With 40-80 tars running for hours each, this loop spins ~360,000 times per hour doing essentially nothing but checking is_alive(). That's harmless on its own, but consider: if a worker crashes without putting anything in failure_queue, the main process has no timeout mechanism. It will spin forever. At 10-20 TB job sizes, a silent worker death — due to OOM, a storage node going down, an HPSS timeout — is not a hypothetical.
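A hedged sketch of what a guard against silent worker death could look like (this is a suggestion, not code from either PR; the `monitor_with_liveness` helper and process interface are hypothetical): after the drain loop, inspect each process's `exitcode` so an OOM-killed worker surfaces as a failure instead of never being noticed.

```python
import queue
import time

def monitor_with_liveness(processes, failure_queue, poll_s: float = 0.01) -> list:
    """Drain the failure queue while workers run, then flag abnormal exits."""
    failures = []
    while any(p.is_alive() for p in processes):
        try:
            while True:
                failures.append(failure_queue.get_nowait())
        except queue.Empty:
            pass
        time.sleep(poll_s)
    for p in processes:
        # A negative exitcode means the worker was killed by a signal
        # (e.g. -9 for SIGKILL from the OOM killer); any nonzero code
        # means it died without reporting through the failure queue.
        if p.exitcode not in (0, None):
            failures.append(f"worker exited abnormally (code {p.exitcode})")
    return failures
```

An overall wall-clock timeout could be layered on top of this for jobs where even a live-but-hung worker should eventually be abandoned.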
Summary

Important: this should be merged to `main` after the production release of zstash v1.5.0.

Objectives: […]

Issue resolution: […]

Select one: This pull request is...
- Small Change