
fix(tools): auto-batch long time series to prevent MPI/BLAS deadlock#11

Closed
yafshar wants to merge 4 commits into main from fix/mpi-deadlock-auto-batch

Conversation

@yafshar
Member

@yafshar yafshar commented Mar 24, 2026

Summary

Running kim-convergence with LAMMPS on 20+ MPI ranks hangs indefinitely when time series exceed ~5 million samples. GDB traces show:

  • Parent process blocked in wait4() (Python multiprocessing)
  • Child processes deadlocked on semaphores/locks inside BLAS FFT operations
  • Triggered at exactly 5.36M steps in production NPT simulations

Root cause: NumPy/BLAS threading conflicts within MPI context when computing autocovariance/periodograms on large arrays.

Solution

Automatic batching (block-averaging) when KIM_CONV_MAX_TSD_LENGTH is exceeded.

Changes

  • tools.py: Add _auto_batch() helper using existing batch() utility
  • Auto-batching applied to:
    • auto_covariance()
    • cross_covariance()
    • modified_periodogram()
  • Environment variable KIM_CONV_MAX_TSD_LENGTH controls limit (default: 5M, set to 0 to disable)

Why this works

  • Preserves mean and low-frequency correlations (what matters for SI estimation)
  • Reduces array size before FFT/correlation calls
  • No data loss: all samples contribute via averaging
  • Zero overhead for short series (<5M)
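The block-averaging described above can be sketched as follows. This is a pure-Python stand-in for illustration only: the actual patch uses the package's `batch()` utility with `func=np.mean`, and the helper name and hard-coded default here are illustrative.

```python
MAX_TSD_LENGTH = 5_000_000  # illustrative default for KIM_CONV_MAX_TSD_LENGTH


def auto_batch(x, limit=MAX_TSD_LENGTH):
    """Block-average x so the result has at most `limit` points."""
    n = len(x)
    if limit <= 0 or n <= limit:
        return list(x)  # short series: zero overhead, returned unchanged
    # Ceiling division picks the smallest block size that fits the limit
    b = (n + limit - 1) // limit
    # Non-overlapping block means; the tail remainder (n % b points) is dropped
    return [sum(x[i:i + b]) / b for i in range(0, n - b + 1, b)]
```

Every retained sample contributes to exactly one block mean, which is why the mean and low-frequency content survive while the array handed to FFT/correlation code shrinks by a factor of `b`.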

Testing

# Default 5M limit
mpirun -np 20 lmp -in in.lammps

# Stricter limit (3M)
export KIM_CONV_MAX_TSD_LENGTH=3000000
mpirun -np 20 lmp -in in.lammps

# Disable (original behavior - may deadlock)
export KIM_CONV_MAX_TSD_LENGTH=0

Related

Summary by CodeRabbit

  • New Features

    • Environment variable KIM_CONV_MAX_TSD_LENGTH to cap processed time-series length (0 disables).
    • Automatic batching of oversized time-series across covariance/correlation/periodogram workflows, with a single user-facing warning when batching alters effective batch sizes.
  • Documentation

    • Updated troubleshooting: targets MPI high-step-count hangs, recommends variable ordering, and keeps subprocess (multiprocessing spawn) fallback as an escalation.

Time series >5M samples trigger deadlocks in FFT/correlation routines when
running under MPI (observed at ~5.36M steps with 20 MPI ranks). Root cause
is BLAS threading inside MPI context.

We automatically batch (block-average) arrays exceeding the safe length
using the batch() utility. This preserves the mean and low-frequency
statistics required for SI estimation while avoiding O(N^2) memory/thread
explosion.

- Add KIM_CONV_MAX_TSD_LENGTH env var (default 5M, set 0 to disable)
- Apply auto-batching in auto_covariance, cross_covariance, and
  modified_periodogram before heavy computations
- Uses simple mean-reduction; remainder points discarded per standard
  batch() behavior

Relates-to: refactor/lammps-example-mpi-support
@coderabbitai
Contributor

coderabbitai bot commented Mar 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds environment-controlled auto-batching for long 1D time-series in kim_convergence/stats/tools.py: parses KIM_CONV_MAX_TSD_LENGTH, tracks last warned batch size, implements _auto_batch(x) to block-average long inputs, and applies batching in several correlation/covariance/periodogram routines; documentation updated accordingly.

Changes

Cohort / File(s) Summary
Auto-batching core
kim_convergence/stats/tools.py
Added parsing/validation of KIM_CONV_MAX_TSD_LENGTH, module state _AUTO_BATCH_LAST_WARNED_SIZE, import of cr_warning and batch, and new helper _auto_batch(x) that computes a ceiling batch_size, issues size-change warnings once, and returns batch(..., func=np.mean) when applied.
Function integrations & validation
kim_convergence/stats/tools.py
Integrated _auto_batch into auto_covariance, cross_covariance, auto_correlate, cross_correlate, modified_periodogram, and periodogram so inputs are reduced before mean-centering/covariance/correlation/periodogram work. periodogram now validates x before batching and uses x.size (post-batch) for size calculations.
Documentation
doc/troubleshooting.rst
Rewrote the “Deadlock or hang inside simulations” section to target MPI/high-step-count hangs, documented KIM_CONV_MAX_TSD_LENGTH auto-batching behavior (including 0 to disable and examples), and repositioned KIM_CONV_FORCE_SUBPROC=1 as an escalation that uses multiprocessing spawn.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibbled long time-series down to size,
A mean in every burrow, tidy and wise,
Env sets the limit — batch or stay,
Warnings whisper only when sizes sway,
Hopping on, the FFT dreams, quiet and spry.

🚥 Pre-merge checks | ✅ 3 passed

  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed (the title clearly and specifically identifies the primary change: adding auto-batching for long time series to prevent MPI/BLAS deadlock)
  • Docstring Coverage: ✅ Passed (coverage is 100.00%, above the required 80.00% threshold)



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
kim_convergence/stats/tools.py (2)

131-132: Consider adding error handling for invalid environment variable values.

If KIM_CONV_MAX_TSD_LENGTH is set to a non-integer string (e.g., "abc"), int() will raise an unhandled ValueError at module import time, which may be confusing for users. A graceful fallback with a warning would improve robustness.

♻️ Proposed fix with error handling
 # Auto-batching configuration
-_MAX_TSD_LENGTH = int(os.environ.get("KIM_CONV_MAX_TSD_LENGTH", "5000000"))
+_MAX_TSD_LENGTH_DEFAULT = 5000000
+try:
+    _MAX_TSD_LENGTH = int(os.environ.get("KIM_CONV_MAX_TSD_LENGTH", str(_MAX_TSD_LENGTH_DEFAULT)))
+except ValueError:
+    _MAX_TSD_LENGTH = _MAX_TSD_LENGTH_DEFAULT
+    # Warning will be issued on first use since cr_warning may not be safe at import time
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kim_convergence/stats/tools.py` around lines 131 - 132, The module-level
constant _MAX_TSD_LENGTH is currently set via
int(os.environ.get("KIM_CONV_MAX_TSD_LENGTH", "5000000")) and will raise a
ValueError at import if the env var is non-numeric; update the assignment to
catch conversion errors: read the env var using os.environ.get, try to int() it
inside a try/except (catch ValueError/TypeError), log or warn about the invalid
value and fall back to the default 5_000_000, and ensure _MAX_TSD_LENGTH is
always assigned a valid int; reference the _MAX_TSD_LENGTH assignment in
tools.py and any module-level import-time initialization when making the change.

227-230: Warning may be emitted repeatedly in iterative workflows.

If a user processes multiple large time series in a loop (or calls these functions repeatedly), the warning will be emitted for each call, potentially flooding logs. Consider adding a mechanism to suppress repeated warnings or emit only once per session.

♻️ Optional: Add a "warned once" flag
+_auto_batch_warned = False
+
 def _auto_batch(x: np.ndarray) -> np.ndarray:
+    global _auto_batch_warned
     ...
     if _MAX_TSD_LENGTH <= 0 or x.size <= _MAX_TSD_LENGTH:
         return x
 
     batch_size = (x.size + _MAX_TSD_LENGTH - 1) // _MAX_TSD_LENGTH
 
-    cr_warning(
-        f"Time series length ({x.size}) exceeds safe limit "
-        f"({_MAX_TSD_LENGTH}). Auto-batching with batch_size={batch_size}."
-    )
+    if not _auto_batch_warned:
+        cr_warning(
+            f"Time series length ({x.size}) exceeds safe limit "
+            f"({_MAX_TSD_LENGTH}). Auto-batching with batch_size={batch_size}."
+        )
+        _auto_batch_warned = True
 
     return batch(x, batch_size=batch_size, func=np.mean)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kim_convergence/stats/tools.py` around lines 227 - 230, The cr_warning call
inside the large time-series handling emits the same message every time (when
x.size > _MAX_TSD_LENGTH); to avoid log flooding, guard that call with a
module-level "warned once" flag or set (e.g. _TS_LENGTH_WARNING_EMITTED) and
only call cr_warning the first time, then mark the flag true; locate the site
referencing _MAX_TSD_LENGTH, x.size, batch_size and replace the unconditional
cr_warning(...) with an if not _TS_LENGTH_WARNING_EMITTED: cr_warning(...) and
set _TS_LENGTH_WARNING_EMITTED = True (or alternatively switch to Python's
warnings.warn with once filter).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9701348f-9769-4db6-9943-894cc44698ee

📥 Commits

Reviewing files that changed from the base of the PR and between 276effd and 6644d5b.

📒 Files selected for processing (1)
  • kim_convergence/stats/tools.py

@yafshar
Member Author

yafshar commented Mar 24, 2026

@ilia-nikiforov-umn Another fix is ready for testing. Can you try it on your production LAMMPS case that was hanging at 5M steps?

Quick test:

# Fetch latest
git fetch origin
# Start from your LAMMPS example branch
git checkout refactor/lammps-example-mpi-support
# Create test branch with the fix merged on top
git checkout -b test-lammps-mpi-fix
git merge origin/fix/mpi-deadlock-auto-batch
# Install and run
pip install -e .
mpirun -np 20 lmp ...

If it works: You'll see it pass 5.36M steps without hanging. You might see warnings like:

Time series length (5360000) exceeds safe limit (5000000). Auto-batching with batch_size=2.

That's expected; it means the protection is working.
If it still hangs:

export KIM_CONV_MAX_TSD_LENGTH=3000000  # Try stricter limit
mpirun -np 20 lmp ...

Let me know if it completes the full run or if you still see deadlocks.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
kim_convergence/stats/tools.py (1)

247-252: Warning deduplication may re-warn when batch sizes alternate.

The current logic warns when batch_size != _AUTO_BATCH_LAST_WARNED_SIZE. If processing alternates between series requiring different batch sizes (e.g., batch_size=2, then 3, then 2 again), the warning will repeat for the same batch_size values.

Consider tracking a set of already-warned sizes if stricter deduplication is desired. However, the current approach is reasonable for the primary use case of avoiding log flooding in loops with similar-sized data.

♻️ Optional: Use a set for stricter deduplication
-# Module-level tracking
-_AUTO_BATCH_LAST_WARNED_SIZE: int = 0
+# Module-level tracking
+_AUTO_BATCH_WARNED_SIZES: set[int] = set()


 def _auto_batch(x: np.ndarray) -> np.ndarray:
     ...
-    global _AUTO_BATCH_LAST_WARNED_SIZE
+    global _AUTO_BATCH_WARNED_SIZES
     ...
-    # Warn only when batch_size changes to avoid log flooding
-    if batch_size != _AUTO_BATCH_LAST_WARNED_SIZE:
+    # Warn only once per batch_size to avoid log flooding
+    if batch_size not in _AUTO_BATCH_WARNED_SIZES:
         cr_warning(
             f"Time series length ({x.size}) exceeds safe limit "
             f"({_MAX_TSD_LENGTH}). Auto-batching with batch_size={batch_size}."
         )
-        _AUTO_BATCH_LAST_WARNED_SIZE = batch_size
+        _AUTO_BATCH_WARNED_SIZES.add(batch_size)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kim_convergence/stats/tools.py` around lines 247 - 252, The warning
deduplication currently compares batch_size against _AUTO_BATCH_LAST_WARNED_SIZE
which will re-warn if sizes alternate; change this to track a set of warned
sizes (e.g., _AUTO_BATCH_WARNED_SIZES) and only call cr_warning when batch_size
is not in that set, then add batch_size to the set; update any initialization of
_AUTO_BATCH_LAST_WARNED_SIZE and uses around the block referencing batch_size,
cr_warning, and _MAX_TSD_LENGTH to use the new set-based deduplication to avoid
repeated warnings for previously seen sizes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 13d10682-0192-49de-b930-c971b11f102f

📥 Commits

Reviewing files that changed from the base of the PR and between 6644d5b and d437fd5.

📒 Files selected for processing (1)
  • kim_convergence/stats/tools.py

@yafshar force-pushed the fix/mpi-deadlock-auto-batch branch from d437fd5 to 84bd344 on March 25, 2026 12:07
@yafshar force-pushed the fix/mpi-deadlock-auto-batch branch from 84bd344 to 80c552a on March 25, 2026 12:09

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@kim_convergence/stats/tools.py`:
- Around line 461-462: Batching inside helper layers (_auto_batch called where x
is used) causes downstream functions to use original lengths/stats and produce
inconsistent normalization/lag limits; move batching to the public entry points
(auto_correlate, cross_correlate, periodogram) or else ensure the batched arrays
and their effective length/std are propagated through all downstream helpers.
Specifically, remove/avoid calling _auto_batch inline in lower-level helpers and
instead call _auto_batch at the start of auto_correlate, cross_correlate, and
periodogram (or modify their signatures to accept both the processed array and
its effective length/std), then compute nlags, normalization (std(x)*std(y)),
and Nyquist/rescaling using the batched/propagated values so all size-sensitive
logic uses the effective series size.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0423b226-de87-4c57-9ea5-5d39eea4d1d1

📥 Commits

Reviewing files that changed from the base of the PR and between d437fd5 and 80c552a.

📒 Files selected for processing (2)
  • doc/troubleshooting.rst
  • kim_convergence/stats/tools.py

Comment thread kim_convergence/stats/tools.py
Batching in low-level primitives (auto_covariance, cross_covariance) broke
downstream length accounting:

- auto_correlate() capped nlags against unbatched x.size
- cross_correlate() normalized with original std(x)*std(y) while cov used batched arrays
- periodogram() used original length for Nyquist scaling after FFT on batched data
- cross_correlate(x, x.copy())[0] could return < 1 due to size mismatch

Fix: Move batching to public entry points (auto_correlate, cross_correlate,
periodogram, modified_periodogram) so all size-sensitive logic (nlags capping,
std normalization, Nyquist handling) uses the consistent batched length.

Low-level primitives (auto_covariance, cross_covariance) no longer auto-batch;
they assume inputs are already managed by callers.

Also applies to: modified_periodogram batching location.
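The normalization breakage this thread describes can be seen in a toy example (illustrative values, plain Python in place of NumPy): once the covariance is computed on a block-averaged series but the normalization still uses the standard deviation of the original series, the two scales no longer match.

```python
import statistics

# An alternating series: large variance before batching, zero after.
x = [0.0, 10.0, 0.0, 10.0, 0.0, 10.0, 0.0, 10.0]

# Block-average with batch size 2, as batching inside the primitive would do.
xb = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]

var_orig = statistics.pvariance(x)      # variance of the original series
var_batched = statistics.pvariance(xb)  # variance of the batched series

# Normalizing a covariance computed on xb with the std of x mixes these two
# scales, so the lag-0 "autocorrelation" would no longer equal 1.
print(var_orig, var_batched)
```

Moving batching to the public entry points means every size-sensitive quantity (nlags cap, std normalization, Nyquist scaling) is derived from the same, consistently batched array.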
@ilia-nikiforov-umn
Member

@yafshar it is still hanging. I merged this branch up to 6644d5b into the latest refactor/lammps-example-mpi-support branch. I tried both the default setting and setting export KIM_CONV_MAX_TSD_LENGTH=3000000

However, when I set the environment variable, I noticed that although the warning messages had the correct value:
WARNING(@_auto_batch): Time series length (5350001) exceeds safe limit (3000000). Auto-batching with batch_size=2.

Setting the environment variable did not change the point in the simulation at which they started appearing. The warnings only started getting printed after step 5360000, when I would have expected them to start much earlier.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
kim_convergence/stats/tools.py (1)

222-255: ⚠️ Potential issue | 🟠 Major

Expose the effective batching metadata.

Once _auto_batch() averages blocks of size b, auto_correlate()[k] / cross_correlate()[k] are at original lag k*b, and the periodogram routines operate on a series sampled every b steps. kim_convergence/batch.py:92-122 also truncates the tail remainder, so the effective processed length changes as well. The only place batch_size survives here is the warning text, which is deduped and not part of the API. Please surface batch_size / processed length (or return the axis alongside the values), otherwise callers cannot recover the correct lag or frequency coordinates once batching triggers.

Also applies to: 540-540, 604-605, 680-682, 750-753

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kim_convergence/stats/tools.py` around lines 222 - 255, The _auto_batch
function currently drops batch_size metadata (only logging it) which prevents
callers like auto_correlate, cross_correlate and periodogram from reconstructing
true lag/frequency axes; update _auto_batch (and other call sites noted) to
return both the processed array and a small metadata tuple/object containing
batch_size and processed_length (or alternately return (values, axis_scale)
where axis_scale == batch_size and processed_length == values.size), update
callers (auto_correlate, cross_correlate, periodogram) to accept the (values,
meta) return and use meta.batch_size (or meta.axis_scale) to scale
lags/frequencies and account for dropped tail points (batch truncation) so
coordinates map back to the original series; keep backward compatibility by
providing a single-value return when no batching occurred (or document the new
return shape) and preserve use of _AUTO_BATCH_LAST_WARNED_SIZE for logging.
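One possible shape for the metadata this comment asks for, sketched in plain Python with hypothetical names: return the effective stride alongside the processed array so callers can map batched lag/frequency bins back to original units.

```python
def auto_batch_with_meta(x, limit):
    """Block-average x to at most `limit` points; also return the stride."""
    n = len(x)
    if limit <= 0 or n <= limit:
        return list(x), 1  # stride 1: indices already in original units
    b = (n + limit - 1) // limit  # ceiling division
    xb = [sum(x[i:i + b]) / b for i in range(0, n - b + 1, b)]
    return xb, b


xb, stride = auto_batch_with_meta(list(range(12)), limit=4)
# Lag k on the batched series corresponds to lag k * stride on the original.
original_lags = [k * stride for k in range(len(xb))]
```

With the stride in hand, a caller can rescale a correlation lag axis or a periodogram frequency axis back to the original sampling grid instead of silently reading batched bins as original ones.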
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@kim_convergence/stats/tools.py`:
- Around line 39-50: The docstring claims KIM_CONV_MAX_TSD_LENGTH applies to
autocovariance and cross-covariance, but auto_covariance() and
cross_covariance() do not call _auto_batch() and currently pass large arrays
straight into _fft_corr(); update the implementation so auto_covariance(...) and
cross_covariance(...) invoke the same batching routine (_auto_batch) used by the
correlator/periodogram entry points before calling _fft_corr(), ensuring inputs
over KIM_CONV_MAX_TSD_LENGTH are block-averaged, or alternatively narrow the
docstring to only mention the correlator/periodogram entry points if you prefer
not to change behavior. Ensure you reference the functions auto_covariance,
cross_covariance, _auto_batch, and _fft_corr when making the change.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 23627af3-f22a-4b4c-8385-6032176de5a7

📥 Commits

Reviewing files that changed from the base of the PR and between 80c552a and 9d940be.

📒 Files selected for processing (1)
  • kim_convergence/stats/tools.py

Comment thread kim_convergence/stats/tools.py
Add _auto_batch() protection to auto_covariance() and cross_covariance()
in addition to the existing entry points (auto_correlate, cross_correlate,
modified_periodogram, periodogram).

This ensures direct callers of the covariance primitives (e.g., statistical
inefficiency calculations) are also protected from >5M MPI/BLAS deadlocks.

Idempotent batching prevents double-batching when correlators internally
call covariance functions—the second check is a cheap no-op if the array
is already within limits.

All public statistical APIs now safely handle arbitrarily long time series
while maintaining consistent normalization in downstream calculations.
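The idempotency claim above can be checked with the same block-averaging sketch used throughout this thread (hypothetical helper name, plain Python): the first pass shrinks the array below the limit, so the guard in any nested call returns immediately.

```python
def auto_batch(x, limit):
    """Block-average x to at most `limit` points; no-op when already short."""
    n = len(x)
    if limit <= 0 or n <= limit:
        return list(x)  # the cheap no-op guard that makes nested calls safe
    b = (n + limit - 1) // limit  # ceiling division
    return [sum(x[i:i + b]) / b for i in range(0, n - b + 1, b)]


once = auto_batch(list(range(100)), limit=10)  # shrinks to 10 points
twice = auto_batch(once, limit=10)             # already within limit
assert twice == once                           # second pass changed nothing
```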
@yafshar
Member Author

yafshar commented Mar 25, 2026

@yafshar it is still hanging. I merged this branch up to 6644d5b into the latest refactor/lammps-example-mpi-support branch. I tried both the default setting and setting export KIM_CONV_MAX_TSD_LENGTH=3000000

However, when I set the environment variable, I noticed that although the warning messages had the correct value: WARNING(@_auto_batch): Time series length (5350001) exceeds safe limit (3000000). Auto-batching with batch_size=2.

Setting the environment variable did not change the point in the simulation at which they started appearing. They warnings only started getting printed after step 5360000, when I would have expected them to start printing much earlier.

Can you try again with the latest commit?


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
kim_convergence/stats/tools.py (1)

149-150: The warning de-dup key is too coarse.

_AUTO_BATCH_LAST_WARNED_SIZE is process-global and only keyed on batch_size. After one batch_size=2 warning, every later input that also batches by 2 will go silent, even if it comes from a different API or a different original size. That is broader than the nested duplicate you are trying to suppress (auto_correlate → auto_covariance, periodogram → modified_periodogram) and makes the trigger point much harder to verify in MPI repros. Prefer per-call-chain suppression, or key the suppression on more than just batch_size.

Also applies to: 247-253

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kim_convergence/stats/tools.py` around lines 149 - 150,
_AUTO_BATCH_LAST_WARNED_SIZE is too coarse; change the suppression to be keyed
per call-chain (or include more context) instead of a single process-global int.
Replace the int with a mapping keyed by a tuple that includes batch_size plus an
identifier for the call-chain/source (for example function names like
auto_correlate, auto_covariance, periodogram, modified_periodogram or a
generated call-chain token), and update the warning logic to consult/update that
map; alternatively use a ContextVar or thread-local keyed by call-chain to
achieve per-call-chain suppression. Apply the same change where the same pattern
appears (lines around the other occurrence noted).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@kim_convergence/stats/tools.py`:
- Around line 546-550: The call to _auto_batch is happening before validation,
which allows oversized or malformed inputs to reach batch() and bypass CRError
checks; change the order in auto_covariance and cross_covariance so that you
first run _validate_and_prepare_input (or the existing validation routine used
elsewhere) on x (and y where applicable) to enforce
dimensionality/shape/finiteness, then convert to np.asarray(dtype=np.float64) if
needed, and only then call _auto_batch on the already-validated arrays; apply
the same reorder to the other occurrence around lines 610-615 to ensure
validation always precedes batching.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 163c3119-35e8-46fb-8fa1-3d6162be5d17

📥 Commits

Reviewing files that changed from the base of the PR and between 9d940be and 5068290.

📒 Files selected for processing (1)
  • kim_convergence/stats/tools.py

Comment on lines +223 to +255
def _auto_batch(x: np.ndarray) -> np.ndarray:
    r"""Auto-batch time series data if it exceeds safe length limit.

    Uses non-overlapping block averaging to reduce array length while
    preserving mean and low-frequency statistics needed for convergence
    analysis. Remainder points at the end are discarded (standard batch
    behavior).

    Args:
        x (1darray): Input time series data.

    Returns:
        1darray: Original array if length <= limit, otherwise batched
            (down-sampled) array with length <= _MAX_TSD_LENGTH.
    """
    global _AUTO_BATCH_LAST_WARNED_SIZE

    if _MAX_TSD_LENGTH <= 0 or x.size <= _MAX_TSD_LENGTH:
        return x

    # Calculate batch size to ensure result fits within limit
    # Ceiling division: (n + limit - 1) // limit
    batch_size = (x.size + _MAX_TSD_LENGTH - 1) // _MAX_TSD_LENGTH

    # Warn only when batch_size changes to avoid log flooding
    if batch_size != _AUTO_BATCH_LAST_WARNED_SIZE:
        cr_warning(
            f"Time series length ({x.size}) exceeds safe limit "
            f"({_MAX_TSD_LENGTH}). Auto-batching with batch_size={batch_size}."
        )
        _AUTO_BATCH_LAST_WARNED_SIZE = batch_size

    return batch(x, batch_size=batch_size, func=np.mean)


⚠️ Potential issue | 🟠 Major

Keep auto-batching out of the exact public stats APIs, or expose the stride.

Once _auto_batch() runs, these functions are no longer computing the covariance/correlation/periodogram of the original series; they are computing it for the block-averaged series. That means lag/frequency bin k now corresponds to k * batch_size on the original sampling grid, but none of these exported APIs return batch_size or any equivalent stride. Callers cannot recover the original units, so this is a semantic break for auto_covariance(), cross_covariance(), modified_periodogram(), and periodogram(), not just an implementation detail. Please either keep batching inside convergence-specific internal paths, or propagate the effective stride alongside the result.

Also applies to: 462-463, 511-513, 690-691, 762-763


Comment on lines 546 to +550
    x = np.asarray(x, dtype=np.float64)

    # Auto-batch if series exceeds safe length
    x = _auto_batch(x)

⚠️ Potential issue | 🟡 Minor

Validate before these _auto_batch() calls.

auto_covariance() / cross_covariance() already batch after _validate_and_prepare_input(). Doing it again here means oversized malformed inputs hit batch() before the existing CRError checks for dimensionality, shape, and finiteness, and the outer call is otherwise redundant. Validate first here, then batch the checked arrays.

Possible direction
 def auto_correlate(
     x: Union[np.ndarray, list[float]], *, nlags: Optional[int] = None, fft: bool = False
 ) -> np.ndarray:
-    x = np.asarray(x, dtype=np.float64)
+    x, _ = _validate_and_prepare_input(x)

     # Auto-batch if series exceeds safe length
     x = _auto_batch(x)

 def cross_correlate(
     x: Union[np.ndarray, list[float]],
     y: Union[np.ndarray, list[float], None],
@@
-    x = np.asarray(x, dtype=np.float64)
-    y = np.asarray(y, dtype=np.float64)
+    x, y = _validate_and_prepare_input(x, y)
+    assert isinstance(y, np.ndarray)

     # Auto-batch if series exceeds safe length
     x = _auto_batch(x)
     y = _auto_batch(y)

Also applies to: 610-615


@yafshar yafshar closed this Mar 25, 2026
@yafshar (Member, Author) commented Mar 25, 2026

Closing this PR. Auto-batching inside the public statistical APIs (auto_covariance, cross_covariance, modified_periodogram, periodogram) is a semantic breaking change: once batched, lag k in the result corresponds to physical lag k × batch_size in the original series, but the API gives no way to recover that stride. This silently breaks the contract for any direct callers of these functions.
Will implement the deadlock protection at the trajectory acquisition layer (_trajectory.py) or in the internal UCL estimators instead, where we can safely decimate before the statistical analysis without breaking public API semantics. The hardened env var parsing may be salvaged in a separate PR.
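The trajectory-layer approach could carry the stride alongside the data, so downstream estimators can always map results back to original time units. A minimal hypothetical sketch (the `Trajectory` class, `capped` method, and strided decimation are illustrative assumptions, not the planned `_trajectory.py` design):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Trajectory:
    """Sketch: keep the decimation stride with the samples so convergence
    estimators can convert lags/frequencies back to original units."""
    data: np.ndarray
    stride: int = 1

    def capped(self, limit: int) -> "Trajectory":
        """Return a decimated copy whose length is at most `limit`."""
        if limit <= 0 or self.data.size <= limit:
            return self
        step = (self.data.size + limit - 1) // limit  # ceiling division
        return Trajectory(self.data[::step], self.stride * step)


traj = Trajectory(np.arange(10.0)).capped(4)
# step = ceil(10/4) = 3 -> data [0.0, 3.0, 6.0, 9.0], stride 3
```

Because the cap happens before any statistics are computed, the public `auto_covariance()` / `periodogram()` contracts stay untouched; block averaging (as in this PR) could be substituted for plain decimation inside `capped` without changing the interface.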

