fix(tools): auto-batch long time series to prevent MPI/BLAS deadlock #11
Conversation
Time series >5M samples trigger deadlocks in FFT/correlation routines when running under MPI (observed at ~5.36M steps with 20 MPI ranks). The root cause is BLAS threading inside the MPI context. We automatically batch (block-average) arrays exceeding the safe length using the batch() utility. This preserves the mean and low-frequency statistics required for SI estimation while avoiding the O(N^2) memory/thread explosion.

- Add KIM_CONV_MAX_TSD_LENGTH env var (default 5M, set 0 to disable)
- Apply auto-batching in auto_covariance, cross_covariance, and modified_periodogram before heavy computations
- Uses simple mean-reduction; remainder points are discarded per standard batch() behavior

Relates-to: refactor/lammps-example-mpi-support
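For intuition, a toy demonstration of the block-averaging step (the tiny numbers and variable names below are illustrative only; the real helper is `_auto_batch()` in `kim_convergence/stats/tools.py` and the real limit defaults to 5,000,000):

```python
import numpy as np

max_len = 4                        # stand-in for KIM_CONV_MAX_TSD_LENGTH
x = np.arange(10, dtype=float)     # a "long" series of 10 samples

# Ceiling division picks the smallest batch size that keeps the result under the limit.
batch_size = (x.size + max_len - 1) // max_len          # -> 3
n_batches = x.size // batch_size                        # -> 3; the remainder is dropped
batched = x[: n_batches * batch_size].reshape(n_batches, batch_size).mean(axis=1)

print(batched)                     # [1. 4. 7.]  (non-overlapping block means)
print(x.mean(), batched.mean())    # 4.5 vs 4.0: the dropped tail sample shifts the mean slightly
```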
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough: Adds environment-controlled auto-batching for long 1D time series in kim_convergence/stats/tools.py.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
🧹 Nitpick comments (2)
kim_convergence/stats/tools.py (2)
131-132: Consider adding error handling for invalid environment variable values.

If KIM_CONV_MAX_TSD_LENGTH is set to a non-integer string (e.g., "abc"), int() will raise an unhandled ValueError at module import time, which may be confusing for users. A graceful fallback with a warning would improve robustness.

♻️ Proposed fix with error handling
```diff
 # Auto-batching configuration
-_MAX_TSD_LENGTH = int(os.environ.get("KIM_CONV_MAX_TSD_LENGTH", "5000000"))
+_MAX_TSD_LENGTH_DEFAULT = 5000000
+try:
+    _MAX_TSD_LENGTH = int(os.environ.get("KIM_CONV_MAX_TSD_LENGTH", str(_MAX_TSD_LENGTH_DEFAULT)))
+except ValueError:
+    _MAX_TSD_LENGTH = _MAX_TSD_LENGTH_DEFAULT
+    # Warning will be issued on first use since cr_warning may not be safe at import time
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kim_convergence/stats/tools.py` around lines 131 - 132, The module-level constant _MAX_TSD_LENGTH is currently set via int(os.environ.get("KIM_CONV_MAX_TSD_LENGTH", "5000000")) and will raise a ValueError at import if the env var is non-numeric; update the assignment to catch conversion errors: read the env var using os.environ.get, try to int() it inside a try/except (catch ValueError/TypeError), log or warn about the invalid value and fall back to the default 5_000_000, and ensure _MAX_TSD_LENGTH is always assigned a valid int; reference the _MAX_TSD_LENGTH assignment in tools.py and any module-level import-time initialization when making the change.
227-230: Warning may be emitted repeatedly in iterative workflows.

If a user processes multiple large time series in a loop (or calls these functions repeatedly), the warning will be emitted for each call, potentially flooding logs. Consider adding a mechanism to suppress repeated warnings or emit only once per session.
♻️ Optional: Add a "warned once" flag
```diff
+_auto_batch_warned = False
+
 def _auto_batch(x: np.ndarray) -> np.ndarray:
+    global _auto_batch_warned
     ...
     if _MAX_TSD_LENGTH <= 0 or x.size <= _MAX_TSD_LENGTH:
         return x

     batch_size = (x.size + _MAX_TSD_LENGTH - 1) // _MAX_TSD_LENGTH
-    cr_warning(
-        f"Time series length ({x.size}) exceeds safe limit "
-        f"({_MAX_TSD_LENGTH}). Auto-batching with batch_size={batch_size}."
-    )
+    if not _auto_batch_warned:
+        cr_warning(
+            f"Time series length ({x.size}) exceeds safe limit "
+            f"({_MAX_TSD_LENGTH}). Auto-batching with batch_size={batch_size}."
+        )
+        _auto_batch_warned = True
     return batch(x, batch_size=batch_size, func=np.mean)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kim_convergence/stats/tools.py` around lines 227 - 230, The cr_warning call inside the large time-series handling emits the same message every time (when x.size > _MAX_TSD_LENGTH); to avoid log flooding, guard that call with a module-level "warned once" flag or set (e.g. _TS_LENGTH_WARNING_EMITTED) and only call cr_warning the first time, then mark the flag true; locate the site referencing _MAX_TSD_LENGTH, x.size, batch_size and replace the unconditional cr_warning(...) with an if not _TS_LENGTH_WARNING_EMITTED: cr_warning(...) and set _TS_LENGTH_WARNING_EMITTED = True (or alternatively switch to Python's warnings.warn with once filter).
📒 Files selected for processing (1)
kim_convergence/stats/tools.py
@ilia-nikiforov-umn Another fix is ready for testing. Can you try it on your production LAMMPS case that was hanging at 5M steps?

Quick test:

```bash
# Fetch latest
git fetch origin

# Start from your LAMMPS example branch
git checkout refactor/lammps-example-mpi-support

# Create test branch with the fix merged on top
git checkout -b test-lammps-mpi-fix
git merge origin/fix/mpi-deadlock-auto-batch

# Install and run
pip install -e .
mpirun -np 20 lmp ...
```

If it works: you'll see it pass 5.36M steps without hanging. You might see warnings like:

```
Time series length (5360000) exceeds safe limit (5000000). Auto-batching with batch_size=2.
```

That's expected and the protection is working.

```bash
export KIM_CONV_MAX_TSD_LENGTH=3000000  # Try stricter limit
mpirun -np 20 lmp ...
```

Let me know if it completes the full run or if you still see deadlocks.
🧹 Nitpick comments (1)
kim_convergence/stats/tools.py (1)
247-252: Warning deduplication may re-warn when batch sizes alternate.

The current logic warns when batch_size != _AUTO_BATCH_LAST_WARNED_SIZE. If processing alternates between series requiring different batch sizes (e.g., batch_size=2, then 3, then 2 again), the warning will repeat for the same batch_size values.

Consider tracking a set of already-warned sizes if stricter deduplication is desired. However, the current approach is reasonable for the primary use case of avoiding log flooding in loops with similar-sized data.
♻️ Optional: Use a set for stricter deduplication
```diff
-# Module-level tracking
-_AUTO_BATCH_LAST_WARNED_SIZE: int = 0
+# Module-level tracking
+_AUTO_BATCH_WARNED_SIZES: set[int] = set()

 def _auto_batch(x: np.ndarray) -> np.ndarray:
     ...
-    global _AUTO_BATCH_LAST_WARNED_SIZE
+    global _AUTO_BATCH_WARNED_SIZES
     ...
-    # Warn only when batch_size changes to avoid log flooding
-    if batch_size != _AUTO_BATCH_LAST_WARNED_SIZE:
+    # Warn only once per batch_size to avoid log flooding
+    if batch_size not in _AUTO_BATCH_WARNED_SIZES:
         cr_warning(
             f"Time series length ({x.size}) exceeds safe limit "
             f"({_MAX_TSD_LENGTH}). Auto-batching with batch_size={batch_size}."
         )
-        _AUTO_BATCH_LAST_WARNED_SIZE = batch_size
+        _AUTO_BATCH_WARNED_SIZES.add(batch_size)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kim_convergence/stats/tools.py` around lines 247 - 252, The warning deduplication currently compares batch_size against _AUTO_BATCH_LAST_WARNED_SIZE which will re-warn if sizes alternate; change this to track a set of warned sizes (e.g., _AUTO_BATCH_WARNED_SIZES) and only call cr_warning when batch_size is not in that set, then add batch_size to the set; update any initialization of _AUTO_BATCH_LAST_WARNED_SIZE and uses around the block referencing batch_size, cr_warning, and _MAX_TSD_LENGTH to use the new set-based deduplication to avoid repeated warnings for previously seen sizes.
📒 Files selected for processing (1)
kim_convergence/stats/tools.py
d437fd5 to 84bd344 (Compare)
84bd344 to 80c552a (Compare)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@kim_convergence/stats/tools.py`:
- Around line 461-462: Batching inside helper layers (_auto_batch called where x
is used) causes downstream functions to use original lengths/stats and produce
inconsistent normalization/lag limits; move batching to the public entry points
(auto_correlate, cross_correlate, periodogram) or else ensure the batched arrays
and their effective length/std are propagated through all downstream helpers.
Specifically, remove/avoid calling _auto_batch inline in lower-level helpers and
instead call _auto_batch at the start of auto_correlate, cross_correlate, and
periodogram (or modify their signatures to accept both the processed array and
its effective length/std), then compute nlags, normalization (std(x)*std(y)),
and Nyquist/rescaling using the batched/propagated values so all size-sensitive
logic uses the effective series size.
📒 Files selected for processing (2)
doc/troubleshooting.rst
kim_convergence/stats/tools.py
Batching in low-level primitives (auto_covariance, cross_covariance) broke downstream length accounting:

- auto_correlate() capped nlags against unbatched x.size
- cross_correlate() normalized with original std(x)*std(y) while cov used batched arrays
- periodogram() used original length for Nyquist scaling after FFT on batched data
- cross_correlate(x, x.copy())[0] could return < 1 due to size mismatch

Fix: Move batching to public entry points (auto_correlate, cross_correlate, periodogram, modified_periodogram) so all size-sensitive logic (nlags capping, std normalization, Nyquist handling) uses the consistent batched length. Low-level primitives (auto_covariance, cross_covariance) no longer auto-batch; they assume inputs are already managed by callers.

Also applies to: modified_periodogram batching location.
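To make the normalization mismatch concrete, a toy sketch (illustrative names, not the library's functions): the lag-0 autocorrelation should be exactly 1, but dividing a batched covariance by the original variance gives a value below 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)

# Suppose a low-level routine silently block-averages the series (batch_size=2)...
batched = x[: 500 * 2].reshape(500, 2).mean(axis=1)
cov0 = np.mean((batched - batched.mean()) ** 2)   # lag-0 covariance of the batched data

# ...while the caller still normalizes with the ORIGINAL series' variance.
wrong_corr0 = cov0 / np.var(x)        # < 1, although autocorrelation at lag 0 must be 1
right_corr0 = cov0 / np.var(batched)  # exactly 1 when both sides use the batched series

print(wrong_corr0, right_corr0)
```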
@yafshar it is still hanging. I merged this branch up to 6644d5b into the latest. However, when I set the environment variable, I noticed that although the warning messages showed the correct value, setting the environment variable did not change the point in the simulation at which they started appearing. The warnings only started getting printed after step 5360000, when I would have expected them to start printing much earlier.
Actionable comments posted: 1
♻️ Duplicate comments (1)
kim_convergence/stats/tools.py (1)
222-255: ⚠️ Potential issue | 🟠 Major: Expose the effective batching metadata.

Once _auto_batch() averages blocks of size b, auto_correlate()[k] / cross_correlate()[k] are at original lag k*b, and the periodogram routines operate on a series sampled every b steps. kim_convergence/batch.py:92-122 also truncates the tail remainder, so the effective processed length changes as well. The only place batch_size survives here is the warning text, which is deduped and not part of the API. Please surface batch_size / processed length (or return the axis alongside the values), otherwise callers cannot recover the correct lag or frequency coordinates once batching triggers.

Also applies to: 540-540, 604-605, 680-682, 750-753
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kim_convergence/stats/tools.py` around lines 222 - 255, The _auto_batch function currently drops batch_size metadata (only logging it) which prevents callers like auto_correlate, cross_correlate and periodogram from reconstructing true lag/frequency axes; update _auto_batch (and other call sites noted) to return both the processed array and a small metadata tuple/object containing batch_size and processed_length (or alternately return (values, axis_scale) where axis_scale == batch_size and processed_length == values.size), update callers (auto_correlate, cross_correlate, periodogram) to accept the (values, meta) return and use meta.batch_size (or meta.axis_scale) to scale lags/frequencies and account for dropped tail points (batch truncation) so coordinates map back to the original series; keep backward compatibility by providing a single-value return when no batching occurred (or document the new return shape) and preserve use of _AUTO_BATCH_LAST_WARNED_SIZE for logging.
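One possible shape for such an API, as a sketch only (the names BatchedSeries and auto_batch_with_meta are hypothetical, not the library's actual interface):

```python
from typing import NamedTuple

import numpy as np


class BatchedSeries(NamedTuple):
    values: np.ndarray  # block-averaged data
    stride: int         # original samples per output point (1 when no batching occurred)


def auto_batch_with_meta(x: np.ndarray, max_len: int) -> BatchedSeries:
    """Batch like _auto_batch(), but keep the stride so callers can rescale axes."""
    if max_len <= 0 or x.size <= max_len:
        return BatchedSeries(x, 1)
    stride = (x.size + max_len - 1) // max_len
    n = x.size // stride
    values = x[: n * stride].reshape(n, stride).mean(axis=1)
    return BatchedSeries(values, stride)


# A caller can then map lag k of the batched result back to lag k * stride
# (or frequency f back to f / stride) on the original sampling grid.
```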
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@kim_convergence/stats/tools.py`:
- Around line 39-50: The docstring claims KIM_CONV_MAX_TSD_LENGTH applies to
autocovariance and cross-covariance, but auto_covariance() and
cross_covariance() do not call _auto_batch() and currently pass large arrays
straight into _fft_corr(); update the implementation so auto_covariance(...) and
cross_covariance(...) invoke the same batching routine (_auto_batch) used by the
correlator/periodogram entry points before calling _fft_corr(), ensuring inputs
over KIM_CONV_MAX_TSD_LENGTH are block-averaged, or alternatively narrow the
docstring to only mention the correlator/periodogram entry points if you prefer
not to change behavior. Ensure you reference the functions auto_covariance,
cross_covariance, _auto_batch, and _fft_corr when making the change.
📒 Files selected for processing (1)
kim_convergence/stats/tools.py
Add _auto_batch() protection to auto_covariance() and cross_covariance() in addition to the existing entry points (auto_correlate, cross_correlate, modified_periodogram, periodogram). This ensures direct callers of the covariance primitives (e.g., statistical inefficiency calculations) are also protected from the MPI/BLAS deadlocks triggered by >5M-sample series. Idempotent batching prevents double-batching when correlators internally call covariance functions: the second check is a cheap no-op if the array is already within limits. All public statistical APIs now safely handle arbitrarily long time series while maintaining consistent normalization in downstream calculations.
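The idempotence argument, sketched with a toy stand-in (auto_batch and the small sizes below are illustrative; the real helper is _auto_batch() with a 5M default limit):

```python
import numpy as np


def auto_batch(x: np.ndarray, max_len: int = 5_000_000) -> np.ndarray:
    """Toy stand-in for _auto_batch(): block-average only when over the limit."""
    if max_len <= 0 or x.size <= max_len:
        return x  # early return: already within the safe length
    stride = (x.size + max_len - 1) // max_len
    n = x.size // stride
    return x[: n * stride].reshape(n, stride).mean(axis=1)


x = np.random.default_rng(1).normal(size=3_000)
once = auto_batch(x, max_len=1_000)        # batched down to 1000 points
twice = auto_batch(once, max_len=1_000)    # nested call (correlator -> covariance)
assert twice is once                       # cheap no-op: nothing is re-averaged
```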
Can you try again with the latest commit?
Actionable comments posted: 2
🧹 Nitpick comments (1)
kim_convergence/stats/tools.py (1)
149-150: The warning de-dup key is too coarse.
_AUTO_BATCH_LAST_WARNED_SIZE is process-global and only keyed on batch_size. After one batch_size=2 warning, every later input that also batches by 2 will go silent, even if it comes from a different API or a different original size. That is broader than the nested duplicate you are trying to suppress (auto_correlate → auto_covariance, periodogram → modified_periodogram) and makes the trigger point much harder to verify in MPI repros. Prefer per-call-chain suppression, or key the suppression on more than just batch_size.

Also applies to: 247-253
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kim_convergence/stats/tools.py` around lines 149 - 150, _AUTO_BATCH_LAST_WARNED_SIZE is too coarse; change the suppression to be keyed per call-chain (or include more context) instead of a single process-global int. Replace the int with a mapping keyed by a tuple that includes batch_size plus an identifier for the call-chain/source (for example function names like auto_correlate, auto_covariance, periodogram, modified_periodogram or a generated call-chain token), and update the warning logic to consult/update that map; alternatively use a ContextVar or thread-local keyed by call-chain to achieve per-call-chain suppression. Apply the same change where the same pattern appears (lines around the other occurrence noted).
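For the per-call-chain option mentioned above, one possible pattern, sketched with hypothetical names (the real code uses cr_warning and module-level state):

```python
import contextvars

# True once a batching warning has been emitted within the current public API call.
_warned_in_call: contextvars.ContextVar[bool] = contextvars.ContextVar(
    "auto_batch_warned", default=False
)


def warn_once_per_call(message: str) -> None:
    """Emit the warning only once per call chain (stand-in for cr_warning)."""
    if not _warned_in_call.get():
        print(f"WARNING: {message}")
        _warned_in_call.set(True)


def public_entry_point(x):
    token = _warned_in_call.set(False)  # fresh suppression scope for this call chain
    try:
        warn_once_per_call("auto-batching triggered")  # e.g. emitted by auto_correlate
        warn_once_per_call("auto-batching triggered")  # nested auto_covariance stays silent
    finally:
        _warned_in_call.reset(token)
```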
📒 Files selected for processing (1)
kim_convergence/stats/tools.py
```python
def _auto_batch(x: np.ndarray) -> np.ndarray:
    r"""Auto-batch time series data if it exceeds safe length limit.

    Uses non-overlapping block averaging to reduce array length while
    preserving mean and low-frequency statistics needed for convergence
    analysis. Remainder points at the end are discarded (standard batch
    behavior).

    Args:
        x (1darray): Input time series data.

    Returns:
        1darray: Original array if length <= limit, otherwise batched
            (down-sampled) array with length <= _MAX_TSD_LENGTH.
    """
    global _AUTO_BATCH_LAST_WARNED_SIZE

    if _MAX_TSD_LENGTH <= 0 or x.size <= _MAX_TSD_LENGTH:
        return x

    # Calculate batch size to ensure result fits within limit
    # Ceiling division: (n + limit - 1) // limit
    batch_size = (x.size + _MAX_TSD_LENGTH - 1) // _MAX_TSD_LENGTH

    # Warn only when batch_size changes to avoid log flooding
    if batch_size != _AUTO_BATCH_LAST_WARNED_SIZE:
        cr_warning(
            f"Time series length ({x.size}) exceeds safe limit "
            f"({_MAX_TSD_LENGTH}). Auto-batching with batch_size={batch_size}."
        )
        _AUTO_BATCH_LAST_WARNED_SIZE = batch_size

    return batch(x, batch_size=batch_size, func=np.mean)
```
Keep auto-batching out of the exact public stats APIs, or expose the stride.
Once _auto_batch() runs, these functions are no longer computing the covariance/correlation/periodogram of the original series; they are computing it for the block-averaged series. That means lag/frequency bin k now corresponds to k * batch_size on the original sampling grid, but none of these exported APIs return batch_size or any equivalent stride. Callers cannot recover the original units, so this is a semantic break for auto_covariance(), cross_covariance(), modified_periodogram(), and periodogram(), not just an implementation detail. Please either keep batching inside convergence-specific internal paths, or propagate the effective stride alongside the result.
Also applies to: 462-463, 511-513, 690-691, 762-763
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@kim_convergence/stats/tools.py` around lines 223 - 255, The public APIs
currently call _auto_batch which silently changes the effective sampling stride
(batch_size) so outputs from auto_covariance, cross_covariance,
modified_periodogram, and periodogram no longer map to the original
lag/frequency units; either keep _auto_batch usage confined to internal
convergence-only paths or change the public function signatures to propagate the
effective stride: detect batch_size inside _auto_batch and return it (or add an
optional out parameter/tuple return like (result, stride)), update
auto_covariance, cross_covariance, modified_periodogram, and periodogram to
accept/return the stride (and document the new return shape), update all call
sites and tests accordingly, and preserve the existing warning behavior via
_AUTO_BATCH_LAST_WARNED_SIZE so callers can convert lag/frequency bins back to
original units.
```python
    x = np.asarray(x, dtype=np.float64)

    # Auto-batch if series exceeds safe length
    x = _auto_batch(x)
```
Validate before these _auto_batch() calls.
auto_covariance() / cross_covariance() already batch after _validate_and_prepare_input(). Doing it again here means oversized malformed inputs hit batch() before the existing CRError checks for dimensionality, shape, and finiteness, and the outer call is otherwise redundant. Validate first here, then batch the checked arrays.
Possible direction
```diff
 def auto_correlate(
     x: Union[np.ndarray, list[float]], *, nlags: Optional[int] = None, fft: bool = False
 ) -> np.ndarray:
-    x = np.asarray(x, dtype=np.float64)
+    x, _ = _validate_and_prepare_input(x)

     # Auto-batch if series exceeds safe length
     x = _auto_batch(x)

 def cross_correlate(
     x: Union[np.ndarray, list[float]],
     y: Union[np.ndarray, list[float], None],
@@
-    x = np.asarray(x, dtype=np.float64)
-    y = np.asarray(y, dtype=np.float64)
+    x, y = _validate_and_prepare_input(x, y)
+    assert isinstance(y, np.ndarray)

     # Auto-batch if series exceeds safe length
     x = _auto_batch(x)
     y = _auto_batch(y)
```

Also applies to: 610-615
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@kim_convergence/stats/tools.py` around lines 546 - 550, The call to
_auto_batch is happening before validation, which allows oversized or malformed
inputs to reach batch() and bypass CRError checks; change the order in
auto_covariance and cross_covariance so that you first run
_validate_and_prepare_input (or the existing validation routine used elsewhere)
on x (and y where applicable) to enforce dimensionality/shape/finiteness, then
convert to np.asarray(dtype=np.float64) if needed, and only then call
_auto_batch on the already-validated arrays; apply the same reorder to the other
occurrence around lines 610-615 to ensure validation always precedes batching.
Closing this PR. Auto-batching inside the public statistical APIs (auto_covariance, cross_covariance, modified_periodogram, periodogram) is a semantic breaking change: once batched, lag k in the result corresponds to physical lag k × batch_size in the original series, but the API gives no way to recover that stride. This silently breaks the contract for any direct callers of these functions.
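A small illustration of the unit shift described above (toy numbers, not from the PR's tests): after block-averaging with batch_size = 2, a periodogram peak appears at twice the true per-sample frequency of the raw series unless the caller rescales by the stride.

```python
import numpy as np

n, true_freq = 1_000_000, 0.02              # cycles per original sample
t = np.arange(n)
x = np.cos(2 * np.pi * true_freq * t)

# Block-average with batch_size = 2, as auto-batching would do past the limit.
b = x.reshape(-1, 2).mean(axis=1)

# The peak sits at 0.04 cycles per *batched* sample; without knowing batch_size,
# a caller would report twice the true frequency of the raw series.
peak = np.argmax(np.abs(np.fft.rfft(b)) ** 2) / b.size
print(peak, true_freq * 2)                   # ~0.04 vs 0.04
```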
Summary
Running kim-convergence with LAMMPS on 20+ MPI ranks hangs indefinitely when time series exceed ~5 million samples. GDB traces show:
- wait4() (Python multiprocessing)

Root cause: NumPy/BLAS threading conflicts within MPI context when computing autocovariance/periodograms on large arrays.
Solution
Automatic batching (block-averaging) when KIM_CONV_MAX_TSD_LENGTH is exceeded.

Changes

- _auto_batch() helper using existing batch() utility
- Applied in auto_covariance(), cross_covariance(), modified_periodogram()
- KIM_CONV_MAX_TSD_LENGTH controls limit (default: 5M, set to 0 to disable)

Why this works
Testing
Related
Summary by CodeRabbit
New Features
Documentation