
[MNT] enforce strict CPU-GPU numerical parity for ROCKET (< 1e-7 divergence in Kernel creation)#3177

Closed
Adityakushwaha2006 wants to merge 14 commits into aeon-toolkit:main from Adityakushwaha2006:fix/rocket-gpu-numerical-parity

Conversation

@Adityakushwaha2006
Contributor

Reference Issues/PRs

Fixes #1248

What does this implement/fix? Explain your changes.

Executive Summary: Fixing Critical CPU-GPU Divergence (< 1e-7 Error)
This PR resolves a critical scientific reproducibility issue in ROCKETGPU. Prior to this fix, the GPU implementation exhibited massive statistical divergence from the CPU baseline (up to 68.9% mean error on multivariate data), rendering cross-platform experiments invalid.

This fix restores numerical parity by enforcing strict architectural alignment, reducing average divergence from ~11% to 0.000007% (< 10^{-7}) across all tested datasets.

1. The Problem: Why this is Critical
During an empirical audit, I discovered that ROCKETGPU produced fundamentally different results from Rocket (CPU) despite an identical random_state.

  • Multivariate Failure: On the BasicMotions dataset, specific features showed a 2,050% difference (CPU: 3.67 vs GPU: 78.88), confirming the GPU was using a mathematically distinct algorithm.
  • Univariate Failure: Even simple datasets like GunPoint showed ~9.6% divergence due to RNG desynchronization.

2. Root Cause Analysis & Fixes
I identified and resolved three specific structural defects:

  • Fixed "Two-Loop" RNG Mismatch:

  • Issue: The CPU pre-generates all channel counts in a primary loop, then draws weights in a secondary loop. The GPU used a single loop, causing the RNG sequence to desynchronize partway through parameter generation.

  • Fix: Refactored _define_parameters in _rocket_gpu.py to mirror the CPU's two-loop architecture exactly.

  • Fixed Channel Sorting Bug (Multivariate Error):

  • Issue: The GPU implementation explicitly sorted the randomly selected channel indices (e.g., [4, 1, 3] -> [1, 3, 4]). Because floating-point addition is not associative, this changed the convolution's summation order, causing the 68.9% divergence in multivariate data.

  • Fix: Removed the np.sort() operation. The GPU now processes channels in the exact same random order as the CPU.
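The associativity effect is easy to reproduce in isolation (a toy sketch, not the actual convolution code; the large-magnitude values are chosen to make the effect deterministic):

```python
# Per-channel contributions for hypothetical channels 1, 3, and 4.
contrib = {1: 1e16, 3: 1.0, 4: -1e16}

cpu_order = [4, 1, 3]          # random selection order, as the CPU uses it
gpu_order = sorted(cpu_order)  # the old GPU behavior: [1, 3, 4]

cpu_sum = sum(contrib[ch] for ch in cpu_order)  # (-1e16 + 1e16) + 1.0 == 1.0
gpu_sum = sum(contrib[ch] for ch in gpu_order)  # (1e16 + 1.0) - 1e16 == 0.0
```

In the sorted order, the 1.0 is absorbed when added to 1e16 (it is below one ULP at that magnitude), so the same mathematical sum yields a different floating-point result purely from reordering.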

  • Enforced Hardware Determinism:

  • Issue: Several TensorFlow CUDA ops are non-deterministic by default.

  • Fix: Added flags to force TensorFlow to use deterministic ops (TF_DETERMINISTIC_OPS='1').
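A minimal sketch of the determinism setup (the PR names only TF_DETERMINISTIC_OPS; the cuDNN flag and the TF 2.9+ API shown here are additional options, not necessarily what this PR uses):

```python
import os

# Environment flags must be set before `import tensorflow` to take effect.
os.environ["TF_DETERMINISTIC_OPS"] = "1"    # the flag named in this PR
os.environ["TF_CUDNN_DETERMINISTIC"] = "1"  # assumption: also forces deterministic cuDNN kernels

# On TF >= 2.9 the supported API is enable_op_determinism(), called after
# import; left as a comment so this sketch does not require TensorFlow:
# import tensorflow as tf
# tf.config.experimental.enable_op_determinism()
```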

3. Verification & Results
Post-fix benchmarking confirms statistical parity is restored:
(Benchmark results attached as an image in the original PR.)

4. Backward Compatibility
Added a legacy_rng flag to ROCKETGPU:

  • legacy_rng=True: Uses old (divergent) behavior (for existing users).
  • legacy_rng=False (Default): Uses new, strictly reproducible logic.
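The compatibility-flag pattern can be sketched as follows (a hypothetical, standalone class for illustration, not the actual ROCKETGPU code; the PR's later commits mention a deprecation warning for legacy_rng=True):

```python
import warnings

class RocketGPUSketch:
    """Hypothetical stand-in illustrating the legacy_rng switch."""

    def __init__(self, legacy_rng=False):
        self.legacy_rng = legacy_rng
        if legacy_rng:
            warnings.warn(
                "legacy_rng=True reproduces the old, CPU-divergent kernel "
                "generation and exists only for backward compatibility.",
                DeprecationWarning,
                stacklevel=2,
            )

    def _define_parameters(self):
        # Dispatch to the old or new RNG scheme based on the flag.
        if self.legacy_rng:
            return "single-loop (divergent) parameter generation"
        return "two-loop (CPU-matching) parameter generation"
```

The new behavior is the default, so existing pipelines only change results if they were already relying on the divergent output and explicitly opt back in.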

Does your contribution introduce a new dependency? If yes, which one?

No.

Any other comments?

This PR prioritizes scientific correctness over speed. The strict synchronization introduces overhead for small datasets (e.g., BasicMotions), but this is an intentional trade-off to guarantee validity. Future optimization can address performance via batch parameter injection.

A new test test_rocket_cpu_gpu_parity has been added to verify divergence stays < 1e-5.
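The parity check can be sketched like this (a simplified stand-in for the actual test; `assert_parity` and the arrays are illustrative, not the real fixtures):

```python
import numpy as np

def assert_parity(cpu_features, gpu_features, tol=1e-5):
    # Maximum relative divergence between CPU and GPU feature matrices.
    denom = np.maximum(np.abs(cpu_features), 1e-12)  # guard against division by zero
    divergence = np.max(np.abs(cpu_features - gpu_features) / denom)
    assert divergence < tol, f"divergence {divergence:.2e} exceeds {tol:.2e}"

# Stand-in arrays; in the real test both would come from fitting Rocket and
# ROCKETGPU with the same random_state on the same dataset.
cpu = np.array([3.67, 1.25, -0.50])
gpu = cpu * (1.0 + 1e-8)
assert_parity(cpu, gpu)
```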

PR checklist

For all contributions
For new estimators and functions
  • I've added the estimator/function to the online [API documentation](https://www.aeon-toolkit.org/en/latest/api_reference.html).
  • (OPTIONAL) I've added myself as a __maintainer__ at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.
For developers with write access

- Resolved structural RNG mismatch: aligned GPU kernel generation loops to match CPU's two-loop architecture
- Fixed multivariate divergence: removed channel index sorting in GPU that altered convolution topology
- Added legacy_rng flag: ensures backward compatibility for existing users
- Configured TF determinism: eliminated non-deterministic hardware noise in CUDA operations
- Achieved < 1e-7 divergence on GunPoint, BasicMotions, and ECG200
@aeon-actions-bot aeon-actions-bot bot added maintenance Continuous integration, unit testing & package distribution transformations Transformations package labels Dec 17, 2025
@aeon-actions-bot
Contributor

Thank you for contributing to aeon

I have added the following labels to this PR based on the title: [ maintenance ].
I have added the following labels to this PR based on the changes made: [ transformations ]. Feel free to change these if they do not properly represent the PR.

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

  • Run pre-commit checks for all files
  • Run mypy typecheck tests
  • Run all pytest tests and configurations
  • Run all notebook example tests
  • Run numba-disabled codecov tests
  • Stop automatic pre-commit fixes (always disabled for drafts)
  • Disable numba cache loading
  • Regenerate expected results for testing
  • Push an empty commit to re-run CI checks

@Adityakushwaha2006 Adityakushwaha2006 marked this pull request as draft December 17, 2025 18:48
@Adityakushwaha2006
Contributor Author

Adityakushwaha2006 commented Dec 17, 2025

I have opened this as a Draft to share the architectural fixes.
Please ignore any CI/Linting failures for the moment.
Current State:
Correctness: The logic is verified locally and achieves strict parity (less than 10^{-7} divergence) across all test datasets.
Performance: As expected, the strict synchronization introduces overhead.
Immediate Next Steps:
I am now moving to Phase 2: Optimization to recover the speed loss. Once the performance is stabilized, I will clean up the CI checks, finalize the documentation, and mark this ready for final review.

@Adityakushwaha2006
Contributor Author

Adityakushwaha2006 commented Dec 18, 2025

Update:
I ran the current script against all 15 available test cases and noticed some concerns with the current PR code.
These tests now consist of full GPU vs CPU runs, and they still show divergence in the final feature generation.

I had earlier verified that kernel generation matches: the divergence there is below 10^-7, attributable purely to computation methods and rounding errors.
My earlier assumption was that the convolution method would have some effect on the transformation step, but that it would be far less significant than the kernel fix (if the kernels differ, everything downstream breaks and the divergence compounds). As the tests reveal, even with near-identical kernels, the transformation step introduces some divergence.
The divergence is not extreme: under 0.5% on most datasets, with some around 1%.
However, I did find an outlier: the BasicMotions dataset shows a 244% divergence, caused entirely by the transform step, since its kernel similarity (as seen above) is also below 10^-7.

Just to note before moving forward: ahead of the speed optimisation of the GPU code and the final merge, I will focus on minimizing the convolution-based divergence, which will result in an intermediate draft PR.

(Image attached in the original PR: KERNEL SIMILARITY - PARITY MATCH table.)

- Add legacy_rng parameter (default=False) for CPU parity
- Fix RNG sequence in _define_parameters() to match CPU exactly
- Achieve <10^-7 kernel divergence across 15 datasets
- Add deprecation warning for legacy_rng=True
- Update test to verify kernel parity (Phase 1 scope)
- Document known limitation: multivariate transform divergence (Phase 2)

Fixes aeon-toolkit#1248

Phase 1: Kernel generation parity - COMPLETE
Phase 2: Transform parity for multivariate - Future work
@Adityakushwaha2006 Adityakushwaha2006 force-pushed the fix/rocket-gpu-numerical-parity branch 2 times, most recently from 877e693 to c1bd777 Compare December 19, 2025 10:26
@Adityakushwaha2006 Adityakushwaha2006 marked this pull request as ready for review December 19, 2025 15:50
@Adityakushwaha2006
Contributor Author

The CI failures on this PR are not caused by the ROCKET GPU changes. They're caused by a pre-existing bug in the AEFCN network module that affects all tests across the repository.

Root Cause: AEFCNNetwork.build_network() in aeon/networks/_ae_fcn.py passes a numpy scalar to Dense(units=...), which Keras rejects.

Error: ValueError: Received an invalid value for units, expected a positive integer. Received: units=19200

Location: Line in _ae_fcn.py:
latent_space = tf.keras.layers.Dense(units=np.prod(shape_before_flattent))(flatten_layer)

Why This Blocks My PR

  • This PR only modifies: rocketGPU/_rocket_gpu.py and rocketGPU/tests/test_base_rocketGPU.py
  • CI runs the entire test suite before testing ROCKET changes
  • AEFCN tests fail early, preventing ROCKET tests from running
  • All 7 CI jobs (macOS, Ubuntu, Windows) fail on the same AEFCN error

TensorFlow's Keras Dense layer requires a Python int for the units parameter.
Using np.prod() directly returns numpy.int64, which Keras rejects with:
ValueError: Received an invalid value for units, expected a positive integer.

Fixed in 3 AE network classes:
- AEFCNNetwork: decoder Dense layer
- AEResNetNetwork: decoder Dense layer
- AEDCNNNetwork: decoder Dense layer

This fix unblocks all TF-based CI tests and notebooks.
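The dtype issue can be reproduced without Keras (the shape is illustrative, chosen to match the 19200 in the error message):

```python
import numpy as np

shape_before_flatten = (24, 800)  # illustrative shape; 24 * 800 == 19200

# np.prod returns a NumPy integer scalar, not a plain Python int.
units = np.prod(shape_before_flatten)
assert isinstance(units, np.integer) and not isinstance(units, int)

# The fix: cast to a Python int before passing it to Dense(units=...).
safe_units = int(np.prod(shape_before_flatten))
assert isinstance(safe_units, int) and safe_units == 19200
```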
@Adityakushwaha2006 Adityakushwaha2006 force-pushed the fix/rocket-gpu-numerical-parity branch from 66d929c to 7a2d0b3 Compare December 19, 2025 18:51
TensorFlow requires float32 tensors but test was passing float64.
Error: 'expected float tensor but is double tensor'

Added .astype(np.float32) conversion to test data.
@Adityakushwaha2006 Adityakushwaha2006 force-pushed the fix/rocket-gpu-numerical-parity branch from ce5082c to 31179f6 Compare December 19, 2025 19:13
TensorFlow Conv2D requires float32 tensors but numpy operations
were upcasting to float64, causing errors across all platforms.

Fixed two sources of float64 upcasting:
1. Kernel normalization: Added .astype(np.float32) after reshape
2. Bias generation: Wrapped np.random.uniform() with np.float32()

This ensures all tensors passed to TF operations are float32.
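Both casts can be demonstrated with NumPy alone (a sketch; variable names and the normalization formula are illustrative stand-ins for the actual kernel code):

```python
import numpy as np

# 1. np.random.uniform always returns float64, regardless of its arguments:
bias64 = np.random.uniform(-1.0, 1.0, size=4)
assert bias64.dtype == np.float64

# Fix: wrap the draw in np.float32 so downstream TF ops receive float32.
bias32 = np.float32(np.random.uniform(-1.0, 1.0))
assert isinstance(bias32, np.float32)

# 2. Kernel normalization: cast after reshape so the result stays float32.
kernel = np.random.normal(size=9).reshape(3, 3).astype(np.float32)
normalized = ((kernel - kernel.mean()) / (np.abs(kernel).sum() + 1e-8)).astype(np.float32)
assert normalized.dtype == np.float32
```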
@Adityakushwaha2006 Adityakushwaha2006 force-pushed the fix/rocket-gpu-numerical-parity branch from 40dabeb to 6aec859 Compare December 19, 2025 19:50
TensorFlow automatically converts numpy arrays to float64 tensors by default,
even if the numpy array is float32. This was causing Conv2D dtype errors.

Solution: Wrap kernel in tf.constant(..., dtype=tf.float32) before passing
to conv1d to ensure TensorFlow receives float32 tensor.

Also added float32 casts to kernel normalization and bias generation as
defense-in-depth, though the tf.constant conversion is the critical fix.

Tested locally - dtype errors resolved.
The same PPV/Max features were being appended twice to output_features_filter,
causing output to have 2x the expected number of samples (e.g., 20 instead of 10).

Removed duplicate append block.
@Adityakushwaha2006 Adityakushwaha2006 force-pushed the fix/rocket-gpu-numerical-parity branch from 40fc8f2 to 7a069cc Compare December 19, 2025 20:43
@Adityakushwaha2006
Contributor Author

Note for the reviewer:
This PR also contains the fix implemented in PR #3182. Since #3182 is urgent, if it is merged first I can rebase this PR if needed and re-run the CI.

@hadifawaz1999
Member

You can update the branch to main; the PR for the AE fix has been merged.

@Adityakushwaha2006 Adityakushwaha2006 force-pushed the fix/rocket-gpu-numerical-parity branch from c60ce8a to 68f789f Compare December 20, 2025 17:42
@Adityakushwaha2006 Adityakushwaha2006 force-pushed the fix/rocket-gpu-numerical-parity branch 3 times, most recently from ccdb24a to 7a069cc Compare December 20, 2025 18:33
Adityakushwaha2006 and others added 3 commits December 21, 2025 00:04
Added tf.random.set_seed(1) to 3 test files to resolve RuntimeError
when TensorFlow determinism is enabled but no seed is set.

Files fixed:
- aeon/classification/shapelet_based/tests/test_ls.py
- aeon/regression/deep_learning/tests/test_deep_regressor_base.py
- aeon/networks/tests/test_deepar.py

All tests now pass locally with determinism enabled.
@Adityakushwaha2006 Adityakushwaha2006 changed the title [MNT] enforce strict CPU-GPU numerical parity for ROCKET (< 1e-7 divergence) [MNT] enforce strict CPU-GPU numerical parity for ROCKET (< 1e-7 divergence in Kernel creation) Dec 22, 2025
@Adityakushwaha2006
Contributor Author

Adityakushwaha2006 commented Dec 22, 2025

(STILL REVIEW READY)

Hey, I found a bug in the implementation while benchmarking performance.
It arises from kernel divergence on multivariate datasets.

This does not measurably change the performance results in this PR; multivariate datasets were producing divergence either way. It is just another area I will focus my analysis on.

The current PR does ensure that univariate kernel generation and feature generation achieve parity.

UPDATE:

It was just an ordering issue in how features are stored on GPU vs CPU (really a benchmarking issue): the kernel values were the same, only their ordering differed, so both produce the same kernels. This causes no issue for users or the implementation, but to make testing easier I will align the ordering in future PRs where I work on transform-step parity.

(Noting this because the same effect may show up in your validation and testing scripts as well.)

@Adityakushwaha2006
Contributor Author

This PR is superseded by PR #3211, which uses a CuPy backend.
Given the current limitations of the TensorFlow approach, it holds no merit compared to the 100% parity, ~100x computation speedup, and full multivariate support offered by the CuPy GPU variant.

Closing this PR for that reason @hadifawaz1999.

@Adityakushwaha2006 Adityakushwaha2006 marked this pull request as draft January 4, 2026 22:41
@Adityakushwaha2006
Contributor Author

Currently drafted; I will confirm and complete this PR based on the decisions made in the discussions of PR #3211.
Please hold off on review until it is undrafted.

@Adityakushwaha2006
Contributor Author

Closing this PR. This implementation has gotten messy and is not the best way to approach these changes, especially after the discussions in the related PR. I will revisit this later with a better, cleaner approach.


Labels

maintenance Continuous integration, unit testing & package distribution transformations Transformations package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[MNT] Test consistency between CPU & GPU version of ROCKET

2 participants