-
Notifications
You must be signed in to change notification settings - Fork 266
Description
Describe the feature or idea you want to propose
Yes, this request is related to a significant performance and scalability problem in the current dtw_distance implementation.
The function currently allocates the full O(N*M) cost matrix in memory. For long time series (e.g., N > 10,000), this leads to very high memory consumption (several gigabytes) and can cause MemoryError. This makes it unfeasible to use DTW on large-scale datasets, especially in parallel environments where multiple workers would amplify the memory pressure. This also results in poor cache performance, limiting execution speed even for moderately sized series.
Describe your proposed solution
My proposed solution is to refactor the internal helper function responsible for the distance calculation (_dtw_distance) to use a space-efficient algorithm with O(M) memory complexity.
This approach calculates the distance iteratively using only two row vectors to store the costs of the previous and current rows, avoiding the allocation of the full matrix entirely. This is a standard dynamic programming optimization pattern that fits perfectly here.
The key benefits are:
- Massive Memory Reduction: Benchmarks show a >99.9% decrease in peak memory usage, making DTW viable for datasets of virtually any size.
- Consistent Speedup (~2x): The improved data locality results in a consistent ~2x speedup for non-trivial series lengths.
This change would maintain the existing API and numerical correctness while drastically improving the performance and scalability of all estimators that rely on distance="dtw".
Describe alternatives you've considered, if relevant
The alternative is to maintain the current implementation. This is simpler from a code-sharing perspective (as dtw_distance reuses dtw_cost_matrix), but it comes at a very high performance and memory cost that limits the practical application of the library for many real-world use cases involving long time series.
Additional context
I have already implemented and benchmarked a prototype of this solution. It successfully passes the entire local test suite (pytest aeon/), confirming numerical equivalence with the baseline.
I am ready to submit a Pull Request with the implementation and detailed benchmarks for review. Opening this issue first to ensure this is a welcome enhancement.