-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Fix conversion of time-series into sequence for lempel_ziv_complexity #806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This fix mainly improves the interpretability of the feature, while the complexity measure itself seems to be fairly robust against this change.
|
Thanks @kempa-liehr. I think the failing tests are unrelated (but something we should have a look into nevertheless). |
|
can we merge this? |
| inc = 1 | ||
| while ind + inc <= n: | ||
| # convert tu tuple to make it hashable | ||
| # convert to tuple in order to make it hashable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the feeling that this is not an efficient implementation.
If we want to count the occurrence of substrings, we can use a counter for collections (collections.Counter). If we want to count the amount of different substrings, we can just call set on the list of substrings, without the loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But lemel_ziv isn't neither the amount of fixed-length substrings nor the counter of fixed length substrings. The idea behind it is that the length of the substrings is chosen in a way to never have any substring occuring doubled.
Example: 100111
First substring: just 1
Second: just 0
Third: would be 0 again, but we have 0 already so we need to choose 01
Forth: would be 1, but we have 1 already so we need to choose 11
End of the word -> 4 substrings / length 6 = 0.66.
The code is O(n) (the while loop needs approx n steps in in each we have a set check, which is O(1) and some other operations like addition, which are also O(1)), and I do not think we can be asymptotically better than this. There exist cython implementations of that - but as long as those are not accessible easily via conda and pip, I do not see any nice way for us.
Having some compiled (cython/numba) feature calculators would be really nice, but so far no one had time to have a look into this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the explanation nils!
|
I have just merged in the newest changes from main. Let's see if that fixed the failing tests |
Codecov Report
@@ Coverage Diff @@
## main #806 +/- ##
=======================================
Coverage 95.92% 95.92%
=======================================
Files 18 18
Lines 1841 1841
Branches 362 362
=======================================
Hits 1766 1766
Misses 37 37
Partials 38 38
Continue to review full report at Codecov.
|
This pull fixes the bin boundaries such that the very first bin contains a reasonable number of time-series value.
While this fix has a nearly neglectable impact on the test cases, it significantly changes the feature value for real-world applications and improves the feature's interpretability.