Skip to content

Conversation

@kempa-liehr
Copy link
Collaborator

This pull fixes the bin boundaries such that the very first bin contains a reasonable number of time-series value.
While this fix has a nearly neglectable impact on the test cases, it significantly changes the feature value for real-world applications and improves the feature's interpretability.

    This fix mainly improves the interpretability of the feature, while the complexity measure itself seems to be fairly robust against this change.
@nils-braun
Copy link
Collaborator

Thanks @kempa-liehr. I think the failing tests are unrelated (but something we should have a look into nevertheless).

@MaxBenChrist
Copy link
Collaborator

can we merge this?

inc = 1
while ind + inc <= n:
# convert tu tuple to make it hashable
# convert to tuple in order to make it hashable
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the feeling that this is not an efficient implementation.

If we want to count the occurrence of substrings, we can use a counter for collections (collections.Counter). If we want to count the amount of different substrings, we can just call set on the list of substrings, without the loop

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But lemel_ziv isn't neither the amount of fixed-length substrings nor the counter of fixed length substrings. The idea behind it is that the length of the substrings is chosen in a way to never have any substring occuring doubled.

Example: 100111
First substring: just 1
Second: just 0
Third: would be 0 again, but we have 0 already so we need to choose 01
Forth: would be 1, but we have 1 already so we need to choose 11
End of the word -> 4 substrings / length 6 = 0.66.

The code is O(n) (the while loop needs approx n steps in in each we have a set check, which is O(1) and some other operations like addition, which are also O(1)), and I do not think we can be asymptotically better than this. There exist cython implementations of that - but as long as those are not accessible easily via conda and pip, I do not see any nice way for us.

Having some compiled (cython/numba) feature calculators would be really nice, but so far no one had time to have a look into this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the explanation nils!

@nils-braun
Copy link
Collaborator

I have just merged in the newest changes from main. Let's see if that fixed the failing tests

@codecov-io
Copy link

codecov-io commented Mar 20, 2021

Codecov Report

Merging #806 (98bcd43) into main (26af836) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #806   +/-   ##
=======================================
  Coverage   95.92%   95.92%           
=======================================
  Files          18       18           
  Lines        1841     1841           
  Branches      362      362           
=======================================
  Hits         1766     1766           
  Misses         37       37           
  Partials       38       38           
Impacted Files Coverage Δ
tsfresh/feature_extraction/feature_calculators.py 97.29% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 26af836...98bcd43. Read the comment docs.

@nils-braun nils-braun merged commit cb7943e into main Mar 20, 2021
@nils-braun nils-braun deleted the fix_lempel_ziv branch March 20, 2021 09:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants