Currently the process module has the following functions:
| function | kind | explanation |
| --- | --- | --- |
| extractOne | one x many | returns the best match as (choice, score, index/key) |
| extract | one x many | returns the best matches until limit as list[(choice, score, index/key)] |
| extract_iter | one x many | generator yielding (choice, score, index/key). Usage is not really recommended, since it is far slower than the others |
| cdist | many x many | returns all results as numpy matrix |
It would be nice to have equivalents of extractOne / extract for many x many. They would need less memory than cdist, which can take a large amount of memory when len(queries) and len(choices) are large.
| function | kind | explanation |
| --- | --- | --- |
| - | many x many | returns the best matches as list[(choice, score, index)] |
| - | many x many | returns the best matches until limit as list[list[(choice, score, index)]] |
| - | one x many | returns all results without any sorting, like cdist |
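To make the first two rows of the proposal concrete, here is a rough sketch of what such functions could look like. The names `extract_one_many` and `extract_many` are hypothetical placeholders (the naming is still open), and the scorer is a stdlib stand-in based on `difflib.SequenceMatcher`, not one of the library's scorers. The point is only that the result holds at most `limit` entries per query instead of a full len(queries) x len(choices) matrix.

```python
from difflib import SequenceMatcher
import heapq


def ratio(a, b):
    """Stand-in scorer (assumption): similarity in the range 0..100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def extract_one_many(queries, choices, scorer=ratio, score_cutoff=0.0):
    """Hypothetical many x many extractOne.

    Returns one (choice, score, index) per query, or None when no choice
    reaches score_cutoff.
    """
    results = []
    for query in queries:
        best = None
        for index, choice in enumerate(choices):
            score = scorer(query, choice)
            if score >= score_cutoff and (best is None or score > best[1]):
                best = (choice, score, index)
        results.append(best)
    return results


def extract_many(queries, choices, scorer=ratio, limit=5, score_cutoff=0.0):
    """Hypothetical many x many extract.

    Returns, per query, up to `limit` (choice, score, index) tuples sorted
    by descending score.
    """
    results = []
    for query in queries:
        scored = (
            (choice, scorer(query, choice), index)
            for index, choice in enumerate(choices)
        )
        kept = (item for item in scored if item[1] >= score_cutoff)
        # nlargest keeps only `limit` items in memory per query
        results.append(heapq.nlargest(limit, kept, key=lambda item: item[1]))
    return results
```

Calling `extract_one_many(["apple", "banan"], ["banana", "apple", "cherry"])` gives one tuple per query, so the output grows with len(queries) only.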
A first thought might be to overload the existing extractOne / extract on the type passed as query / queries. However, this is not possible, because a list of strings is itself a valid single query, as in the following valid usage of these methods:
extractOne(["hello", "world"], [["hello", "world"]])
which cannot be distinguished from a many x many call. For this reason, these functions need a new API.
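The ambiguity can be illustrated with a small sketch. It uses a stdlib stand-in scorer built on `difflib.SequenceMatcher` (an assumption, not the library's own scorer), which, like the call above, accepts any sequence of hashable items. The same arguments give different results depending on which interpretation is chosen, so the type of the first argument alone cannot decide it.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in scorer (assumption): works on any sequences of hashable items
    return 100.0 * SequenceMatcher(None, a, b).ratio()


query = ["hello", "world"]
choices = [["hello", "world"]]

# Interpretation 1 (one x many): the list is a single query of two tokens
as_single_query = [ratio(query, choice) for choice in choices]

# Interpretation 2 (many x many): the list holds two separate string queries,
# each compared character by character against the token list
as_many_queries = [[ratio(q, choice) for choice in choices] for q in query]

print(as_single_query)
print(as_many_queries)
```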
Besides this, in many cases users are not actually interested in the exact scores, but only care about finding elements with a score that is better than the score_cutoff. These cases could potentially be implemented more efficiently, since the implementation could stop as soon as it is known that the score is better than score_cutoff. These could be cases:
This could be automatically done when the user passes dtype=bool.
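One plausible case, sketched below under assumptions, is a plain "is there any match at all" check. The function name `has_match` is hypothetical, the scorer is a stdlib stand-in based on `difflib.SequenceMatcher`, and "better than" is read here as `>=`. The sketch only shows early exit across choices; an actual implementation could additionally stop inside a single score computation.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer (assumption): similarity in the range 0..100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def has_match(query, choices, scorer=ratio, score_cutoff=80.0):
    """Hypothetical helper: True if any choice reaches score_cutoff.

    any() consumes the generator lazily, so scoring stops at the first
    choice that is good enough and the remaining choices are never scored.
    """
    return any(scorer(query, choice) >= score_cutoff for choice in choices)
```

For example, `has_match("apple", ["apple", "banana", "cherry"])` scores only the first choice before returning.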
Any suggestions on the naming of these new APIs are welcome.