Refactor: Use collections.Counter for efficient frequent element counting in SubjectProcessor #12111
Harshul23 wants to merge 2 commits into internetarchive:master from
Conversation
Pull request overview
This PR refactors SubjectProcessor._most_used in openlibrary/core/lists/engine.py to use the standard library’s collections.Counter for selecting the most frequent element, aiming to reduce boilerplate and avoid sorting the full frequency map.
Changes:
- Replace manual `defaultdict` counting + full sort with `Counter(seq).most_common(1)` to compute the most frequent element.
- Update imports to use direct `from collections import ...` usage and adjust `defaultdict` references accordingly.
- Add an empty-input guard in `_most_used` (returning `None`).
```diff
 import re
+from collections import Counter, defaultdict
```

```diff
-        d[x] += 1
-        return sorted(d, key=lambda k: d[k], reverse=True)[0]
+        """Returns the most frequent element in a sequence using collections.Counter."""
+        if not seq:
+            return None
+        return Counter(seq).most_common(1)[0][0]
```
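Pulled out of the diff, the refactored helper can be sketched as a standalone function (the name `most_used` is for illustration; in the PR it is the `SubjectProcessor._most_used` method):

```python
from collections import Counter


def most_used(seq):
    """Return the most frequent element in seq, or None if seq is empty.

    Counter tallies the elements in a single pass, and most_common(1)
    returns a list like [(element, count)], from which [0][0] extracts
    the element itself.
    """
    if not seq:
        return None
    return Counter(seq).most_common(1)[0][0]
```

For example, `most_used(["a", "b", "a"])` yields `"a"`, while `most_used([])` yields `None` instead of raising.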
Thanks for working on this and referencing the issue! This implementation looks clean and aligns well with the intended optimization. The handling for empty sequences is also a nice improvement over the previous behavior. Looks good to me 👍

Thank you @Adeelp1 for the review!

Hi @Adeelp1, thank you for the feedback. I completely agree there is no point in refactoring code that isn't being used. I'm glad my PR helped identify this dead code path. I'm happy to close this PR in favor of the cleanup effort.
Summary
This PR optimizes the `_most_used` method in `openlibrary/core/lists/engine.py` by replacing a manual counting and sorting implementation with Python's built-in `collections.Counter`.

The Problem

The previous implementation of `_most_used` manually populated a `defaultdict` and then sorted the entire dictionary to find the most frequent element. This resulted in unnecessary sorting work and extra boilerplate.

The Solution
By using `collections.Counter(seq).most_common(1)`, we achieve:

- Less boilerplate: counting and selection are handled by the standard library.
- `most_common(n)` is more efficient for finding top results (using a heap-based approach for small `n`).
- The new empty-input guard returns `None` instead of raising `IndexError` exceptions.

Closes #12100
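The "heap-based approach" is a CPython implementation detail: for `n` smaller than the number of distinct elements, `Counter.most_common(n)` delegates to `heapq.nlargest` rather than sorting every entry. A rough sketch of that equivalence (the sample sequence here is illustrative):

```python
import heapq
from collections import Counter
from operator import itemgetter

seq = ["fiction", "history", "fiction", "fiction", "history", "science"]
counts = Counter(seq)

# Heap selection over (element, count) pairs: picks the top-1 entry
# without fully sorting all k distinct elements.
top_via_heap = heapq.nlargest(1, counts.items(), key=itemgetter(1))

assert top_via_heap == counts.most_common(1) == [("fiction", 3)]
```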
Achieves: [refactor]
Technical
The refactor changes the time complexity of finding the mode of the sequence from $O(k \log k)$ to $O(n)$, where $n$ is the sequence length and $k$ is the number of distinct elements. I also updated the module-level imports to use `from collections import Counter, defaultdict` for better clarity and to reduce verbosity in the `SubjectProcessor` class.

Testing
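To make the complexity claim concrete, here is a micro-benchmark sketch comparing the old sort-based selection with the `Counter` version (timings vary by machine; the function names are illustrative):

```python
import random
import timeit
from collections import Counter, defaultdict


def old_most_used(seq):
    # Previous approach: manual counting, then a sort over all k distinct keys.
    d = defaultdict(int)
    for x in seq:
        d[x] += 1
    return sorted(d, key=lambda k: d[k], reverse=True)[0]


def new_most_used(seq):
    # Refactored approach: single counting pass plus top-1 selection.
    return Counter(seq).most_common(1)[0][0]


random.seed(0)
seq = [random.randrange(10_000) for _ in range(100_000)]

# Both pick the first-seen element among maximal counts, so results match.
assert old_most_used(seq) == new_most_used(seq)

for fn in (old_most_used, new_most_used):
    print(fn.__name__, timeit.timeit(lambda: fn(seq), number=5))
```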
- Verified that `_most_used` returns `None` for empty sequences.
- Ran the `pytest` suite within the Docker `web` container. While the environment had some pathing issues, the core logic was independently verified against the expected behavior of the `SubjectProcessor`.

Screenshot
N/A (Backend logic change only)
Stakeholders
@Adeelp1