Refine fuzzy singer matching with adaptive Levenshtein and safeguards#26
Refine fuzzy singer matching with adaptive Levenshtein and safeguards#26
Conversation
Summary of ChangesHello @NHLOCAL, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the robustness and accuracy of singer name matching by introducing an adaptive Levenshtein distance algorithm. The changes aim to improve the system's ability to identify artists despite minor discrepancies or longer names, while critically avoiding incorrect associations that could arise from partial word matches or common prefixes. This ensures more reliable artist identification across the application. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a more sophisticated fuzzy matching algorithm for artist names, replacing a simple one-letter difference check with an adaptive Levenshtein distance-based approach. This is a significant improvement that should increase matching accuracy, especially for longer names, while new safeguards help prevent common false positives. The changes are well-structured, and the inclusion of a comprehensive unit test suite is excellent. I've suggested a minor refactoring to improve readability in one of the new functions.
| total_distance = 0 | ||
| is_valid_window = True | ||
|
|
||
| for candidate_word, artist_word in zip(candidate_words, artist_words): | ||
| if _is_prefix_or_suffix_expansion(candidate_word, artist_word): | ||
| is_valid_window = False | ||
| break | ||
|
|
||
| max_word_distance = _max_allowed_word_distance(len(artist_word)) | ||
| word_distance = _levenshtein_distance(candidate_word, artist_word, max_word_distance) | ||
|
|
||
| if word_distance > max_word_distance: | ||
| is_valid_window = False | ||
| break | ||
|
|
||
| total_distance += word_distance | ||
| if total_distance > max_phrase_distance: | ||
| is_valid_window = False | ||
| break | ||
|
|
||
| if is_valid_window: | ||
| return True |
There was a problem hiding this comment.
The use of the is_valid_window flag can be simplified by using a for...else block. The else block of a for loop is executed when the loop completes without encountering a break statement. This makes the code more concise and Pythonic.
| total_distance = 0 | |
| is_valid_window = True | |
| for candidate_word, artist_word in zip(candidate_words, artist_words): | |
| if _is_prefix_or_suffix_expansion(candidate_word, artist_word): | |
| is_valid_window = False | |
| break | |
| max_word_distance = _max_allowed_word_distance(len(artist_word)) | |
| word_distance = _levenshtein_distance(candidate_word, artist_word, max_word_distance) | |
| if word_distance > max_word_distance: | |
| is_valid_window = False | |
| break | |
| total_distance += word_distance | |
| if total_distance > max_phrase_distance: | |
| is_valid_window = False | |
| break | |
| if is_valid_window: | |
| return True | |
| total_distance = 0 | |
| for candidate_word, artist_word in zip(candidate_words, artist_words): | |
| if _is_prefix_or_suffix_expansion(candidate_word, artist_word): | |
| break | |
| max_word_distance = _max_allowed_word_distance(len(artist_word)) | |
| word_distance = _levenshtein_distance(candidate_word, artist_word, max_word_distance) | |
| if word_distance > max_word_distance: | |
| break | |
| total_distance += word_distance | |
| if total_distance > max_phrase_distance: | |
| break | |
| else: | |
| return True |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fbcd335e67
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for source_name, target_name in self.singer_list: | ||
| if source_name in split_file: | ||
| exact = check_exact_name(split_file, source_name) | ||
| if exact: | ||
| found_artists.append(target_name) | ||
| break # מצאנו אמן, אין צורך להמשיך | ||
| exact = check_exact_name(split_file, source_name) | ||
| if exact: |
There was a problem hiding this comment.
Run exact singer matching before fuzzy matching
This loop now breaks on the first check_exact_name hit, but check_exact_name includes fuzzy Levenshtein matching, so an earlier near-duplicate in self.singer_list can win before the true exact artist entry is reached. With the current src/core/app/singer-list.csv, an input like אליקם בוטה is matched to the earlier אליקים בוטה row, which routes files to the wrong artist folder even when the exact name exists later in the list.
Useful? React with 👍 / 👎.
| if re.search(exact_match_pattern, filename): | ||
| return True | ||
|
|
||
| return _is_similar_name_match(filename, artist_to_search) |
There was a problem hiding this comment.
Avoid fuzzy distance checks for every candidate name
Falling back to _is_similar_name_match on every non-exact check makes each singer-list scan run edit-distance logic for almost all rows, including obvious non-candidates. In practice this is a major runtime regression for no-hit inputs (local check against singer-list.csv: ~28.5s for 200 lookups vs ~0.013s previously), so large sorting runs can slow dramatically when filenames/metadata do not match known artists.
Useful? React with 👍 / 👎.
Motivation
ו-prefixed matches.אליvsיואליorמוטיvsלמוטי).Description
src/core/check_name.pythat sets per-word and per-phrase distance thresholds via_max_allowed_word_distanceand_max_allowed_phrase_distance._tokenize_words), an optimized Levenshtein implementation (_levenshtein_distance), and a safeguard to reject prefix/suffix expansions (_is_prefix_or_suffix_expansion).וprefix) as the first check incheck_exact_nameand only falls back to fuzzy matching when exact search fails.src/core/singles_sorter_v5.pyto callcheck_exact_namedirectly (removed the preliminaryif source_name in ...containment check) so the new fuzzy logic is applied consistently.src/tests/test_check_name_similarity.pycovering exact matches,וprefix, one-letter variations, long-name tolerance, rejection on excessive differences, prevention of prefix/suffix false-positives, multi-word window matching, and window-size requirements.Testing
python -m unittest src/tests/test_check_name_similarity.pyand all tests passed (OK).PYTHONPATH=src/core python src/tests/legacy/assert_check_name.pywhich completed without errors.src/core/app/singer-list.csvthat verified known negative caseיואלי קלייןvsאלי קלייןand reported no unexpected false hits in the sampled pairs.Codex Task