Refine fuzzy singer matching with adaptive Levenshtein and safeguards #26

Open
NHLOCAL wants to merge 3 commits into main from codex/improve-artist-name-recognition-logic

@NHLOCAL (Owner) commented Feb 15, 2026

Motivation

  • Improve the previous one-letter fuzzy matcher to better handle long artist names while avoiding new false positives.
  • Make matching robust to small OCR/typo errors in filenames and metadata while preserving exact and ו-prefixed matches.
  • Prevent spurious matches introduced by simple prefix/suffix expansions (for example אלי vs יואלי or מוטי vs למוטי).

Description

  • Replaced the prior one-letter-only logic with an adaptive Levenshtein-based matcher in src/core/check_name.py that sets per-word and per-phrase distance thresholds via _max_allowed_word_distance and _max_allowed_phrase_distance.
  • Added tokenization (_tokenize_words), an optimized Levenshtein implementation (_levenshtein_distance), and a safeguard to reject prefix/suffix expansions (_is_prefix_or_suffix_expansion).
  • Kept exact matching (including the optional ו prefix) as the first check in check_exact_name, which now falls back to fuzzy matching only when the exact search fails.
  • Updated usages in src/core/singles_sorter_v5.py to call check_exact_name directly (removed the preliminary if source_name in ... containment check) so the new fuzzy logic is applied consistently.
  • Added unit tests in src/tests/test_check_name_similarity.py covering exact matches, ו prefix, one-letter variations, long-name tolerance, rejection on excessive differences, prevention of prefix/suffix false-positives, multi-word window matching, and window-size requirements.
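
The adaptive threshold and bounded edit distance described in the list above could look roughly like the sketch below. The function names echo those in the description, but the threshold values and the early-exit strategy are assumptions made for illustration; they are not taken from src/core/check_name.py.

```python
def max_allowed_word_distance(word_length: int) -> int:
    """Hypothetical thresholds: longer words tolerate more edits."""
    if word_length <= 3:
        return 0  # very short names must match exactly
    if word_length <= 7:
        return 1
    return 2


def levenshtein_distance(a: str, b: str, max_distance: int) -> int:
    """Edit distance between a and b, capped at max_distance + 1."""
    # The distance is never smaller than the difference in length.
    if abs(len(a) - len(b)) > max_distance:
        return max_distance + 1

    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        # The row minimum never decreases, so we can stop once it exceeds the cap.
        if min(current) > max_distance:
            return max_distance + 1
        previous = current
    return min(previous[-1], max_distance + 1)


print(levenshtein_distance("kitten", "sitting", 3))  # 3
```

Capping the result lets a caller compare it directly against the threshold without computing the full matrix for clearly different words.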

Testing

  • Ran python -m unittest src/tests/test_check_name_similarity.py and all tests passed (OK).
  • Ran legacy assertions via PYTHONPATH=src/core python src/tests/legacy/assert_check_name.py which completed without errors.
  • Performed a sampling sanity check against src/core/app/singer-list.csv that verified known negative case יואלי קליין vs אלי קליין and reported no unexpected false hits in the sampled pairs.

@gemini-code-assist

Summary of Changes

Hello @NHLOCAL, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and accuracy of singer name matching by introducing an adaptive Levenshtein distance algorithm. The changes aim to improve the system's ability to identify artists despite minor discrepancies or longer names, while critically avoiding incorrect associations that could arise from partial word matches or common prefixes. This ensures more reliable artist identification across the application.

Highlights

  • Improved Fuzzy Matching: Replaced the previous one-letter fuzzy matcher with an adaptive Levenshtein-based algorithm to better handle long artist names and tolerate small OCR/typo errors.
  • False Positive Safeguards: Implemented logic to prevent spurious matches from simple prefix/suffix expansions (e.g., 'אלי' vs 'יואלי') and excessive differences.
  • Consistent Matching Logic: Ensured exact matching (including 'ו' prefix) is prioritized, with fuzzy matching as a fallback, and updated usages across the application to apply this consistently.
  • New Unit Tests: Added comprehensive unit tests covering various scenarios for the new similarity matching logic, including exact matches, 'ו' prefix, one-letter variations, long-name tolerance, and false positive prevention.
Changelog
  • src/core/check_name.py
    • Added a WORD_PATTERN regular expression for word tokenization.
    • Refactored artist_from_song to directly call check_exact_name for all matching attempts.
    • Modified check_exact_name to include a call to _is_similar_name_match for fuzzy matching after an exact match attempt fails.
    • Introduced _tokenize_words to break text into words based on the new WORD_PATTERN.
    • Added _max_allowed_word_distance and _max_allowed_phrase_distance functions to define adaptive Levenshtein distance thresholds.
    • Implemented _is_prefix_or_suffix_expansion to prevent false positives from single-character prefix/suffix differences.
    • Included a custom _levenshtein_distance function for calculating string similarity.
    • Created _is_similar_name_match to orchestrate the fuzzy matching process, utilizing tokenization, adaptive distance thresholds, and expansion safeguards.
  • src/core/singles_sorter_v5.py
    • Removed redundant if source_name in artist_name checks in handle_album_transfer before calling check_exact_name, ensuring the new fuzzy logic is always applied.
    • Eliminated similar if source_name in split_file checks in artists_from_song (for filename-based matching) to ensure consistent application of check_exact_name.
    • Removed if source_name in artist checks in artists_from_song (for metadata-based matching) to ensure the new check_exact_name logic is consistently used.
    • Removed if source_name in title checks in artists_from_song (for title-based matching) to ensure the new check_exact_name logic is consistently used.
  • src/tests/test_check_name_similarity.py
    • Added a new test file test_check_name_similarity.py to house tests for the updated name matching logic.
    • Included test cases for verifying exact matches and 'ו' prefix support.
    • Added tests to confirm support for one-letter differences in words and short name rejection for similar differences.
    • Implemented tests for long name tolerance with multiple differences and rejection when differences are excessive.
    • Created tests to prevent false positives from prefix and suffix expansions.
    • Added tests for preventing false positives between distinct but similar names.
    • Included tests for multi-word window matching and ensuring window requires the same word count.
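
As a rough illustration of the tokenization, expansion safeguard, and fixed-size windows listed in the changelog above, here is a minimal sketch. The regular expression, the function bodies, and the decision to flag expansions of any length are assumptions; the actual WORD_PATTERN and helper implementations in the pull request may differ.

```python
import re

# Assumed pattern: runs of Unicode word characters (this covers Hebrew letters).
WORD_PATTERN = re.compile(r"\w+")


def tokenize_words(text):
    """Split text into words, dropping punctuation and separators."""
    return WORD_PATTERN.findall(text)


def is_prefix_or_suffix_expansion(candidate, target):
    """True when one word is the other plus extra leading or trailing characters."""
    if candidate == target:
        return False
    shorter, longer = sorted((candidate, target), key=len)
    return longer.startswith(shorter) or longer.endswith(shorter)


def candidate_windows(words, size):
    """Yield every run of `size` consecutive words, so that a window
    always has the same word count as the artist name being compared."""
    for start in range(len(words) - size + 1):
        yield words[start:start + size]


# The two negative examples from the pull request description:
print(is_prefix_or_suffix_expansion("אלי", "יואלי"))   # True
print(is_prefix_or_suffix_expansion("מוטי", "למוטי"))  # True
```

A window that contains any flagged word pair would be rejected before its edit distance is considered, which is how the code under review orders the two checks.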
Activity
  • The author ran python -m unittest src/tests/test_check_name_similarity.py, and all new unit tests passed.
  • Legacy assertions were run via PYTHONPATH=src/core python src/tests/legacy/assert_check_name.py and completed without errors.
  • A sampling sanity check was performed against src/core/app/singer-list.csv, which verified known negative cases and reported no unexpected false hits in sampled pairs.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a more sophisticated fuzzy matching algorithm for artist names, replacing a simple one-letter difference check with an adaptive Levenshtein distance-based approach. This is a significant improvement that should increase matching accuracy, especially for longer names, while new safeguards help prevent common false positives. The changes are well-structured, and the inclusion of a comprehensive unit test suite is excellent. I've suggested a minor refactoring to improve readability in one of the new functions.

Comment on lines +168 to +189
    total_distance = 0
    is_valid_window = True

    for candidate_word, artist_word in zip(candidate_words, artist_words):
        if _is_prefix_or_suffix_expansion(candidate_word, artist_word):
            is_valid_window = False
            break

        max_word_distance = _max_allowed_word_distance(len(artist_word))
        word_distance = _levenshtein_distance(candidate_word, artist_word, max_word_distance)

        if word_distance > max_word_distance:
            is_valid_window = False
            break

        total_distance += word_distance
        if total_distance > max_phrase_distance:
            is_valid_window = False
            break

    if is_valid_window:
        return True

Priority: medium

The use of the is_valid_window flag can be simplified by using a for...else block. The else block of a for loop is executed when the loop completes without encountering a break statement. This makes the code more concise and Pythonic.

Suggested change
- total_distance = 0
- is_valid_window = True
- for candidate_word, artist_word in zip(candidate_words, artist_words):
-     if _is_prefix_or_suffix_expansion(candidate_word, artist_word):
-         is_valid_window = False
-         break
-     max_word_distance = _max_allowed_word_distance(len(artist_word))
-     word_distance = _levenshtein_distance(candidate_word, artist_word, max_word_distance)
-     if word_distance > max_word_distance:
-         is_valid_window = False
-         break
-     total_distance += word_distance
-     if total_distance > max_phrase_distance:
-         is_valid_window = False
-         break
- if is_valid_window:
-     return True
+ total_distance = 0
+ for candidate_word, artist_word in zip(candidate_words, artist_words):
+     if _is_prefix_or_suffix_expansion(candidate_word, artist_word):
+         break
+     max_word_distance = _max_allowed_word_distance(len(artist_word))
+     word_distance = _levenshtein_distance(candidate_word, artist_word, max_word_distance)
+     if word_distance > max_word_distance:
+         break
+     total_distance += word_distance
+     if total_distance > max_phrase_distance:
+         break
+ else:
+     return True
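
For readers less familiar with for...else, the standalone example below shows the behaviour the suggestion relies on: the else branch runs only when the inner loop finishes without hitting break. The function name and data are hypothetical and unrelated to the pull request's code.

```python
def first_all_positive(rows):
    """Return the first row whose values are all positive, or None."""
    for row in rows:
        for value in row:
            if value <= 0:
                break  # this row is rejected; the else branch is skipped
        else:
            # Reached only when the inner loop completed without a break.
            return row
    return None


print(first_all_positive([[1, -1], [2, 3]]))  # [2, 3]
print(first_all_positive([[0]]))              # None
```

Note that an empty inner iterable also reaches the else branch, because no break can occur.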

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fbcd335e67

Comment on lines 727 to +729
  for source_name, target_name in self.singer_list:
-     if source_name in split_file:
-         exact = check_exact_name(split_file, source_name)
-         if exact:
-             found_artists.append(target_name)
-             break  # found an artist, no need to continue
+     exact = check_exact_name(split_file, source_name)
+     if exact:

P1: Run exact singer matching before fuzzy matching

This loop now breaks on the first check_exact_name hit, but check_exact_name includes fuzzy Levenshtein matching, so an earlier near-duplicate in self.singer_list can win before the true exact artist entry is reached. With the current src/core/app/singer-list.csv, an input like אליקם בוטה is matched to the earlier אליקים בוטה row, which routes files to the wrong artist folder even when the exact name exists later in the list.
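
One way to address this ordering problem is a two-pass scan: try exact matching against every row first, and only then fall back to fuzzy matching. The sketch below assumes the two predicates are passed in as callables. find_artist, is_exact, is_fuzzy, and the folder labels are hypothetical names, not part of the pull request.

```python
def find_artist(text, singer_list, is_exact, is_fuzzy):
    """Return the target for the first exact hit, else the first fuzzy hit."""
    # Pass 1: exact matches only, across the whole list.
    for source_name, target_name in singer_list:
        if is_exact(text, source_name):
            return target_name
    # Pass 2: fuzzy matching runs only when no row matched exactly.
    for source_name, target_name in singer_list:
        if is_fuzzy(text, source_name):
            return target_name
    return None


# Stand-in predicates for the demonstration (assumptions, not the real matchers).
singer_list = [("אליקים בוטה", "folder A"), ("אליקם בוטה", "folder B")]
exact = lambda text, name: name in text
fuzzy = lambda text, name: True  # deliberately over-permissive

print(find_artist("אליקם בוטה", singer_list, exact, fuzzy))  # folder B
```

With a single pass that accepts either kind of match, the same input would stop at the first row and return "folder A".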

    if re.search(exact_match_pattern, filename):
        return True

    return _is_similar_name_match(filename, artist_to_search)

P1: Avoid fuzzy distance checks for every candidate name

Falling back to _is_similar_name_match on every non-exact check makes each singer-list scan run edit-distance logic for almost all rows, including obvious non-candidates. In practice this is a major runtime regression for no-hit inputs (local check against singer-list.csv: ~28.5s for 200 lookups vs ~0.013s previously), so large sorting runs can slow dramatically when filenames/metadata do not match known artists.
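
A possible mitigation, not something the review prescribes, is a cheap pre-filter that rules out obvious non-candidates before any edit-distance computation. The sketch below uses two lower bounds on the Levenshtein distance: the difference in length and the surplus of characters on either side. The function name may_be_within_distance is hypothetical.

```python
from collections import Counter


def may_be_within_distance(a, b, max_distance):
    """Cheap necessary (not sufficient) test for levenshtein(a, b) <= max_distance."""
    # Each edit changes the length by at most one.
    if abs(len(a) - len(b)) > max_distance:
        return False
    # Each edit removes at most one surplus character and adds at most one
    # missing character, so the larger surplus is a lower bound on the distance.
    count_a, count_b = Counter(a), Counter(b)
    surplus_a = sum((count_a - count_b).values())
    surplus_b = sum((count_b - count_a).values())
    return max(surplus_a, surplus_b) <= max_distance


print(may_be_within_distance("abc", "xyz", 1))  # False: skip the edit distance
print(may_be_within_distance("abc", "cba", 1))  # True: still needs the real check
```

Pairs that pass the filter still need the full distance check; the filter only avoids the quadratic work for pairs that cannot match.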
