Refine fuzzy singer matching with adaptive Levenshtein and safeguards by NHLOCAL · Pull Request #26 · NHLOCAL/Singles-Sorter

NHLOCAL · 2026-02-15T03:17:39Z

Motivation

Improve the previous one-letter fuzzy matcher to better handle long artist names while avoiding new false positives.
Make matching robust to small OCR/typo errors in filenames and metadata while preserving exact and ו-prefixed matches.
Prevent spurious matches introduced by simple prefix/suffix expansions (for example אלי vs יואלי or מוטי vs למוטי).

Description

Replaced the prior one-letter-only logic with an adaptive Levenshtein-based matcher in src/core/check_name.py that sets per-word and per-phrase distance thresholds via _max_allowed_word_distance and _max_allowed_phrase_distance.
Added tokenization (_tokenize_words), an optimized Levenshtein implementation (_levenshtein_distance), and a safeguard to reject prefix/suffix expansions (_is_prefix_or_suffix_expansion).
Kept exact matching (including the optional ו prefix) as the first check in check_exact_name; fuzzy matching runs only when the exact search fails.
Updated usages in src/core/singles_sorter_v5.py to call check_exact_name directly (removed the preliminary if source_name in ... containment check) so the new fuzzy logic is applied consistently.
Added unit tests in src/tests/test_check_name_similarity.py covering exact matches, ו prefix, one-letter variations, long-name tolerance, rejection on excessive differences, prevention of prefix/suffix false-positives, multi-word window matching, and window-size requirements.
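
For readers without the diff at hand, here is a minimal sketch of the two building blocks named above: a length-dependent threshold and an edit distance that gives up once the bound is exceeded. It is not the code in src/core/check_name.py. The function names drop the leading underscore, and the cutoffs (0 edits up to 3 letters, 1 up to 7, 2 beyond) are assumptions chosen only for illustration.

```python
def max_allowed_word_distance(word_length: int) -> int:
    """Hypothetical thresholds: longer words tolerate more edits."""
    if word_length <= 3:
        return 0
    if word_length <= 7:
        return 1
    return 2


def levenshtein_distance(a: str, b: str, max_distance: int) -> int:
    """Edit distance between a and b with an early exit.

    Returns the exact distance when it is at most max_distance,
    otherwise some value greater than max_distance.
    """
    # The length gap is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_distance:
        return max_distance + 1

    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        # Row minima never decrease, so the bound can no longer be met.
        if min(current) > max_distance:
            return max_distance + 1
        previous = current
    return previous[-1]


if __name__ == "__main__":
    limit = max_allowed_word_distance(len("אליקים"))
    print(levenshtein_distance("אליקם", "אליקים", limit) <= limit)
```

Passing the threshold into the distance function lets the inner loop stop as soon as a word cannot match, which matters when the check runs against every row of a long singer list.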

Testing

Ran python -m unittest src/tests/test_check_name_similarity.py and all tests passed (OK).
Ran legacy assertions via PYTHONPATH=src/core python src/tests/legacy/assert_check_name.py which completed without errors.
Performed a sampling sanity check against src/core/app/singer-list.csv that verified known negative case יואלי קליין vs אלי קליין and reported no unexpected false hits in the sampled pairs.

gemini-code-assist · 2026-02-15T03:17:57Z

Summary of Changes

Hello @NHLOCAL, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and accuracy of singer name matching by introducing an adaptive Levenshtein distance algorithm. The changes aim to improve the system's ability to identify artists despite minor discrepancies or longer names, while critically avoiding incorrect associations that could arise from partial word matches or common prefixes. This ensures more reliable artist identification across the application.

Highlights

Improved Fuzzy Matching: Replaced the previous one-letter fuzzy matcher with an adaptive Levenshtein-based algorithm to better handle long artist names and tolerate small OCR/typo errors.
False Positive Safeguards: Implemented logic to prevent spurious matches from simple prefix/suffix expansions (e.g., 'אלי' vs 'יואלי') and excessive differences.
Consistent Matching Logic: Ensured exact matching (including 'ו' prefix) is prioritized, with fuzzy matching as a fallback, and updated usages across the application to apply this consistently.
New Unit Tests: Added comprehensive unit tests covering various scenarios for the new similarity matching logic, including exact matches, 'ו' prefix, one-letter variations, long-name tolerance, and false positive prevention.

Changelog

src/core/check_name.py
- Added a WORD_PATTERN regular expression for word tokenization.
- Refactored artist_from_song to directly call check_exact_name for all matching attempts.
- Modified check_exact_name to include a call to _is_similar_name_match for fuzzy matching after an exact match attempt fails.
- Introduced _tokenize_words to break text into words based on the new WORD_PATTERN.
- Added _max_allowed_word_distance and _max_allowed_phrase_distance functions to define adaptive Levenshtein distance thresholds.
- Implemented _is_prefix_or_suffix_expansion to prevent false positives from single-character prefix/suffix differences.
- Included a custom _levenshtein_distance function for calculating string similarity.
- Created _is_similar_name_match to orchestrate the fuzzy matching process, utilizing tokenization, adaptive distance thresholds, and expansion safeguards.
src/core/singles_sorter_v5.py
- Removed redundant if source_name in artist_name checks in handle_album_transfer before calling check_exact_name, ensuring the new fuzzy logic is always applied.
- Eliminated similar if source_name in split_file checks in artists_from_song (for filename-based matching) to ensure consistent application of check_exact_name.
- Removed if source_name in artist checks in artists_from_song (for metadata-based matching) to ensure the new check_exact_name logic is consistently used.
- Removed if source_name in title checks in artists_from_song (for title-based matching) to ensure the new check_exact_name logic is consistently used.
src/tests/test_check_name_similarity.py
- Added a new test file test_check_name_similarity.py to house tests for the updated name matching logic.
- Included test cases for verifying exact matches and 'ו' prefix support.
- Added tests to confirm support for one-letter differences in words and short name rejection for similar differences.
- Implemented tests for long name tolerance with multiple differences and rejection when differences are excessive.
- Created tests to prevent false positives from prefix and suffix expansions.
- Added tests for preventing false positives between distinct but similar names.
- Included tests for multi-word window matching and ensuring window requires the same word count.
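
The tokenization, expansion safeguard, and fixed-size window described in this changelog can be sketched as follows. This is an illustration under assumptions, not the pull request's code: the pattern `\w+` is a guess at what WORD_PATTERN matches, the helper names are written without the leading underscore, candidate_windows is a hypothetical name, and the expansion check here rejects extra leading or trailing characters of any length.

```python
import re

# Assumed pattern: runs of Unicode word characters (Hebrew and Latin letters,
# but also digits and underscores).
WORD_PATTERN = re.compile(r"\w+")


def tokenize_words(text: str) -> list:
    """Split text into words, dropping punctuation and whitespace."""
    return WORD_PATTERN.findall(text)


def is_prefix_or_suffix_expansion(candidate: str, artist: str) -> bool:
    """True when one word is the other plus extra leading or trailing characters."""
    if candidate == artist:
        return False
    shorter, longer = sorted((candidate, artist), key=len)
    return longer.startswith(shorter) or longer.endswith(shorter)


def candidate_windows(text: str, artist: str):
    """Yield each run of consecutive words in text that has the artist's word count."""
    words = tokenize_words(text)
    size = len(tokenize_words(artist))
    if size == 0:
        return
    for start in range(len(words) - size + 1):
        yield words[start:start + size]


if __name__ == "__main__":
    for window in candidate_windows("יואלי קליין - שיר חדש", "אלי קליין"):
        print(window, is_prefix_or_suffix_expansion(window[0], "אלי"))
```

Comparing only windows with the same number of words as the artist name is what keeps a two-word name from being matched against a single long token.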

Activity

The author ran python -m unittest src/tests/test_check_name_similarity.py, and all new unit tests passed.
Legacy assertions were run via PYTHONPATH=src/core python src/tests/legacy/assert_check_name.py and completed without errors.
A sampling sanity check was performed against src/core/app/singer-list.csv, which verified known negative cases and reported no unexpected false hits in sampled pairs.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a more sophisticated fuzzy matching algorithm for artist names, replacing a simple one-letter difference check with an adaptive Levenshtein distance-based approach. This is a significant improvement that should increase matching accuracy, especially for longer names, while new safeguards help prevent common false positives. The changes are well-structured, and the inclusion of a comprehensive unit test suite is excellent. I've suggested a minor refactoring to improve readability in one of the new functions.

gemini-code-assist · 2026-02-15T03:20:01Z

src/core/check_name.py

+        total_distance = 0
+        is_valid_window = True
+
+        for candidate_word, artist_word in zip(candidate_words, artist_words):
+            if _is_prefix_or_suffix_expansion(candidate_word, artist_word):
+                is_valid_window = False
+                break
+
+            max_word_distance = _max_allowed_word_distance(len(artist_word))
+            word_distance = _levenshtein_distance(candidate_word, artist_word, max_word_distance)
+
+            if word_distance > max_word_distance:
+                is_valid_window = False
+                break
+
+            total_distance += word_distance
+            if total_distance > max_phrase_distance:
+                is_valid_window = False
+                break
+
+        if is_valid_window:
+            return True


The use of the is_valid_window flag can be simplified by using a for...else block. The else block of a for loop is executed when the loop completes without encountering a break statement. This makes the code more concise and Pythonic.

Suggested change

-        total_distance = 0
-        is_valid_window = True
-
-        for candidate_word, artist_word in zip(candidate_words, artist_words):
-            if _is_prefix_or_suffix_expansion(candidate_word, artist_word):
-                is_valid_window = False
-                break
-
-            max_word_distance = _max_allowed_word_distance(len(artist_word))
-            word_distance = _levenshtein_distance(candidate_word, artist_word, max_word_distance)
-
-            if word_distance > max_word_distance:
-                is_valid_window = False
-                break
-
-            total_distance += word_distance
-            if total_distance > max_phrase_distance:
-                is_valid_window = False
-                break
-
-        if is_valid_window:
-            return True
+        total_distance = 0
+
+        for candidate_word, artist_word in zip(candidate_words, artist_words):
+            if _is_prefix_or_suffix_expansion(candidate_word, artist_word):
+                break
+
+            max_word_distance = _max_allowed_word_distance(len(artist_word))
+            word_distance = _levenshtein_distance(candidate_word, artist_word, max_word_distance)
+
+            if word_distance > max_word_distance:
+                break
+
+            total_distance += word_distance
+            if total_distance > max_phrase_distance:
+                break
+        else:
+            return True
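
The for...else behaviour that this suggestion relies on can be checked in isolation. The sketch below is a standalone illustration with made-up names (first_window_within_budget, integer costs standing in for per-word distances), not code from the pull request.

```python
def first_window_within_budget(windows, budget):
    """Return the first window whose running total stays within budget, else None."""
    for window in windows:
        total = 0
        for cost in window:
            total += cost
            if total > budget:
                break  # window rejected; the else clause below is skipped
        else:
            # Reached only when the inner loop finished without hitting break.
            return window
    return None


if __name__ == "__main__":
    print(first_window_within_budget([[2, 2], [1, 0], [0, 0]], budget=1))
```

Note that the else clause also runs when the loop body never executes, so an empty window is accepted. The flag-based version behaves the same way, because is_valid_window starts as True.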

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fbcd335e67

chatgpt-codex-connector · 2026-02-15T03:25:47Z

src/core/singles_sorter_v5.py

        for source_name, target_name in self.singer_list:
-            if source_name in split_file:
-                exact = check_exact_name(split_file, source_name)
-                if exact:
-                    found_artists.append(target_name)
-                    break  # Found an artist, no need to continue
+            exact = check_exact_name(split_file, source_name)
+            if exact:


Run exact singer matching before fuzzy matching

This loop now breaks on the first check_exact_name hit, but check_exact_name includes fuzzy Levenshtein matching, so an earlier near-duplicate in self.singer_list can win before the true exact artist entry is reached. With the current src/core/app/singer-list.csv, an input like אליקם בוטה is matched to the earlier אליקים בוטה row, which routes files to the wrong artist folder even when the exact name exists later in the list.
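
One way to get the ordering this comment asks for is to scan the whole list for an exact hit before any fuzzy comparison is attempted. A minimal sketch, assuming the exact and fuzzy checks can be called separately; find_artist, is_exact, and is_fuzzy are hypothetical names, and the lambdas in the demo are stand-ins rather than the project's matchers.

```python
def find_artist(text, singer_list, is_exact, is_fuzzy):
    """Return the target of the first exact hit, else of the first fuzzy hit, else None."""
    # Pass 1: exact matches win regardless of their position in the list.
    for source_name, target_name in singer_list:
        if is_exact(text, source_name):
            return target_name
    # Pass 2: fuzzy matching only when no row matched exactly.
    for source_name, target_name in singer_list:
        if is_fuzzy(text, source_name):
            return target_name
    return None


if __name__ == "__main__":
    # The near-duplicate row comes first, as in the case described above.
    singers = [("אליקים בוטה", "target A"), ("אליקם בוטה", "target B")]
    exact = lambda text, name: name in text   # stand-in for the exact check
    fuzzy = lambda text, name: True           # stand-in that accepts anything
    print(find_artist("אליקם בוטה - שיר", singers, exact, fuzzy))
```

The cost is a second pass over the list on a miss; whether that is acceptable depends on how expensive the fuzzy check is.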

chatgpt-codex-connector · 2026-02-15T03:25:47Z

src/core/check_name.py

    if re.search(exact_match_pattern, filename):
        return True

+    return _is_similar_name_match(filename, artist_to_search)


Avoid fuzzy distance checks for every candidate name

Falling back to _is_similar_name_match on every non-exact check makes each singer-list scan run edit-distance logic for almost all rows, including obvious non-candidates. In practice this is a major runtime regression for no-hit inputs (local check against singer-list.csv: ~28.5s for 200 lookups vs ~0.013s previously), so large sorting runs can slow dramatically when filenames/metadata do not match known artists.
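
A common way to avoid paying for edit distance on obvious non-candidates is a cheap lower-bound test that runs first. The sketch below is one such prefilter, offered as an assumption about what could help rather than a measured fix for the regression reported here; could_be_within_distance is a hypothetical helper. It relies on two facts: the length gap and the bag distance (the count of characters left unmatched on the larger side) can each shrink by at most one per edit, so neither can exceed the true Levenshtein distance.

```python
from collections import Counter


def could_be_within_distance(a: str, b: str, max_distance: int) -> bool:
    """Cheap necessary condition for edit distance(a, b) <= max_distance.

    False means the pair can be skipped safely; True means the full
    edit-distance computation is still needed.
    """
    if abs(len(a) - len(b)) > max_distance:
        return False
    count_a, count_b = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts.
    only_in_a = sum((count_a - count_b).values())
    only_in_b = sum((count_b - count_a).values())
    return max(only_in_a, only_in_b) <= max_distance


if __name__ == "__main__":
    print(could_be_within_distance("kitten", "sitting", 2))
    print(could_be_within_distance("abc", "acb", 2))
```

The filter never rejects a pair that the full check would accept, so it changes running time only, not results.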

Refine fuzzy singer matching with adaptive distance safeguards

fbcd335

NHLOCAL added the codex label Feb 15, 2026 — with ChatGPT Codex Connector

gemini-code-assist bot reviewed Feb 15, 2026

Refactor fuzzy match loop to Pythonic for-else

bea872d

chatgpt-codex-connector bot reviewed Feb 15, 2026

Update .gitignore

12e6147

NHLOCAL wants to merge 3 commits into main from codex/improve-artist-name-recognition-logic
