
Add SingNER dataset-driven pytest coverage and refresh existing tests#27

Open
NHLOCAL wants to merge 1 commit into dev from codex/create-comprehensive-tests-for-singner-dataset

Conversation

@NHLOCAL (Owner) commented on Feb 15, 2026

Motivation

  • Improve confidence in the singer-name matching logic by validating it against a large, real-world labeled dataset.
  • Make dataset-driven tests reproducible and runnable offline by saving a deterministic sample locally.
  • Migrate existing unittest-style checks to a native pytest layout for easier parametrization and integration.

Description

  • Converted the legacy unittest test in src/tests/test_check_name_similarity.py to a parameterized pytest test using @pytest.mark.parametrize for clearer, pytest-native cases.
  • Added a deterministic 5,000-row sample of NHLOCAL/SingNER at src/tests/data/singner_sample_5000.jsonl and a helper script src/tests/data/download_singner_sample.py to (re)download the same sample from the Hugging Face dataset.
  • Introduced a new dataset-driven test module src/tests/test_singner_dataset_matching.py that loads the sample, extracts SINGER entities and runs comprehensive validations (direct matches, vav-prefix variants, prefix/suffix negatives, no-answer checks), assembling 120+ edge cases derived from the sample.
  • Small hygiene edits to remove unused imports and random seed remnants so tests are deterministic and self-contained.
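As a rough illustration of the unittest-to-pytest conversion described above, a minimal sketch of the parametrized layout (the matcher stub and case values here are made up for illustration; the real function under test is check_exact_name from the project):

```python
import pytest


def check_exact_name(text: str, artist: str) -> bool:
    """Hypothetical stand-in for the project's matcher, used only for this sketch."""
    return artist in text.split()


@pytest.mark.parametrize(
    ("text", "artist", "expected"),
    [
        ("a live song by somebody", "somebody", True),
        ("no artist mentioned here", "someoneelse", False),
    ],
)
def test_check_exact_name(text: str, artist: str, expected: bool) -> None:
    # pytest-native replacement for self.assertTrue / self.assertFalse
    assert check_exact_name(text, artist) is expected
```

Each tuple becomes its own collected test case, so failures report the exact inputs rather than a single monolithic method.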

Testing

  • Confirmed the sample file contains exactly 5,000 rows with wc -l src/tests/data/singner_sample_5000.jsonl.
  • Verified the Hugging Face dataset can be loaded and a sample written via load_dataset('NHLOCAL/SingNER') and the included download_singner_sample.py script.
  • Ran the full test suite with pytest -q and observed all tests passing (17 passed).
  • Installed datasets as part of validation and exercised the dataset loading flow used to generate the committed sample.

Codex Task

@gemini-code-assist

Summary of Changes

Hello @NHLOCAL, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and maintainability of the singer-name matching logic by modernizing its testing infrastructure. It transitions from a legacy unittest framework to pytest and introduces a comprehensive, dataset-driven validation approach using a real-world labeled dataset. These changes aim to improve confidence in the matching algorithm's accuracy and provide a more scalable and reproducible testing environment.

Highlights

  • Test Framework Migration: Migrated existing unittest-style checks in test_check_name_similarity.py to a native pytest layout using @pytest.mark.parametrize for enhanced parametrization and integration.
  • Dataset Integration: Incorporated a deterministic 5,000-row sample of the NHLOCAL/SingNER dataset, along with a helper script to download and refresh this sample, ensuring reproducible offline testing.
  • New Dataset-Driven Tests: Introduced a new test module, test_singner_dataset_matching.py, which leverages the SingNER sample to validate the singer-name matching logic against over 120 comprehensive edge cases, including direct matches, prefix variants, and negative scenarios.
  • Code Hygiene: Performed minor hygiene edits to remove unused imports and random seed remnants, contributing to more deterministic and self-contained tests.
Changelog
  • src/tests/data/download_singner_sample.py
    • Added a new Python script to download a 5,000-row sample from the NHLOCAL/SingNER Hugging Face dataset.
    • Implemented logic to save the sample as singner_sample_5000.jsonl for offline use.
  • src/tests/test_check_name_similarity.py
    • Removed unittest imports and class structure.
    • Converted all individual test methods into a single parameterized pytest function using @pytest.mark.parametrize.
    • Updated assertions from self.assertTrue/self.assertFalse to assert ... is expected.
  • src/tests/test_singner_dataset_matching.py
    • Added a new pytest test module.
    • Included fixtures (singner_rows, singer_mentions) to load and process the SingNER dataset sample.
    • Implemented tests to verify dataset row count, singer entity matching ratio, and comprehensive edge cases derived from the dataset.
    • Added a test to ensure no matches occur when the artist is not present in the text.
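The JSONL-writing step of the download script described in the Changelog can be sketched as follows. The helper name and the sample rows below are illustrative (the real script fills `rows` via datasets.load_dataset before writing); only the text/entities field names come from the changelog above:

```python
import json
from pathlib import Path


def write_sample_jsonl(rows: list[dict], path: Path) -> None:
    """Write one JSON object per line, keeping only the text/entities fields."""
    with path.open("w", encoding="utf-8") as output_file:
        for row in rows:
            output_file.write(
                json.dumps(
                    {"text": row["text"], "entities": row["entities"]},
                    ensure_ascii=False,  # preserve Hebrew text as-is
                )
                + "\n"
            )


# Illustrative usage with fabricated rows; extra fields are dropped on write.
sample = [{"text": "demo", "entities": [], "split": "train"}]
write_sample_jsonl(sample, Path("sample.jsonl"))
```

ensure_ascii=False keeps the Hebrew entity strings human-readable in the committed file instead of escaping them to \uXXXX sequences.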
Activity
  • No specific activity (comments, reviews, or progress updates) has been recorded for this pull request.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76b222577d




def main() -> None:
    dataset = load_dataset("NHLOCAL/SingNER", split="train")


P2: Pin SingNER dataset revision before sampling

The sampling script currently calls load_dataset("NHLOCAL/SingNER", split="train") without a pinned revision, so rerunning it after any upstream dataset update can produce a different 5,000-row file even though the script/docstring describes the sample as deterministic. This undermines reproducibility for anyone regenerating singner_sample_5000.jsonl in a later environment; pinning a specific dataset revision (or commit hash) avoids silent drift.
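A minimal sketch of the suggested fix. The revision value below is a placeholder, not an actual SingNER commit hash, and the helper name is invented for this sketch; load_dataset's revision parameter accepts a branch, tag, or commit hash on the Hugging Face Hub:

```python
def load_pinned_singner(revision: str = "<dataset-commit-sha>"):
    """Load the SingNER train split pinned to a fixed dataset revision.

    Pinning `revision` keeps regenerated samples identical even after
    upstream updates to the dataset repository.
    """
    # Imported lazily so this module can be inspected without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("NHLOCAL/SingNER", split="train", revision=revision)
```

Recording the pinned hash in the script (or its docstring) also documents exactly which dataset state the committed sample was drawn from.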



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a great pull request that significantly improves the testing strategy for the singer-name matching logic. Migrating the existing tests to use pytest.mark.parametrize is a solid improvement for readability, and the introduction of dataset-driven tests using a real-world sample from SingNER will greatly increase confidence in the algorithm's correctness. The new download script is also a welcome addition for reproducibility.

I've added a few comments with suggestions for improvement, mainly around enhancing test correctness and maintainability by removing magic numbers and improving code conciseness. Overall, these are excellent changes.

@pytest.fixture(scope="module")
def singner_rows() -> list[dict]:
    rows = _load_dataset_sample()
    assert len(rows) >= 5000, "Expected at least 5000 rows in sample"


high

The assertion len(rows) >= 5000 is too lenient. The dataset sample is expected to have exactly 5000 rows, as specified in the download script (SAMPLE_SIZE = 5000) and confirmed in test_dataset_row_count_is_significant. Using == ensures the fixture provides the exact data the tests expect, preventing potential issues if the sample file changes unexpectedly.

Suggested change:
- assert len(rows) >= 5000, "Expected at least 5000 rows in sample"
+ assert len(rows) == 5000, "Expected exactly 5000 rows in sample"

Comment on lines +20 to +27
for row in sample:
    output_file.write(
        json.dumps(
            {"text": row["text"], "entities": row["entities"]},
            ensure_ascii=False,
        )
        + "\n"
    )


medium

This loop can be written more concisely and potentially more performantly using a generator expression inside str.join(). This avoids repeated write() calls within the loop and is generally more Pythonic.

Suggested change:
- for row in sample:
-     output_file.write(
-         json.dumps(
-             {"text": row["text"], "entities": row["entities"]},
-             ensure_ascii=False,
-         )
-         + "\n"
-     )
+ output_file.write(
+     "\n".join(
+         json.dumps(
+             {"text": row["text"], "entities": row["entities"]},
+             ensure_ascii=False,
+         )
+         for row in sample
+     )
+     + "\n"
+ )

Comment on lines +65 to +115
def test_singer_entities_match_in_original_text(singer_mentions: list[tuple[str, str]]) -> None:
    sampled_mentions = singer_mentions[:1500]
    matched = sum(1 for text, singer in sampled_mentions if check_exact_name(text, singer))
    ratio = matched / len(sampled_mentions)

    assert ratio >= 0.98, f"Unexpectedly low match ratio on dataset sample: {ratio:.3f}"


def test_comprehensive_edge_cases_from_dataset(singer_mentions: list[tuple[str, str]]) -> None:
    unique_singers = _pick_unique_singers(singer_mentions, limit=300)
    assert len(unique_singers) >= 120

    edge_cases: list[tuple[str, str, bool]] = []

    # 1) Positive direct cases from dataset (40)
    for text, singer in singer_mentions[:40]:
        edge_cases.append((text, singer, True))

    # 2) Positive with optional Hebrew vav-prefix (20)
    vav_candidates = [s for s in unique_singers if not s.startswith("ו")]
    for singer in vav_candidates[:20]:
        edge_cases.append((f"שיר חי עם ו{singer}", singer, True))

    # 3) Negative: prefix expansion should not match (20)
    for singer in unique_singers[20:40]:
        edge_cases.append((f"ל{singer} בהופעה", singer, False))

    # 4) Negative: suffix expansion should not match (20)
    for singer in unique_singers[40:60]:
        edge_cases.append((f"{singer}ל בהופעה", singer, False))

    # 5) Negative: text with no realistic artist mention (20)
    for singer in unique_singers[60:80]:
        edge_cases.append(("zzzzq xxyy 12345 ללא זמר מזוהה", singer, False))

    assert len(edge_cases) >= 120

    failed_cases = [
        (filename, artist, expected)
        for filename, artist, expected in edge_cases
        if check_exact_name(filename, artist) is not expected
    ]

    assert not failed_cases, f"Found {len(failed_cases)} failing edge cases"


def test_no_answer_when_artist_not_present(singer_mentions: list[tuple[str, str]]) -> None:
    unique_singers = _pick_unique_singers(singer_mentions, limit=100)
    text_without_singer = "גרסת בדיקה ללא התאמה כלל 987654321"

    assert all(not check_exact_name(text_without_singer, singer) for singer in unique_singers)


medium

This test file uses several "magic numbers" which make the tests harder to read and maintain. Please extract these numbers into named constants at the module level. This will make the intent of the code clearer and future modifications easier.

Examples of magic numbers to replace:

  • In test_singer_entities_match_in_original_text: 1500 and 0.98.
  • In test_comprehensive_edge_cases_from_dataset: 300, 120, 40, 20, and the slicing indices.
  • In test_no_answer_when_artist_not_present: 100.

For example:

# At module level
SAMPLED_MENTIONS_COUNT = 1500
MIN_MATCH_RATIO = 0.98
UNIQUE_SINGER_LIMIT = 300
# ... etc.

# In test_singer_entities_match_in_original_text
def test_singer_entities_match_in_original_text(singer_mentions: list[tuple[str, str]]) -> None:
    sampled_mentions = singer_mentions[:SAMPLED_MENTIONS_COUNT]
    # ...
    assert ratio >= MIN_MATCH_RATIO, f"..."

