
Correct label errors in locomo10.json QA #27

@dial481

I'm running the benchmark on a new project I've been working on, and it surfaced three label errors in locomo10.json that penalize correct model reasoning.

| Question | Line | Issue | Answer key says | Transcript says | Error type |
|---|---|---|---|---|---|
| Q1 | 21 | Wrong field | "Psychology, counseling certification" | "counseling or working in mental health" (no psychology, no certification) | Hallucination |
| Q2 | 46 | Wrong day | "The sunday before 25 May 2023" | "last Saturday" | Temporal mismatch |
| Q3 | 1155 | Wrong person | Caroline shared "An abstract painting with blue streaks on a wall" | Caroline shared a drawing; Melanie shared the paintings | Speaker misattribution |

Line 21:

```json
{
  "question": "What fields would Caroline be likely to pursue in her educaton?",
  "answer": "Psychology, counseling certification",
  "evidence": [
    "D1:9",
    "D1:11"
  ],
  "category": 3
},
```

Issue: "Psychology" never appears in the Caroline and Melanie conversation, while counseling and mental health are mentioned explicitly. "Psychology" does appear in a different conversation, between Tim and John (line 28587). "Certificate" and "certification" likewise appear only in other conversations, never between Caroline and Melanie.

Line 4410:

```json
[
  "Caroline is planning to continue her education and explore career options in counseling or mental health to support those with similar issues.",
  "D1:9"
]
```

Line 1708:

```json
{
  "speaker": "Caroline",
  "dia_id": "D1:11",
  "text": "I'm keen on counseling or working in mental health - I'd love to support those with similar issues."
},
```
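
One way to reproduce the Q1 check is to scan every turn of a conversation for a keyword. This is a minimal sketch that assumes, based on the excerpts in this issue, that locomo10.json is a JSON list of samples, each with a `conversation` dict whose `session_<n>` keys hold lists of turns carrying `speaker`, `dia_id`, and `text`. The helpers `load_samples`, `iter_turns`, and `find_keyword` are hypothetical, not part of the benchmark. The demo runs only on the D1:11 turn quoted from line 1708; pointing it at the full Caroline and Melanie conversation is the real test.

```python
import json
import re
from typing import Iterator


def load_samples(path: str) -> list[dict]:
    """Load locomo10.json, assumed to be a JSON list of sample dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_turns(conversation: dict) -> Iterator[dict]:
    """Yield every turn from the session_<n> lists of one conversation.

    Keys such as "session_2_date_time" hold strings, so only keys that are
    exactly "session_<digits>" and map to a list are treated as sessions.
    """
    for key, value in conversation.items():
        if re.fullmatch(r"session_\d+", key) and isinstance(value, list):
            yield from value


def find_keyword(conversation: dict, keyword: str) -> list[str]:
    """Return the dia_id of each turn whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        turn["dia_id"]
        for turn in iter_turns(conversation)
        if needle in turn.get("text", "").lower()
    ]


# Excerpt: only the D1:11 turn quoted from line 1708 of locomo10.json.
excerpt = {
    "session_1": [
        {
            "speaker": "Caroline",
            "dia_id": "D1:11",
            "text": "I'm keen on counseling or working in mental health - "
            "I'd love to support those with similar issues.",
        }
    ]
}

# The stem "certif" covers both "certificate" and "certification".
for word in ("psychology", "certif", "counseling", "mental health"):
    print(f"{word!r}: {find_keyword(excerpt, word)}")

# On the real file (path is an assumption), repeat per sample:
# for sample in load_samples("locomo10.json"):
#     print(find_keyword(sample["conversation"], "psychology"))
```

A plain substring match is crude, so any hit still needs a manual read, but an empty result for "psychology" across the whole conversation is what the issue claims.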

Line 46:

```json
{
  "question": "When did Melanie run a charity race?",
  "answer": "The sunday before 25 May 2023",
  "evidence": [
    "D2:1"
  ],
```

Issue: The session 2 transcript, dated 25 May 2023, says Melanie ran the race "last Saturday", not on a Sunday.

Line 1754:

```json
"session_2_date_time": "1:14 pm on 25 May, 2023",
"session_2": [
  {
    "speaker": "Melanie",
    "dia_id": "D2:1",
    "text": "Hey Caroline, since we last chatted, I've had a lot of things happening to me. I ran a charity race for mental health last Saturday \u2013 it was really rewarding. Really made me think about taking care of our minds."
  },
```
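
The two labels differ by exactly one calendar day. 25 May 2023 was a Thursday, so "last Saturday" resolves to 20 May 2023, while the answer key's "Sunday before 25 May 2023" resolves to 21 May 2023. The sketch below reproduces that arithmetic from the `session_2_date_time` string, assuming "last Saturday" means the most recent Saturday before the session and an English locale for month and am/pm names. `session_date` and `last_weekday_before` are hypothetical helpers, not benchmark code.

```python
from datetime import date, datetime, timedelta

SATURDAY, SUNDAY = 5, 6  # date.weekday(): Monday == 0 ... Sunday == 6


def session_date(date_time: str) -> date:
    """Parse a session timestamp shaped like "1:14 pm on 25 May, 2023"."""
    return datetime.strptime(date_time, "%I:%M %p on %d %B, %Y").date()


def last_weekday_before(anchor: date, weekday: int) -> date:
    """Return the most recent date strictly before anchor that falls on weekday."""
    # "or 7": if anchor is already that weekday, go back a full week.
    days_back = (anchor.weekday() - weekday) % 7 or 7
    return anchor - timedelta(days=days_back)


anchor = session_date("1:14 pm on 25 May, 2023")  # session_2_date_time
print(anchor.strftime("%A"), anchor)
print("last Saturday (transcript):", last_weekday_before(anchor, SATURDAY))
print("the Sunday before (answer key):", last_weekday_before(anchor, SUNDAY))
```

Under either common reading of "last Saturday" (the most recent Saturday, or the Saturday of the previous week) the race falls on a Saturday, never on a Sunday.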

Line 1155:

```json
{
  "question": "What kind of painting did Caroline share with Melanie on October 13, 2023?",
  "answer": "An abstract painting with blue streaks on a wall.",
  "evidence": [
    "D17:14"
  ],
  "category": 4
},
```

Issue: The answer key attributes Melanie's painting to Caroline. According to the transcript, Caroline shared a drawing of a woman in a dress (D17:14, BLIP caption: "a photo of a drawing of a woman in a dress"). Melanie shared the paintings, including the one with "blue streaks on a wall." This makes the question unintentionally adversarial: a model that correctly attributes the painting to Melanie is marked wrong.
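
The misattribution can be checked mechanically by resolving each evidence `dia_id` to its turn and reading off the speaker and image caption. This sketch assumes the LoCoMo layout seen in the excerpts in this issue (turns stored in `session_<n>` lists, `blip_caption` present only on image turns) and that D17 turns live under `session_17`. `index_turns` and `describe_evidence` are hypothetical helpers, and the conversation excerpt keeps only the D17:14 fields quoted from line 3977.

```python
import re


def index_turns(conversation: dict) -> dict[str, dict]:
    """Map dia_id -> turn across all session_<n> lists of one conversation."""
    index = {}
    for key, turns in conversation.items():
        if re.fullmatch(r"session_\d+", key) and isinstance(turns, list):
            for turn in turns:
                index[turn["dia_id"]] = turn
    return index


def describe_evidence(qa: dict, conversation: dict) -> list[tuple[str, str, str]]:
    """Return (dia_id, speaker, blip_caption) for each evidence turn of a QA entry.

    Turns without an image are reported with an empty caption; evidence ids
    missing from the conversation are skipped.
    """
    turns = index_turns(conversation)
    return [
        (ev, turns[ev]["speaker"], turns[ev].get("blip_caption", ""))
        for ev in qa["evidence"]
        if ev in turns
    ]


# Q3 as it appears at line 1155 of locomo10.json.
qa = {
    "question": "What kind of painting did Caroline share with Melanie on October 13, 2023?",
    "answer": "An abstract painting with blue streaks on a wall.",
    "evidence": ["D17:14"],
    "category": 4,
}

# Excerpt of the D17:14 turn, limited to the fields quoted from line 3977.
conversation = {
    "session_17": [
        {
            "speaker": "Caroline",
            "dia_id": "D17:14",
            "img_url": ["https://i.redd.it/50qvgfuva33b1.jpg"],
            "blip_caption": "a photo of a drawing of a woman in a dress",
        }
    ]
}

for dia_id, speaker, caption in describe_evidence(qa, conversation):
    mentions_painting = "painting" in caption.lower()
    print(f"{dia_id}: {speaker} shared {caption!r}; mentions painting: {mentions_painting}")
```

Because the entry's only evidence turn is Caroline's drawing, nothing in the cited evidence supports the painting answer.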

Line 3977:

```json
"speaker": "Caroline",
"img_url": [
  "https://i.redd.it/50qvgfuva33b1.jpg"
],
"blip_caption": "a photo of a drawing of a woman in a dress",
```
