While running the benchmark against a new project I've been working on, I found three label errors in locomo10.json that penalize correct model reasoning.
| Question | Line | Issue | Answer Key Says | Transcript Says | Error Type |
|---|---|---|---|---|---|
| Q1 | 21 | Wrong field | "Psychology, counseling certification" | "counseling and mental health" (no psychology, no certification) | Hallucination |
| Q2 | 46 | Wrong day | "Sunday before 25 May 2023" | "last Saturday" | Temporal Mismatch |
| Q3 | 1155 | Wrong person | Caroline shared "abstract painting with blue streaks" | Caroline shared drawing; Melanie shared paintings | Speaker Misattribution |
Line 21:
{
"question": "What fields would Caroline be likely to pursue in her educaton?",
"answer": "Psychology, counseling certification",
"evidence": [
"D1:9",
"D1:11"
],
"category": 3
},
Issue: Psychology is not mentioned anywhere in that conversation, whereas counseling and mental health are mentioned explicitly. "Psychology" does appear in another conversation, between Tim and John (line 28587). "Certificate" and "certification" also appear in other conversations, but not in the one between Caroline and Melanie.
Line 4410:
[
"Caroline is planning to continue her education and explore career options in counseling or mental health to support those with similar issues.",
"D1:9"
]
Line 1708:
{
"speaker": "Caroline",
"dia_id": "D1:11",
"text": "I'm keen on counseling or working in mental health - I'd love to support those with similar issues."
},
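A check like this can be scripted. The sketch below is only an illustration: `unsupported_terms` is a hypothetical helper, the data is copied from the excerpts above, and the D1:9 turn is left out because its text is not quoted in this issue. It uses plain substring matching, so it flags candidates for manual review rather than proving an error.

```python
import re

# QA entry and evidence turn copied from the excerpts above.
qa = {
    "question": "What fields would Caroline be likely to pursue in her educaton?",
    "answer": "Psychology, counseling certification",
    "evidence": ["D1:9", "D1:11"],
}
turns = [
    {
        "speaker": "Caroline",
        "dia_id": "D1:11",
        "text": "I'm keen on counseling or working in mental health - "
                "I'd love to support those with similar issues.",
    },
]

def unsupported_terms(qa, turns):
    """Return answer words that never occur in the cited evidence turns.

    Hypothetical helper: evidence IDs missing from `turns` are skipped.
    """
    by_id = {t["dia_id"]: t for t in turns}
    evidence_text = " ".join(
        by_id[e]["text"] for e in qa["evidence"] if e in by_id
    ).lower()
    answer_words = re.findall(r"[a-z]+", qa["answer"].lower())
    return [w for w in answer_words if w not in evidence_text]

print(unsupported_terms(qa, turns))  # ['psychology', 'certification']
```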
Line 46:
{
"question": "When did Melanie run a charity race?",
"answer": "The sunday before 25 May 2023",
"evidence": [
"D2:1"
],
Issue: The transcript says "last Saturday", not Sunday as the answer key states.
Line 1754:
"session_2_date_time": "1:14 pm on 25 May, 2023",
"session_2": [
{
"speaker": "Melanie",
"dia_id": "D2:1",
"text": "Hey Caroline, since we last chatted, I've had a lot of things happening to me. I ran a charity race for mental health last Saturday \u2013 it was really rewarding. Really made me think about taking care of our minds."
},
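To make the mismatch concrete, here is a small sketch that resolves both phrases against the session date. It assumes "last Saturday" means the most recent Saturday strictly before the session date; `previous_weekday` is a hypothetical helper, not part of the benchmark.

```python
from datetime import date, timedelta

def previous_weekday(anchor, weekday):
    """Most recent date strictly before `anchor` on `weekday` (Mon=0 .. Sun=6)."""
    days_back = (anchor.weekday() - weekday - 1) % 7 + 1
    return anchor - timedelta(days=days_back)

# Session 2 timestamp from the transcript: "1:14 pm on 25 May, 2023"
session = date(2023, 5, 25)

saturday = previous_weekday(session, 5)  # what the transcript says
sunday = previous_weekday(session, 6)    # what the answer key says

print(session.strftime("%A"))  # Thursday
print(saturday.isoformat())    # 2023-05-20
print(sunday.isoformat())      # 2023-05-21
```

Under that reading the two labels point to different calendar days, so an answer of 20 May 2023 would be marked wrong against the current key.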
Line 1155:
{
"question": "What kind of painting did Caroline share with Melanie on October 13, 2023?",
"answer": "An abstract painting with blue streaks on a wall.",
"evidence": [
"D17:14"
],
"category": 4
},
Issue: The answer key attributes Melanie's painting to Caroline. According to the transcript, Caroline shared a drawing of a woman in a dress (D17:14, BLIP caption: "a photo of a drawing of a woman in a dress"). Melanie shared the paintings, including the one with "blue streaks on a wall." This creates an unintentionally adversarial question.
Line 3977:
"speaker": "Caroline",
"img_url": [
"https://i.redd.it/50qvgfuva33b1.jpg"
],
"blip_caption": "a photo of a drawing of a woman in a dress",