LoCoMo Benchmark Audit

Independent audit of the LoCoMo (Long-term Conversational Memory) benchmark and the EverMemOS evaluation framework. Findings cover ground truth errors in the dataset, evaluation methodology differences across implementations, token cost misrepresentation, judge leniency, and third-party reproducibility failures. Every claim links to a verifiable primary source.

Key Findings

Finding	Detail	Source
Ground truth errors	99 of 1,540 questions (6.4%) have wrong golden answers. Theoretical scoring ceiling is 93.57%.	AUDIT_REPORT.md
Total token cost	EverMemOS README claims 2,298 avg tokens per question. The paper's own Table 8 (arXiv:2601.02163v2) shows 6,669 with GPT-4.1-mini (2.9x higher; 6,045 with GPT-4o-mini). Real reduction vs. full-context is 67%, not 89%.	methodology/token_efficiency.md
Judge accepts wrong answers	62.81% of intentionally wrong vague-but-topical answers accepted by the LLM judge.	ap-baseline/README.md
Scores exceed corrupted ceiling	EverMemOS single-hop (95.96%) and multi-hop (91.37%) scores exceed their category ceilings (95.72% and 90.07%), which is mathematically impossible without receiving credit for answers that match wrong golden answers. Overall 92.32% is within 1.25 points of the 93.57% aggregate ceiling.	results-audit/RESULTS_AUDIT.md
Not apples-to-apples	EverMemOS uses 2-3 sequential LLM calls, a 729-token CoT prompt, and agentic retrieval. All other systems use 1 call and a simple prompt with no such overhead, yet all are reported in the same "Avg. Tokens" column.	methodology/token_efficiency.md, methodology/prompts.md
Reproducibility failures	Third parties report 38.38% vs. the claimed 92.32% (EverMemOS#73). Multiple Mem0 reproducibility issues remain open.	methodology/reproducibility.md
Full-context baseline exceeds EverMemOS	GPT-4.1-mini with `answer_prompt_cot` on full context scores 92.62%, exceeding EverMemOS (92.32%) and the claimed FC baseline (91.21%). The answer prompt, not the memory system, explains the score.	fc-baseline/README.md
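
Several figures in the table follow from simple arithmetic on numbers quoted in the rows themselves. The sketch below recomputes them as a sanity check. It is illustrative only and not one of the repository's scripts; all variable names are hypothetical, and the leniency ratio assumes the v1 (10.61%) and v2 (62.81%) percentages listed under `ap-baseline/` in Repository Structure are both judge acceptance rates.

```python
# Illustrative arithmetic check of figures quoted in the Key Findings table.
# Variable names are hypothetical; the inputs are the numbers quoted above.

TOTAL_QUESTIONS = 1540
WRONG_GOLDEN = 99

# Share of questions whose golden answer is wrong, and the resulting ceiling:
# the best score reachable without credit for matching a wrong golden answer.
error_rate = WRONG_GOLDEN / TOTAL_QUESTIONS
ceiling = 100 * (1 - error_rate)

# Token cost: README claim vs. the paper's Table 8 (GPT-4.1-mini).
readme_tokens = 2298
paper_tokens = 6669
token_ratio = paper_tokens / readme_tokens

# Distance between the reported overall score and the aggregate ceiling.
overall_score = 92.32
headroom = ceiling - overall_score

# Assumption: both percentages are judge acceptance rates
# (v2 vague-but-topical vs. v1 specific-but-wrong).
leniency_ratio = 62.81 / 10.61

print(f"error rate:     {error_rate:.1%}")
print(f"ceiling:        {ceiling:.2f}%")
print(f"token ratio:    {token_ratio:.1f}x")
print(f"headroom:       {headroom:.2f} points")
print(f"leniency ratio: {leniency_ratio:.1f}x")
```

The first four values agree with the 6.4%, 93.57%, 2.9x, and 1.25-point figures in the table. The last comes out just under 6, consistent with the "6x leniency finding" noted for `ap-baseline/README.md`.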

Repository Structure

locomo-audit/
├── data/
│   └── locomo10.json              # Original dataset (unmodified, SHA256-verified)
├── audit/
│   ├── conv_0.json ... conv_9.json          # Per-conversation audit packages
│   └── errors_conv_0.json ... errors_conv_9.json  # Errors found per conversation
├── results-audit/                 # Score impact analysis across 5 published systems
│   ├── RESULTS_AUDIT.md           # Adjusted scores, ceiling analysis, cross-check
│   ├── audit_results.py           # Audit script (LLM judge, ~1,485 calls)
│   └── download_results.py        # Fetches published eval_results from HuggingFace
├── ap-baseline/                   # Judge leniency stress test
│   ├── README.md                  # Strategies, results, 6x leniency finding
│   ├── score_ap.py                # Scoring pipeline (same judge as original eval)
│   ├── v1/                        # Specific-but-wrong strategy (10.61%)
│   └── v2/                        # Vague-but-topical strategy (62.81%)
├── fc-baseline/                   # Independent full-context baseline (4 runs, 2 models x 2 prompts)
│   ├── README.md                  # Methodology, results, key finding (prompt explains gap)
│   ├── scripts/                   # fc_eval.py (~860 lines) and analyze_results.py
│   └── results/                   # eval_results.json for all 4 runs
├── methodology/                   # Evaluation methodology analysis
│   ├── README.md                  # Overview and key findings
│   ├── prompts.md                 # Answer prompts, judge prompt, context templates
│   ├── word_counts.md             # Answer length statistics and scoring correlation
│   ├── token_efficiency.md        # Token cost claims vs. paper's own data
│   ├── discrepancies.md           # Cross-repository model, prompt, scoring differences
│   ├── full_context_baseline.md   # Full-context baselines: 4 measured runs, prompt explains the gap
│   ├── image_questions.md         # Image-dependent questions and BLIP caption handling
│   ├── reproducibility.md         # Third-party reproducibility reports
│   └── scripts/                   # Analysis scripts (stdlib-only Python)
├── evaluation/
│   └── config/
│       └── prompts.yaml           # Judge prompts (from EverMemOS pipeline, SHA256-verified)
├── scripts/
│   └── verify_sha256.py           # Verify dataset integrity against known hashes
├── errors.json                    # Consolidated error report (all conversations)
├── AUDIT_REPORT.md                # Ground truth audit: full findings and analysis
├── requirements.txt               # Python dependencies (openai, pyyaml)
└── README.md

Provenance

File	Source	License	SHA256
`data/locomo10.json`	`snap-research/locomo`	CC BY-NC 4.0	`79fa87e9...ea698ff4`
`evaluation/config/prompts.yaml`	`EverMind-AI/EverMemOS`	Apache 2.0	`ba4f668e...ba498ee9`

Both files are byte-for-byte matches with their official upstream sources (verified Feb 2026). Run `python scripts/verify_sha256.py` to confirm. See `THIRD-PARTY-NOTICES.md` for full license attribution.
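
For readers who want to see what such a check involves, here is a minimal sketch using only the standard library. It is not the contents of `scripts/verify_sha256.py`; the function names are hypothetical. Because the Provenance table lists abbreviated hashes, the sketch compares only the leading and trailing hex digits shown there, which is weaker than comparing full digests.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_abbreviated(digest: str, abbreviated: str) -> bool:
    """Check a full hex digest against a 'prefix...suffix' abbreviation."""
    prefix, _, suffix = abbreviated.partition("...")
    return digest.startswith(prefix) and digest.endswith(suffix)


# Abbreviated hashes exactly as listed in the Provenance table.
EXPECTED = {
    "data/locomo10.json": "79fa87e9...ea698ff4",
    "evaluation/config/prompts.yaml": "ba4f668e...ba498ee9",
}

if __name__ == "__main__":
    for rel_path, abbreviated in EXPECTED.items():
        path = Path(rel_path)
        if not path.exists():
            print(f"MISSING  {rel_path}")
            continue
        ok = matches_abbreviated(sha256_of(path), abbreviated)
        print(f"{'OK' if ok else 'MISMATCH':8} {rel_path}")
```

Run from the repository root, this prints one status line per file. For an authoritative result, use the repository's own script, which checks against the full known hashes.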

Prior Work

This audit builds on errors first reported in snap-research/locomo#27 (29 errors). Our systematic audit found 156 total issues: 99 score-corrupting, 57 citation-only.

License

This work is licensed under CC BY-NC 4.0, the same license as the underlying LoCoMo dataset.

The LoCoMo dataset was created by Maharana, A., Lee, D. H., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. and is published by Snap Research under CC BY-NC 4.0. The unmodified dataset is included in `data/locomo10.json` (SHA256-verified). This repository contains audit annotations and analysis derived from that dataset.
