gut-puncture/Long-Number-Addition

Universal Length Bias in Large Language Models

Why do state-of-the-art LLMs struggle with simple addition when numbers get large?

This repository contains the code and data analysis for "Universal Length Bias" (also called Semantic Length Hallucination): a phenomenon in which an LLM (here, Qwen 2.5 7B) prioritizes a learned output template ("the answer has N+1 digits") over the actual arithmetic result.

The Discovery

We found that Qwen 2.5 7B is more accurate on "Hard" addition problems (where a carry causes an overflow, e.g., 99... + 1... = 100...) than on "Easy" problems (where there is no overflow).

  • Easy (No Overflow): Accuracy drops to ~20% at 10 digits.
  • Hard (Overflow): Accuracy is ~44% at 10 digits.

This paradox occurs because the model has a structural bias to generate an answer that is one digit longer than the inputs. When the math aligns with this bias (Hard problems), it succeeds. When the math conflicts (Easy problems), the bias overrides the computation.
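The Easy/Hard split can be sketched as follows. An "Easy" problem is one where the sum keeps the same digit count as the inputs; a "Hard" problem carries into an extra digit. This is a minimal illustration of the idea, not the repo's actual generator (`unified_analysis.py` has its own); `make_problem` is a hypothetical helper name:

```python
import random

def make_problem(n_digits: int, overflow: bool) -> tuple[int, int]:
    """Generate an n-digit addition problem by rejection sampling.

    overflow=True  -> "Hard": the sum carries into an (n+1)-th digit.
    overflow=False -> "Easy": the sum keeps the same digit count.
    """
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    while True:
        a = random.randint(lo, hi)
        b = random.randint(lo, hi)
        if (len(str(a + b)) > n_digits) == overflow:
            return a, b

# Hard: the answer is one digit longer than the inputs, matching the model's bias
a, b = make_problem(10, overflow=True)
assert len(str(a + b)) == 11

# Easy: the answer has the same length, conflicting with the bias
a, b = make_problem(10, overflow=False)
assert len(str(a + b)) == 10
```

Under this framing, "Hard" problems are exactly the ones where the correct answer agrees with the model's length prior, which is why the usual difficulty ordering inverts.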

Repository Structure

  • unified_analysis.py: The main script used to run the experiment on the GPU. It handles data generation, model inference, activation caching, and logit lens extraction.
  • output/: Contains the results and figures.
    • unified_analysis_results.json: Raw data from the experiment.
    • final_*.png: Visualizations of the findings.
  • output/data_analysis/: Contains the plotting scripts to reproduce the figures.
    • plot_final.py: The script used to generate the charts in output/.
  • MEDIUM_ARTICLE.md: A comprehensive write-up of the findings and the story behind the data.
  • experiment_log.md: Chronological log of the research process.

Reproducing the Analysis

1. Run the GPU Experiment

(Requires a GPU with ~24GB VRAM for Qwen 2.5 7B Float16)

```shell
python unified_analysis.py
```

This will generate unified_analysis_results.json.

2. Generate the Plots

(Can be run locally on CPU)

```shell
cd output/data_analysis
python plot_final.py
```

This will generate the visualizations in the output/ directory (reading the JSON from the parent directory).

Key Visualizations

  • The Efficiency Paradox: Shows the accuracy gap between Easy and Hard problems.
  • The Signal War: Demonstrates how the "Bias Token" probability overtakes the "Correct Answer" probability in deep layers.
  • Internal Divergence: Tracks the deviation of the model's internal state when it fails.
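To make the "bias wins over computation" claim concrete, a failure can be scored by asking whether a wrong answer nonetheless matches the N+1-digit template. This is an illustrative sketch of that scoring idea, not the repo's actual evaluation code; `classify_answer` is a hypothetical helper:

```python
def classify_answer(a: int, b: int, model_output: str) -> str:
    """Classify a model's answer as 'correct', 'length-bias', or 'other'.

    'length-bias' means the value is wrong but its digit count is one more
    than the inputs', i.e. the "N+1 digits" template won over the arithmetic.
    """
    digits = "".join(ch for ch in model_output if ch.isdigit())
    if not digits:
        return "other"
    if int(digits) == a + b:
        return "correct"
    n = max(len(str(a)), len(str(b)))
    return "length-bias" if len(digits) == n + 1 else "other"

# Correct 10-digit sum (no overflow): 1234567890 + 1111111111 = 2345679001
assert classify_answer(1234567890, 1111111111, "2345679001") == "correct"
# Wrong 11-digit answer to the same problem: the length template overrode the math
assert classify_answer(1234567890, 1111111111, "12345679001") == "length-bias"
```

Counting "length-bias" failures separately from other errors is what distinguishes a structural template bias from ordinary arithmetic mistakes.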

Citation

If you use this analysis, please link back to this repository.

About

Analysis of the accuracy of the Qwen 2.5 7B model on addition of numbers with 10-18 digits. We found that for large numbers, the model is biased toward producing an answer with an extra digit, even when none is needed.
