Improve DeduplicationModule for production use #97

@ooples

Problem

The DeduplicationModule currently uses simplified implementations that have known limitations for production use. While functional for basic exact-match deduplication, it needs improvements for robustness and accuracy.

Current Limitations

1. Naive Sentence Splitting

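Issue: Uses regex split(/([.!?]+\s+)/), which treats every period followed by whitespace as a sentence end, so it breaks on abbreviations
Example that fails:

  • "Dr. Smith went to the U.S. embassy."
    → Splits into: "Dr", "Smith went to the U.S", "embassy." (with the ". " separators captured between them)

Example that this pattern handles correctly:

  • "The price is $5.99. The total is $10.00."
    → Splits only at the sentence boundary, because the decimal points are not followed by whitespace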

Impact: False positives (treating sentence fragments as duplicates) and incorrect deduplication
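
To make the failure concrete, the pattern quoted above can be run on both example strings. This is a minimal sketch: naiveSplit is a hypothetical helper name, not a function from the module, and the only thing taken from this issue is the regex itself.

```javascript
// The split pattern quoted in this issue. The capturing group makes
// String.prototype.split return the separators as array elements too.
const SENTENCE_SPLIT = /([.!?]+\s+)/;

// Hypothetical helper for illustration only.
function naiveSplit(text) {
  return text.split(SENTENCE_SPLIT);
}

const abbreviations = naiveSplit("Dr. Smith went to the U.S. embassy.");
const decimals = naiveSplit("The price is $5.99. The total is $10.00.");

console.log(abbreviations);
// → ["Dr", ". ", "Smith went to the U.S", ". ", "embassy."]
console.log(decimals);
// → ["The price is $5.99", ". ", "The total is $10.00."]
```

The abbreviation periods followed by a space ("Dr. " and "U.S. ") trigger a split, while periods followed directly by another character (inside "U.S" and in the prices) do not.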

Recommended fix: Use proper sentence tokenization:

  • Intl.Segmenter API (Node 16+) with granularity: 'sentence'
  • Or lightweight NLP library like compromise or natural
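
A rough sketch of the Intl.Segmenter option follows. splitSentences is a hypothetical name, and the default locale "en" is an assumption. How abbreviations such as "Dr." are segmented depends on the ICU data shipped with the runtime, so this should be checked against the failing examples before it replaces the regex.

```javascript
// Sketch: sentence splitting with the built-in Intl.Segmenter.
// splitSentences is a hypothetical helper; "en" is an assumed default locale.
function splitSentences(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  // Each segment keeps its trailing whitespace, so joining the
  // segments with "" reproduces the input exactly.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const simple = splitSentences("Hello world. How are you?");
console.log(simple);

const tricky = splitSentences("Dr. Smith went to the U.S. embassy.");
console.log(tricky); // inspect on the target runtime; output depends on ICU data
```

Because the segments are lossless, the deduplicated sentences can be rejoined without a separate list of separators.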

2. No Fuzzy/Semantic Matching

Issue: The similarityThreshold option was removed because it was unused, so only exact matches are detected
Examples that should match but don't:

  • "The quick brown fox" vs "The fast brown fox"
  • "API returns 404" vs "API returned 404 error"

Recommended fix:

  • Add Levenshtein distance or Jaccard similarity
  • Or use vector embeddings for semantic similarity (cosine distance)
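
As one possible starting point for the Jaccard option, here is a word-level sketch. tokenSet, jaccard, and isNearDuplicate are hypothetical names, and the default threshold of 0.5 is an arbitrary assumption, not a value from the module.

```javascript
// Lowercase the text and split on non-alphanumerics to get a set of word tokens.
function tokenSet(text) {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets.
function jaccard(a, b) {
  const setA = tokenSet(a);
  const setB = tokenSet(b);
  if (setA.size === 0 && setB.size === 0) return 1; // two empty texts are identical
  let shared = 0;
  for (const token of setA) {
    if (setB.has(token)) shared++;
  }
  return shared / (setA.size + setB.size - shared);
}

// Hypothetical wrapper; the 0.5 default is an assumption for illustration.
function isNearDuplicate(a, b, similarityThreshold = 0.5) {
  return jaccard(a, b) >= similarityThreshold;
}

console.log(jaccard("The quick brown fox", "The fast brown fox")); // 3 shared of 5 → 0.6
console.log(jaccard("API returns 404", "API returned 404 error")); // 2 shared of 5 → 0.4
```

The second pair scores only 0.4 because "returns" and "returned" count as different words. Catching it would need stemming, character n-grams, or the embedding approach, so the choice of tokenizer matters as much as the threshold.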

3. preserveFirst=false Not Implemented

Issue: Option documented but not functional
Current behavior: Always preserves first occurrence regardless of setting

Recommended fix:

  • Track last occurrence index in a Map
  • Replace earlier occurrence when duplicate found
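
The Map-based approach could look roughly like this. dedupeItems is a hypothetical name, and using the trimmed string as the comparison key is an assumption; the module's real normalization may differ.

```javascript
// Sketch: deduplicate a list while honoring preserveFirst.
// dedupeItems is a hypothetical helper, not the module's API.
function dedupeItems(items, preserveFirst = true) {
  const keepIndex = new Map(); // normalized key -> index of the occurrence to keep
  items.forEach((item, index) => {
    const key = item.trim(); // assumed normalization
    // preserveFirst=true: record only the first index for each key.
    // preserveFirst=false: overwrite, so the last index wins.
    if (!preserveFirst || !keepIndex.has(key)) {
      keepIndex.set(key, index);
    }
  });
  const kept = new Set(keepIndex.values());
  // Filtering by index keeps the survivors in their original order.
  return items.filter((_, index) => kept.has(index));
}

console.log(dedupeItems(["a", "b", "a", "c"], true));  // → ["a", "b", "c"]
console.log(dedupeItems(["a", "b", "a", "c"], false)); // → ["b", "a", "c"]
```

With preserveFirst=false the surviving item sits where its last occurrence was, which is what "replace earlier occurrence" implies for ordering.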

4. Paragraph Deduplication Normalizes Spacing

Issue: split(/\n\s*\n/) and join('\n\n') collapses multiple blank lines
Impact: Changes original formatting even when no duplicates found

Recommended fix:

  • Split with capturing group: /(\r?\n\s*\r?\n)/
  • Preserve original separators for non-duplicate paragraphs
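
A sketch of separator-preserving paragraph deduplication, using the capturing-group pattern suggested above. dedupeParagraphs is a hypothetical name, and comparing paragraphs by their trimmed text is an assumption.

```javascript
// Sketch: drop repeated paragraphs but keep every surviving separator as written.
// dedupeParagraphs is a hypothetical helper, not the module's API.
function dedupeParagraphs(text) {
  // With a capturing group, split() returns paragraphs at even indices
  // and the exact separators between them at odd indices.
  const parts = text.split(/(\r?\n\s*\r?\n)/);
  const seen = new Set();
  const firstKey = parts[0].trim();
  if (firstKey !== "") seen.add(firstKey);
  let output = parts[0];
  for (let i = 2; i < parts.length; i += 2) {
    const key = parts[i].trim(); // assumed comparison key
    // Skip a duplicate together with the separator that precedes it.
    if (key !== "" && seen.has(key)) continue;
    if (key !== "") seen.add(key);
    output += parts[i - 1] + parts[i];
  }
  return output;
}

const untouched = "x\n\n\n\ny\r\n\r\nz";
console.log(dedupeParagraphs(untouched) === untouched); // true: nothing to remove
console.log(JSON.stringify(dedupeParagraphs("a\n\n\nb\n\na"))); // "a\n\n\nb"
```

Input without duplicates comes back byte-for-byte identical, including runs of blank lines and \r\n line endings.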

5. Weak Code Block Detection

Issue: Regex /```[\s\S]*?```/g matches fences that appear inline, not only at the start of a line
Example: Text like "Use ```bash" mid-sentence can be paired with a later fence and replaced by a placeholder

Recommended fix: Use stricter multiline anchors

/^```[\s\S]*?^```/gm
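
The difference between the two patterns can be checked with a small sample. This is only a sketch: the sample text is made up, and the patterns are assembled with new RegExp from a FENCE constant purely so the example contains no literal triple backticks; they are equivalent to the two regex literals quoted in this section.

```javascript
// Three backticks, built indirectly so this sample nests safely in a fenced block.
const FENCE = "`".repeat(3);

// Current pattern: a fence may start anywhere in a line.
const loose = new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g");
// Proposed pattern: with the m flag, ^ requires each fence to start a line.
const strict = new RegExp("^" + FENCE + "[\\s\\S]*?^" + FENCE, "gm");

// Made-up sample: an inline fence mention followed by a real code block.
const sample =
  "Use " + FENCE + "bash to open a fence.\n\n" +
  FENCE + "js\nconst x = 1;\n" + FENCE + "\nDone.";

const looseHits = sample.match(loose);
const strictHits = sample.match(strict);

console.log(looseHits);  // one match: the inline mention up to the real block's opening fence
console.log(strictHits); // one match: the real block only
```

The loose pattern pairs the inline mention with the opening fence of the real block, so the block itself is never matched. The anchored pattern skips the inline mention and captures the block intact.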

Priority

Medium-High - Module is functional for basic use cases but has rough edges that could cause issues in production with:

  • Technical documentation (many abbreviations)
  • Code examples with inline backticks
  • Content requiring precise formatting preservation

Suggested Approach

  1. Quick win: Fix code block regex (5 min)
  2. Short term: Implement preserveFirst=false properly (30 min)
  3. Medium term: Add proper sentence tokenization (2-3 hours)
  4. Long term: Add fuzzy/semantic matching (1 day)
