Problem
The DeduplicationModule currently uses simplified implementations that have known limitations for production use. While functional for basic exact-match deduplication, it needs improvements for robustness and accuracy.
Current Limitations
1. Naive Sentence Splitting
Issue: Uses regex split(/([.!?]+\s+)/) which breaks on abbreviations
Examples that fail:
- "Dr. Smith went to the U.S. embassy."
→ Splits into: "Dr", "Smith went to the U.S", "embassy." (plus the captured ". " separators), so "Dr" is treated as a sentence of its own
- "The price is $5.99. The total is $10.00."
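→ Handled correctly by this exact regex (the \s+ requires whitespace after the period), but a looser pattern without \s+ would split on the decimal points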
Impact: False positives (treating sentence fragments as duplicates) and incorrect deduplication
Recommended fix: Use proper sentence tokenization:
- Intl.Segmenter API (Node 16+) with granularity: 'sentence'
- Or a lightweight NLP library such as compromise or natural
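A minimal sketch of the Intl.Segmenter option. The helper name splitSentences is hypothetical and not part of the module. Intl.Segmenter follows the Unicode sentence-boundary rules, so it leaves decimals such as 5.99 intact, but it may still break after an abbreviation that is followed by a capitalized word, so the output should be checked against the examples above before adopting it.

```javascript
// Sketch only: splitSentences is a hypothetical helper, not existing module code.
function splitSentences(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  // Each segment keeps its trailing whitespace, so joining the
  // segments reproduces the input exactly.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const sentences = splitSentences("The price is $5.99. The total is $10.00.");
console.log(sentences);
```

Because the segments concatenate back to the original string, deduplicating at this level does not have to re-insert separators.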
2. No Fuzzy/Semantic Matching
Issue: Only exact matches are detected; the similarityThreshold option was removed because it was unused
Examples that should match but don't:
- "The quick brown fox" vs "The fast brown fox"
- "API returns 404" vs "API returned 404 error"
Recommended fix:
- Add Levenshtein distance or Jaccard similarity
- Or use vector embeddings for semantic similarity (cosine distance)
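As a rough illustration of the Jaccard option, the sketch below compares the word sets of two strings. The function names and the 0.6 default threshold are assumptions for illustration, not existing module code.

```javascript
// Sketch only: tokenize, jaccard and isNearDuplicate are hypothetical helpers.
function tokenize(text) {
  // Lowercased alphanumeric words, collected into a set.
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

function jaccard(a, b) {
  const setA = tokenize(a);
  const setB = tokenize(b);
  if (setA.size === 0 && setB.size === 0) return 1; // two empty inputs count as identical
  let shared = 0;
  for (const token of setA) {
    if (setB.has(token)) shared++;
  }
  // |A ∩ B| / |A ∪ B|
  return shared / (setA.size + setB.size - shared);
}

function isNearDuplicate(a, b, similarityThreshold = 0.6) {
  return jaccard(a, b) >= similarityThreshold;
}

console.log(jaccard("The quick brown fox", "The fast brown fox")); // 3 shared of 5 total words
console.log(jaccard("API returns 404", "API returned 404 error")); // 2 shared of 5 total words
```

With plain word sets the second pair scores lower than the first, since "returns" and "returned" count as different tokens. Stemming or a character-level measure such as Levenshtein distance would be needed to catch it.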
3. preserveFirst=false Not Implemented
Issue: Option documented but not functional
Current behavior: Always preserves first occurrence regardless of setting
Recommended fix:
- Track last occurrence index in a Map
- Replace earlier occurrence when duplicate found
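One possible shape for this, assuming the items are already split into an array. dedupeItems is a hypothetical helper, and the sketch assumes that with preserveFirst=false the surviving copy stays at the position of its last occurrence.

```javascript
// Sketch only: dedupeItems is a hypothetical helper, not existing module code.
function dedupeItems(items, preserveFirst = true) {
  const keptIndex = new Map(); // normalized text -> index of the occurrence to keep
  items.forEach((item, i) => {
    const key = item.trim();
    // preserveFirst: record only the first index seen.
    // otherwise: overwrite on every occurrence, so the last index wins.
    if (!preserveFirst || !keptIndex.has(key)) {
      keptIndex.set(key, i);
    }
  });
  const keep = new Set(keptIndex.values());
  return items.filter((_, i) => keep.has(i));
}

console.log(dedupeItems(["a", "b", "a", "c"], true));  // keeps the first "a"
console.log(dedupeItems(["a", "b", "a", "c"], false)); // keeps the last "a"
```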
4. Paragraph Deduplication Normalizes Spacing
Issue: split(/\n\s*\n/) followed by join('\n\n') collapses any run of blank lines into a single blank line
Impact: Changes original formatting even when no duplicates found
Recommended fix:
- Split with a capturing group: /(\r?\n\s*\r?\n)/
- Preserve original separators for non-duplicate paragraphs
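A sketch of how the capturing-group split could keep the original separators. dedupeParagraphs is a hypothetical name. When a duplicate paragraph is dropped, this version also drops the separator in front of it, which is one possible policy.

```javascript
// Sketch only: dedupeParagraphs is a hypothetical helper, not existing module code.
function dedupeParagraphs(text) {
  // With a capturing group, split() returns
  // [paragraph, separator, paragraph, separator, ..., paragraph].
  const parts = text.split(/(\r?\n\s*\r?\n)/);
  const seen = new Set([parts[0].trim()]);
  let out = parts[0];
  for (let i = 2; i < parts.length; i += 2) {
    const key = parts[i].trim();
    // Skip a repeated paragraph together with the separator before it.
    if (key !== "" && seen.has(key)) continue;
    seen.add(key);
    out += parts[i - 1] + parts[i]; // original separator, then the paragraph
  }
  return out;
}

// No duplicates: the text comes back unchanged, including the extra blank lines.
console.log(JSON.stringify(dedupeParagraphs("a\n\n\n\nb\r\n\r\nc")));
```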
5. Weak Code Block Detection
Issue: Regex /```[\s\S]*?```/g matches inline fences
Example: Text like "Use ```bash" mid-sentence will create a placeholder
Recommended fix: Use stricter multiline anchors: /^```[\s\S]*?^```/gm
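A small sketch of the anchored pattern in use. extractCodeBlocks and the placeholder format are assumptions for illustration. The regex is written with a {3} quantifier, which is equivalent to three literal backticks and keeps the example free of nested fences.

```javascript
// Sketch only: extractCodeBlocks and the placeholder text are hypothetical.
// Equivalent to the anchored triple-backtick regex: with the m flag,
// ^ only matches at the start of a line.
const FENCED_BLOCK = /^`{3}[\s\S]*?^`{3}/gm;

function extractCodeBlocks(text) {
  const blocks = [];
  const stripped = text.replace(FENCED_BLOCK, (block) => {
    blocks.push(block);
    return `__CODE_BLOCK_${blocks.length - 1}__`;
  });
  return { stripped, blocks };
}

const fence = "`".repeat(3);
const sample =
  `Use ${fence}bash mid-sentence.\n\n${fence}js\nconst x = 1;\n${fence}\nDone.`;
const result = extractCodeBlocks(sample);
console.log(result.blocks.length); // only the fence that starts a line is captured
console.log(result.stripped);
```

The inline fence in the first sentence is left alone because it does not start a line.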
Priority
Medium-High - Module is functional for basic use cases but has rough edges that could cause issues in production with:
- Technical documentation (many abbreviations)
- Code examples with inline backticks
- Content requiring precise formatting preservation
Suggested Approach
- Quick win: Fix code block regex (5 min)
- Short term: Implement preserveFirst=false properly (30 min)
- Medium term: Add proper sentence tokenization (2-3 hours)
- Long term: Add fuzzy/semantic matching (1 day)