Improve DeduplicationModule for production use #97

## Problem

The `DeduplicationModule` currently uses simplified implementations that have known limitations for production use. While functional for basic exact-match deduplication, it needs improvements for robustness and accuracy.

## Current Limitations

### 1. Naive Sentence Splitting
**Issue**: Sentences are split with `split(/([.!?]+\s+)/)`, which treats any period followed by whitespace as a sentence boundary and so breaks on abbreviations  
**Examples that fail**:
- "Dr. Smith went to the U.S. embassy."  
  → Splits into the fragments "Dr", "Smith went to the U.S", and "embassy." (plus the ". " delimiters, which the capturing group returns as separate elements)
- "The price is $5.99. The total is $10.00."  
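  → Does not actually split on the decimal points with the regex as quoted, because no whitespace follows them; this case breaks only if the pattern stops requiring trailing whitespace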

**Impact**: False positives (treating sentence fragments as duplicates) and incorrect deduplication
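
The fragments can be reproduced directly with the regex quoted in the issue. This is a minimal sketch: `naiveSplit` is a hypothetical stand-in for the module's internal splitter, which may differ in detail.

```typescript
// Hypothetical stand-in for the module's splitter, using the regex quoted above.
function naiveSplit(text: string): string[] {
  // The capturing group makes split() return the delimiters as elements too.
  return text.split(/([.!?]+\s+)/);
}

const fragments = naiveSplit("Dr. Smith went to the U.S. embassy.");
// fragments → ["Dr", ". ", "Smith went to the U.S", ". ", "embassy."]
```

The period after "Dr" ends the first fragment, while "U.S." is only cut at its final period because the pattern requires whitespace after the punctuation.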

**Recommended fix**: Use proper sentence tokenization:
- `Intl.Segmenter` API (Node 16+) with `granularity: 'sentence'`
- Or a lightweight NLP library such as `compromise` or `natural`
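
A minimal sketch of the `Intl.Segmenter` option, assuming an English locale. The helper name `splitSentences` and the local type declarations are illustrative and not part of the module. How abbreviations such as "Dr." are segmented depends on the runtime's ICU rules, so the failing examples should be checked on the target Node version before relying on this.

```typescript
// Minimal local typing for Intl.Segmenter, so the sketch compiles even when
// the TypeScript `lib` setting predates ES2022.
interface SentenceSegmenter {
  segment(input: string): Iterable<{ segment: string }>;
}
type SegmenterCtor = new (
  locale: string,
  options: { granularity: "sentence" }
) => SentenceSegmenter;

// Hypothetical helper: returns the sentences of `text`, including their
// trailing whitespace, so that joining them reproduces the input.
function splitSentences(text: string, locale: string = "en"): string[] {
  const Ctor = (Intl as unknown as { Segmenter?: SegmenterCtor }).Segmenter;
  if (!Ctor) {
    // Runtime without Intl.Segmenter: treat the whole text as one sentence.
    return text.length > 0 ? [text] : [];
  }
  const segmenter = new Ctor(locale, { granularity: "sentence" });
  return Array.from(segmenter.segment(text), (part) => part.segment);
}
```

Because each segment keeps its trailing whitespace, `splitSentences(text).join("")` returns the original text, which would help when reassembling output after duplicates are removed.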

### 2. No Fuzzy/Semantic Matching
**Issue**: Only exact matches are detected; the `similarityThreshold` option was removed because it was unused  
**Examples that should match but don't**:
- "The quick brown fox" vs "The fast brown fox"  
- "API returns 404" vs "API returned 404 error"

**Recommended fix**:
- Add Levenshtein distance or Jaccard similarity
- Or use vector embeddings for semantic similarity (cosine distance)
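
As a rough illustration of the Jaccard option, the sketch below compares the word sets of two strings. The function names and the word-level, lowercase tokenization are assumptions and not existing module code.

```typescript
// Assumption: tokens are lowercase runs of ASCII letters and digits.
function tokenize(text: string): Set<string> {
  const words = text.toLowerCase().match(/[a-z0-9]+/g);
  return new Set<string>(words ?? []);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, in the range 0..1.
function jaccardSimilarity(a: string, b: string): number {
  const setA = tokenize(a);
  const setB = tokenize(b);
  if (setA.size === 0 && setB.size === 0) {
    return 1; // two empty inputs are treated as identical
  }
  let intersection = 0;
  setA.forEach((word) => {
    if (setB.has(word)) {
      intersection += 1;
    }
  });
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}

const foxScore = jaccardSimilarity("The quick brown fox", "The fast brown fox"); // 3/5 = 0.6
const apiScore = jaccardSimilarity("API returns 404", "API returned 404 error"); // 2/5 = 0.4
```

The second pair scores lower because "returns" and "returned" count as different words, which suggests that word-level Jaccard would need stemming or a fairly low threshold to catch it.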

### 3. `preserveFirst=false` Not Implemented
**Issue**: Option documented but not functional  
**Current behavior**: Always preserves first occurrence regardless of setting

**Recommended fix**:
- Track the index of the last occurrence of each item in a `Map`
- When a duplicate is found, drop the earlier occurrence and keep the later one
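
One way to support both modes with a single index map, sketched under the assumption that items are compared by exact string equality after trimming. `dedupeItems` is a hypothetical helper; only the `preserveFirst` option name comes from the module.

```typescript
function dedupeItems(items: string[], preserveFirst: boolean = true): string[] {
  // Maps each trimmed item to the index of the occurrence that should survive.
  const keptIndex = new Map<string, number>();
  items.forEach((item, index) => {
    const key = item.trim();
    // preserveFirst: record only the first index.
    // Otherwise: overwrite on every occurrence, so the last index wins.
    if (!preserveFirst || !keptIndex.has(key)) {
      keptIndex.set(key, index);
    }
  });
  // Keep an item only at its surviving index; relative order is unchanged.
  return items.filter((item, index) => keptIndex.get(item.trim()) === index);
}

const firstKept = dedupeItems(["a", "b", "a", "c"], true); // → ["a", "b", "c"]
const lastKept = dedupeItems(["a", "b", "a", "c"], false); // → ["b", "a", "c"]
```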

### 4. Paragraph Deduplication Normalizes Spacing  
**Issue**: Splitting with `split(/\n\s*\n/)` and rejoining with `join('\n\n')` collapses runs of blank lines into a single blank line  
**Impact**: Changes the original formatting even when no duplicates are found

**Recommended fix**:
- Split with capturing group: `/(\r?\n\s*\r?\n)/`
- Preserve original separators for non-duplicate paragraphs
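
A minimal sketch of the separator-preserving approach, using the capturing-group regex suggested above. `dedupeParagraphs` is a hypothetical helper, and comparing paragraphs after trimming is an assumption. When a duplicate is removed, this version also drops the separator that preceded it, which is one of several reasonable choices.

```typescript
// Hypothetical helper: removes repeated paragraphs but keeps the original
// separators, so input without duplicates is returned unchanged.
function dedupeParagraphs(text: string): string {
  // With a capturing group, split() returns paragraphs at even indices and
  // the separators between them at odd indices.
  const parts = text.split(/(\r?\n\s*\r?\n)/);
  const seen = new Set<string>();
  let output = "";

  for (let i = 0; i < parts.length; i += 2) {
    const paragraph = parts[i];
    const key = paragraph.trim(); // assumption: compare paragraphs after trimming
    if (key.length > 0 && seen.has(key)) {
      continue; // drop the duplicate together with the separator before it
    }
    seen.add(key);
    if (i > 0) {
      output += parts[i - 1]; // original separator preceding this paragraph
    }
    output += paragraph;
  }
  return output;
}
```

For example, `"A\n\n\nB\r\n\r\nC"` comes back unchanged, including the triple newline and the CRLF separator.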

### 5. Weak Code Block Detection
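**Issue**: The regex `/```[\s\S]*?```/g` also matches triple backticks that appear in the middle of a line  
**Example**: Text such as "Use \`\`\`bash" in the middle of a sentence is treated as an opening fence and replaced with a placeholder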

**Recommended fix**: Anchor both fences to the start of a line and add the `m` flag:
```typescript
/^```[\s\S]*?^```/gm
```
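
The difference between the two patterns can be checked on a small sample. This is a sketch only: the sample text and variable names are made up for illustration, and the patterns are built with `new RegExp` so the example itself contains no literal fences.

```typescript
// Three backticks, built at runtime to avoid literal fences in this example.
const FENCE = "`".repeat(3);

// Original pattern: a fence may start anywhere in a line.
const looseFence = new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g");
// Stricter pattern: both fences must sit at the start of a line (m flag).
const strictFence = new RegExp("^" + FENCE + "[\\s\\S]*?^" + FENCE, "gm");

const sample = [
  "Use " + FENCE + "bash to open a shell block.",
  "",
  FENCE + "ts",
  "const x = 1;",
  FENCE,
].join("\n");

const looseMatches = sample.match(looseFence) ?? [];
const strictMatches = sample.match(strictFence) ?? [];
// looseMatches[0] starts at the inline fence and swallows the prose up to the
// real opening fence. strictMatches holds only the real block, from the "ts"
// fence through its closing fence.
```

The stricter pattern still accepts any line that starts with a fence as the closing delimiter, so an unclosed block could pair with the opening fence of the next one.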

## Priority

**Medium-High** - Module is functional for basic use cases but has rough edges that could cause issues in production with:
- Technical documentation (many abbreviations)
- Code examples with inline backticks
- Content requiring precise formatting preservation

## Suggested Approach

1. **Quick win**: Fix code block regex (5 min)
2. **Short term**: Implement `preserveFirst=false` properly (30 min)
3. **Medium term**: Add proper sentence tokenization (2-3 hours)
4. **Long term**: Add fuzzy/semantic matching (1 day)

## References

- CodeRabbit review: PR #96 comments
- Related: #96 (merged with known limitations)
