FuzzyScorer is a .NET 10.0 class library designed to analyze text data and generate word scoring results. These results are intended to be used for creating word clouds, where the "score" typically represents the frequency of a word in a given text.
The primary objective is to provide a robust engine for:
- Splitting text into individual words.
- Calculating word frequency (scores).
- Grouping similar words based on Levenshtein distance to account for typos or variations.
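The first two objectives can be illustrated with a minimal sketch. `CountWords` is a hypothetical helper written for this README, not the library's implementation; it assumes words are whitespace-delimited and that counting is case-insensitive.

```csharp
using System;
using System.Linq;

// Hypothetical helper (an assumption, not FuzzyScorer's code):
// split on whitespace, lower-case each token, count occurrences.
(string Text, int Score)[] CountWords(string text)
{
    return text
        .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
        .GroupBy(word => word.ToLowerInvariant())
        .Select(group => (Text: group.Key, Score: group.Count()))
        .OrderByDescending(entry => entry.Score) // stable sort: ties keep first-seen order
        .ToArray();
}

foreach (var (word, score) in CountWords("Hello world hello again"))
{
    Console.WriteLine($"{word}: {score}");
}
// Output:
// hello: 2
// world: 1
// again: 1
```

Grouping by a lower-cased key is one simple way to get case-insensitive counts; the real engine may normalize differently.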
While modern AI models (LLMs) act as Philosophers, understanding the meaning (semantic similarity) of words, such as knowing that "Cat" and "Dog" are both pets, FuzzyScorer acts as a Spellchecker.
We prioritize structural similarity. Instead of asking what a word means, we ask how it is built. This allows the engine to recognize that "TIGER" and "TlGER" are likely the same word, even if an AI might get confused by the visual typo.
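To make "how it is built" concrete, here is a minimal sketch of Levenshtein distance and a greedy grouping pass built on it. Both `Levenshtein` and `GroupSimilar` are hypothetical names, and the choices made here (case-insensitive comparison, first-seen word as the group label) are assumptions rather than a description of the library's internals.

```csharp
using System;
using System.Collections.Generic;

// Two-row dynamic-programming Levenshtein distance.
// Assumption: comparison ignores case.
int Levenshtein(string a, string b)
{
    a = a.ToLowerInvariant();
    b = b.ToLowerInvariant();
    var previous = new int[b.Length + 1];
    var current = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) previous[j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        current[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            int insertion = current[j - 1] + 1;
            int deletion = previous[j] + 1;
            current[j] = Math.Min(substitution, Math.Min(insertion, deletion));
        }
        (previous, current) = (current, previous); // reuse the two rows
    }
    return previous[b.Length];
}

// Greedy grouping: each word joins the first group whose label is within
// the threshold; otherwise it starts a new group.
List<(string Text, int Score)> GroupSimilar(IEnumerable<string> words, int threshold)
{
    var groups = new List<(string Text, int Score)>();
    foreach (string word in words)
    {
        int index = groups.FindIndex(g => Levenshtein(g.Text, word) <= threshold);
        if (index >= 0)
            groups[index] = (groups[index].Text, groups[index].Score + 1);
        else
            groups.Add((word, 1));
    }
    return groups;
}

Console.WriteLine(Levenshtein("TIGER", "TlGER")); // 1
Console.WriteLine(Levenshtein("Cat", "Dog"));     // 3

foreach (var (text, score) in GroupSimilar("Hello helo hallo world wor1d".Split(' '), 1))
{
    Console.WriteLine($"{text}: {score}");
}
// Output:
// Hello: 3
// world: 2
```

"TIGER" and "TlGER" differ by a single substitution, so they fall into one group at a threshold of 1, while "Cat" and "Dog" share no characters by position and stay apart.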
- Live Event Feedback: Merging typos in live survey results (e.g., "Excelent" and "Excellent") to show a true consensus in word clouds.
- OCR Data Cleaning: Automatically repairing text from scanned documents where "l" (small L) is often mistaken for "I" (capital I).
- Customer Record Matching: Identifying duplicate entries in databases like "John Smith" and "Jon Smith".
- Spam Filtering: Catching "obfuscated" words designed to bypass simple filters (e.g., "M0ney" or "W4tch").
- Bioinformatics: Measuring mutation distances between DNA sequences represented as strings of characters.
- Platform: .NET 10.0
- Language: C# 13
- Project Type: Library
- Key Features: LINQ for data processing, null safety (nullable reference types enabled).
FuzzyScorer includes built-in safeguards to prevent denial-of-service attacks and ensure safe operation in server environments:
- Input Limits:
  - Maximum raw input size of 1,000,000 characters (MaxInputLength), enforced before any processing.
  - Maximum 10,000 words per text (MaxWordsPerText), enforced after splitting.
  - Maximum word length of 256 characters (MaxWordLength); longer tokens are silently dropped.
  - Similarity threshold capped at 50 (MaxSimilarityThreshold), which prevents O(n²) Levenshtein blowup.
- Input Normalization: Removes non-alphanumeric characters (except spaces/hyphens) and invalid words before processing.
- Cancellation Support: All scoring methods accept an optional CancellationToken for graceful operation cancellation in async contexts.
- Immutable Objects: WordScore objects are read-only after construction with validation.
- No Hardcoded Secrets: Project contains no API keys, tokens, or sensitive data.
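The order in which these guards apply can be sketched roughly as follows. `PrepareWords` is a hypothetical function; the constant names and limits come from the list above, but the regular expression, the point at which the word count is checked, and the placement of the cancellation check are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading;

// Limits as documented above; everything else in this sketch is an assumption.
const int MaxInputLength = 1_000_000;
const int MaxWordsPerText = 10_000;
const int MaxWordLength = 256;

IReadOnlyList<string> PrepareWords(string input, CancellationToken cancellationToken = default)
{
    // 1. Reject oversized input before doing any work.
    if (input.Length > MaxInputLength)
        throw new ArgumentException($"Input exceeds maximum length of {MaxInputLength} characters");

    // 2. Normalize: keep letters, digits, whitespace and hyphens only.
    string normalized = Regex.Replace(input, @"[^\p{L}\p{N}\s\-]", "");

    // 3. Split, dropping over-long tokens silently and honoring cancellation.
    var words = new List<string>();
    var separators = new[] { ' ', '\t', '\r', '\n' };
    foreach (string token in normalized.Split(separators, StringSplitOptions.RemoveEmptyEntries))
    {
        cancellationToken.ThrowIfCancellationRequested();
        if (token.Length <= MaxWordLength) words.Add(token);
    }

    // 4. Enforce the word-count limit after splitting.
    if (words.Count > MaxWordsPerText)
        throw new ArgumentException(
            $"Input contains {words.Count} words, exceeding limit of {MaxWordsPerText}");

    return words;
}

Console.WriteLine(string.Join(" | ", PrepareWords("Hello, world! well-known")));
// Output:
// Hello | world | well-known
```

Checking the raw length first keeps the cost of rejecting a hostile input constant, since no normalization or splitting happens before the check.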
The project follows a strictly defined structure as documented in STRUCTURE.md.
- Scorer.cs: Contains the core word scoring and similarity logic.
- WordScore.cs: A POCO (Plain Old CLR Object) representing a word and its associated score.
- AI_RULES.md: Contains specific coding standards and AI-specific guidelines for this project.
To execute the unit tests:
- Open a terminal in the project root.
- Run the tests:
dotnet test
Analyze text and get word frequencies (case-insensitive):
string text = "Hello world hello again";
var results = Scorer.GetScoringWords(text);
foreach (var result in results)
{
    Console.WriteLine($"{result.Text}: {result.Score}");
}
// Output:
// hello: 2
// world: 1
// again: 1
Group similar words (e.g., handle typos) using a similarity threshold:
string text = "Hello helo hallo world wor1d";
int similarity = 1; // Allow up to 1 character difference
var results = Scorer.GetScoringWords(text, similarity);
foreach (var result in results)
{
    Console.WriteLine($"{result.Text}: {result.Score}");
}
// Output:
// Hello: 3 (groups "Hello", "helo", "hallo")
// world: 2 (groups "world", "wor1d")
For long-running operations or server contexts, provide a CancellationToken:
// Dispose the source when done so its internal timer is released.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
string largeText = /* ... large input ... */;
try
{
    var results = Scorer.GetScoringWords(largeText, cts.Token);
    // Process results
}
catch (OperationCanceledException)
{
    Console.WriteLine("Operation was cancelled after 5 seconds.");
}
Input exceeding any limit raises ArgumentException:
// Too many characters (> 1,000,000)
try
{
    string hugeText = new string('a', 2_000_000);
    var results = Scorer.GetScoringWords(hugeText);
}
catch (ArgumentException ex)
{
    Console.WriteLine($"Input validation failed: {ex.Message}");
    // "Input exceeds maximum length of 1000000 characters"
}
// Too many words (> 10,000)
try
{
    string manyWords = string.Join(" ", Enumerable.Range(0, 20000).Select(i => $"word{i}"));
    var results = Scorer.GetScoringWords(manyWords);
}
catch (ArgumentException ex)
{
    Console.WriteLine($"Input validation failed: {ex.Message}");
    // "Input contains 20000 words, exceeding limit of 10000"
}
// Similarity threshold out of range (> 50)
try
{
    var results = Scorer.GetScoringWords("hello world", targetSimilarity: 99);
}
catch (ArgumentException ex)
{
    Console.WriteLine($"Input validation failed: {ex.Message}");
    // "targetSimilarity must be between 0 and 50"
}
All contributors (including AI agents) must follow the rules defined in AI_RULES.md and respect the directory structure in STRUCTURE.md.
- PascalCase for methods and properties.
- camelCase for local variables.
- XML Documentation required for all public members.
For detailed information on security features, threat model, and best practices, see SECURITY.md.