Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- **Expanded language analyzer support**: AnalyzerFactory now supports all 27 Lucene-backed language analyzers (Arabic, Armenian, Basque, Brazilian Portuguese, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish) plus 22 Microsoft-only languages that fall back to StandardAnalyzer. Both `.lucene` and `.microsoft` name variants are accepted.
- **Complete normalizer token filter support**: NormalizerFactory now implements all 14 Azure AI Search token filters for custom normalizers: `arabic_normalization`, `asciifolding`, `cjk_width`, `elision`, `german_normalization`, `hindi_normalization`, `indic_normalization`, `lowercase`, `persian_normalization`, `scandinavian_folding`, `scandinavian_normalization`, `sorani_normalization`, `trim`, `uppercase`.

### Added

41 changes: 20 additions & 21 deletions docs/LIMITATIONS.md
@@ -74,7 +74,6 @@ The simulator is designed for **development, learning, and testing purposes only
| **Pre-filtering for vectors** | HNSW does not support native filtering; uses post-filter |
| **Knowledge stores** | Complex Azure Storage integration |
| **AI enrichment skills** | OCR, Entity Recognition, etc. require Azure AI Services |
| **Managed Identity** | Azure-specific security feature |
| **Private endpoints** | Azure networking feature |
| **Customer-managed keys** | Azure Key Vault integration |
| **Debug sessions (skillset)** | Complex debugging infrastructure |
@@ -228,19 +227,20 @@ The following table lists **all skills available in Azure AI Search** and their

| Feature | Status | Notes |
| ------- | ------ | ----- |
| Scheduled runs | ✅ | Minimum 5 minutes |
| Scheduled runs | ✅ | Minimum 5 minutes, ISO 8601 intervals |
| On-demand runs | ✅ | - |
| Field mappings | ✅ | Basic functions |
| Field mappings | ✅ | base64Encode, base64Decode, urlEncode, urlDecode, extractTokenAtPosition |
| Output field mappings | ✅ | - |
| Change detection | ⚠️ | File timestamp only |
| Soft delete | ⚠️ | Metadata-based only |
| Parallel execution | ⚠️ | Limited |
| Change detection | ✅ | High Water Mark policy (metadata_storage_last_modified or custom column) |
| Parsing modes | ⚠️ | `default`, `json`, `jsonArray` supported; `jsonLines` and `delimitedText` not implemented |
| Soft delete | ❌ | Model accepted but not processed during indexing |
| Parallel execution | ⚠️ | Semaphore-bounded parallelism within batches |
| Incremental enrichment | ❌ | Not supported |
| Enrichment cache | ❌ | Not supported |

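The scheduled-runs row above can be illustrated with a minimal sketch. It assumes schedule intervals arrive as ISO 8601 duration strings such as `PT5M` and parses them with .NET's `XmlConvert.ToTimeSpan`. The names `IndexerScheduleSketch` and `TryParseInterval` are hypothetical and are not taken from the simulator's source.

```csharp
using System;
using System.Xml;

// Hypothetical helper for illustration; not the simulator's actual code.
public static class IndexerScheduleSketch
{
    // Minimum interval stated in the table above.
    public static readonly TimeSpan MinimumInterval = TimeSpan.FromMinutes(5);

    // Returns true when the text is a valid ISO 8601 duration of at least five minutes.
    public static bool TryParseInterval(string iso8601, out TimeSpan interval)
    {
        interval = TimeSpan.Zero;
        if (string.IsNullOrWhiteSpace(iso8601))
        {
            return false;
        }

        try
        {
            // XmlConvert parses xs:duration strings, which follow the ISO 8601 duration format.
            interval = XmlConvert.ToTimeSpan(iso8601);
        }
        catch (FormatException)
        {
            return false;
        }
        catch (OverflowException)
        {
            return false;
        }

        return interval >= MinimumInterval;
    }
}
```

Under this sketch, `PT5M` and `PT2H` are accepted, while `PT1M` (too short) and a malformed value such as `5 minutes` are rejected.
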
## Normalizer Limitations

Normalizers apply text transformations to keyword fields during filtering, sorting, and faceting. The simulator implements most of the Azure AI Search normalizers.
Normalizers apply text transformations to keyword fields during filtering, sorting, and faceting. The simulator implements all Azure AI Search normalizers, including all 14 token filters and all 3 character filter types for custom normalizers.

### Predefined Normalizers

@@ -261,15 +261,15 @@ Normalizers apply text transformations to keyword fields during filtering, sorti
| `asciifolding` | ✅ | ✅ | Removes diacritics |
| `trim` | ✅ | ✅ | Removes leading/trailing whitespace |
| `elision` | ✅ | ✅ | English contraction removal |
| `arabic_normalization` | ✅ | | Language-specific |
| `german_normalization` | ✅ | | Language-specific |
| `hindi_normalization` | ✅ | | Language-specific |
| `indic_normalization` | ✅ | | Language-specific |
| `persian_normalization` | ✅ | | Language-specific |
| `scandinavian_normalization` | ✅ | | Language-specific |
| `scandinavian_folding` | ✅ | | Language-specific |
| `sorani_normalization` | ✅ | | Language-specific |
| `cjk_width` | ✅ | | CJK width normalization |
| `arabic_normalization` | ✅ | | Normalizes Arabic orthography (alef variants, tatweel, diacritics) |
| `german_normalization` | ✅ | | Normalizes umlauts (ä→a, ö→o, ü→u) and ß→ss |
| `hindi_normalization` | ✅ | | Normalizes Devanagari nukta composites |
| `indic_normalization` | ✅ | | Removes zero-width joiners/non-joiners |
| `persian_normalization` | ✅ | | Normalizes Arabic keh/yeh to Persian equivalents |
| `scandinavian_normalization` | ✅ | | Normalizes interchangeable Scandinavian chars (æ→ä, ø→ö) |
| `scandinavian_folding` | ✅ | | Folds Scandinavian chars to ASCII (å→a, ä/æ→a, ö/ø→o) |
| `sorani_normalization` | ✅ | | Normalizes Sorani Kurdish text |
| `cjk_width` | ✅ | | Fullwidth→halfwidth ASCII, halfwidth→fullwidth Katakana |

### Character Filters (for Custom Normalizers)

@@ -283,12 +283,10 @@ Normalizers apply text transformations to keyword fields during filtering, sorti

The simulator supports custom normalizers with the following configuration:

- **Token filters**: `lowercase`, `uppercase`, `asciifolding`, `trim`, `elision`
- **Token filters**: `lowercase`, `uppercase`, `asciifolding`, `trim`, `elision`, `arabic_normalization`, `cjk_width`, `german_normalization`, `hindi_normalization`, `indic_normalization`, `persian_normalization`, `scandinavian_folding`, `scandinavian_normalization`, `sorani_normalization`
- **Character filters**: `html_strip`, `mapping`, `pattern_replace`
- Custom normalizers can be defined in the index schema and will be validated
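
As a rough illustration of how a custom normalizer's token filter list might be applied, the sketch below runs a value through named filters one after another. It assumes filters are applied in the order they are listed and that unknown names leave the value unchanged (mirroring the `_ => value` fallback in `ApplyTokenFilter`). `CustomNormalizerSketch`, its three-entry registry, and `Normalize` are hypothetical names, not the simulator's API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch; only three filters are registered to keep it short.
public static class CustomNormalizerSketch
{
    private static readonly Dictionary<string, Func<string, string>> TokenFilters =
        new Dictionary<string, Func<string, string>>
        {
            ["trim"] = s => s.Trim(),
            ["lowercase"] = s => s.ToLowerInvariant(),
            // Same replacements as the simplified german_normalization in this change.
            ["german_normalization"] = s => s
                .Replace("ä", "a").Replace("Ä", "A")
                .Replace("ö", "o").Replace("Ö", "O")
                .Replace("ü", "u").Replace("Ü", "U")
                .Replace("ß", "ss"),
        };

    // Applies each named filter in list order; unknown names are skipped (assumption).
    public static string Normalize(string value, IEnumerable<string> tokenFilterNames)
    {
        return tokenFilterNames.Aggregate(value, (current, name) =>
            TokenFilters.TryGetValue(name, out var filter) ? filter(current) : current);
    }
}
```

For example, `CustomNormalizerSketch.Normalize("  Straße ", new[] { "trim", "lowercase", "german_normalization" })` yields `strasse`, so a filter on `Straße` and one on `strasse` would match the same keyword value.
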

**Note**: Language-specific normalizers (Arabic, German, Hindi, etc.) are not implemented. For these languages, consider pre-processing your data before indexing.

## Search Query Limitations

### OData Filter Limitations
@@ -316,9 +314,10 @@ Not supported:
| Feature | Status |
| ------- | ------ |
| API keys | ✅ Supported |
| Entra ID authentication | ✅ Supported |
| Managed Identity (data sources) | ✅ Supported (system & user-assigned) |
| Key rotation | ⚠️ Manual only |
| RBAC | ❌ Not supported |
| Managed Identity | ❌ Not supported |
| IP restrictions | ❌ Not supported |
| Private endpoints | ❌ Not supported |
| Document-level security | ❌ Not supported |
@@ -359,4 +358,4 @@ When moving from the simulator to Azure AI Search:

---

*Last updated: February 13, 2026*
*Last updated: February 17, 2026*
269 changes: 266 additions & 3 deletions src/AzureAISearchSimulator.Search/NormalizerFactory.cs
@@ -241,10 +241,264 @@ private static string ApplyTokenFilter(string value, string tokenFilterName)
"asciifolding" => RemoveDiacritics(value),
"trim" => value.Trim(),
"elision" => ApplyElision(value),
"arabic_normalization" => ApplyArabicNormalization(value),
"cjk_width" => ApplyCjkWidthNormalization(value),
"german_normalization" => ApplyGermanNormalization(value),
"hindi_normalization" => ApplyHindiNormalization(value),
"indic_normalization" => ApplyIndicNormalization(value),
"persian_normalization" => ApplyPersianNormalization(value),
"scandinavian_folding" => ApplyScandinavianFolding(value),
"scandinavian_normalization" => ApplyScandinavianNormalization(value),
"sorani_normalization" => ApplySoraniNormalization(value),
_ => value
};
}

/// <summary>
/// Applies Arabic normalization.
/// Normalizes orthographic variations in Arabic text:
/// - Removes tatweel (kashida), diacritics (tashkeel)
/// - Normalizes alef variants (آ أ إ) to bare alef (ا)
/// - Normalizes teh marbuta (ة) to heh (ه)
/// - Normalizes alef maksura (ى) to yeh (ي)
/// </summary>
private static string ApplyArabicNormalization(string value)
{
var result = value;

// Remove tatweel (kashida) U+0640
result = result.Replace("\u0640", "");

// Remove Arabic diacritics (tashkeel) U+064B-U+065F, U+0670
result = Regex.Replace(result, "[\u064B-\u065F\u0670]", "");

// Normalize alef variants to bare alef
result = result.Replace('\u0622', '\u0627'); // آ → ا
result = result.Replace('\u0623', '\u0627'); // أ → ا
result = result.Replace('\u0625', '\u0627'); // إ → ا

// Normalize teh marbuta to heh
result = result.Replace('\u0629', '\u0647'); // ة → ه

// Normalize alef maksura to yeh
result = result.Replace('\u0649', '\u064A'); // ى → ي

return result;
}

/// <summary>
/// Applies CJK width normalization.
/// - Fullwidth ASCII variants (Ａ-Ｚ, ａ-ｚ, ０-９) → halfwidth ASCII (A-Z, a-z, 0-9)
/// - Halfwidth Katakana → fullwidth Katakana
/// </summary>
private static string ApplyCjkWidthNormalization(string value)
{
var builder = new System.Text.StringBuilder(value.Length);

foreach (var c in value)
{
// Fullwidth ASCII (U+FF01-U+FF5E) → halfwidth ASCII (U+0021-U+007E)
if (c >= '\uFF01' && c <= '\uFF5E')
{
builder.Append((char)(c - 0xFEE0));
}
// Fullwidth space → normal space
else if (c == '\u3000')
{
builder.Append(' ');
}
// Halfwidth Katakana (U+FF65-U+FF9F) → fullwidth Katakana
else if (c >= '\uFF65' && c <= '\uFF9F')
{
var katakanaOffset = c - '\uFF65';
// Map halfwidth katakana to fullwidth equivalents
char[] fullwidthKatakana = {
'\u30FB', '\u30F2', '\u30A1', '\u30A3', '\u30A5', '\u30A7', '\u30A9',
'\u30E3', '\u30E5', '\u30E7', '\u30C3', '\u30FC', '\u30A2', '\u30A4',
'\u30A6', '\u30A8', '\u30AA', '\u30AB', '\u30AD', '\u30AF', '\u30B1',
'\u30B3', '\u30B5', '\u30B7', '\u30B9', '\u30BB', '\u30BD', '\u30BF',
'\u30C1', '\u30C4', '\u30C6', '\u30C8', '\u30CA', '\u30CB', '\u30CC',
'\u30CD', '\u30CE', '\u30CF', '\u30D2', '\u30D5', '\u30D8', '\u30DB',
'\u30DE', '\u30DF', '\u30E0', '\u30E1', '\u30E2', '\u30E4', '\u30E6',
'\u30E8', '\u30E9', '\u30EA', '\u30EB', '\u30EC', '\u30ED', '\u30EF',
'\u30F3', '\u3099', '\u309A'
};
if (katakanaOffset < fullwidthKatakana.Length)
{
builder.Append(fullwidthKatakana[katakanaOffset]);
}
else
{
builder.Append(c);
}
}
else
{
builder.Append(c);
}
}

return builder.ToString();
}

/// <summary>
/// Applies German normalization.
/// - ä → a, ö → o, ü → u
/// - Ä → A, Ö → O, Ü → U
/// - ß → ss
/// </summary>
private static string ApplyGermanNormalization(string value)
{
var result = value;
result = result.Replace("ä", "a").Replace("Ä", "A");
result = result.Replace("ö", "o").Replace("Ö", "O");
result = result.Replace("ü", "u").Replace("Ü", "U");
result = result.Replace("ß", "ss");
return result;
}

/// <summary>
/// Applies Hindi normalization.
/// Normalizes Devanagari text by removing nukta-based spelling variation:
/// - Strips the nukta from decomposed letter + nukta sequences (क़ → क)
/// - Maps precomposed nukta consonants to their base letters (ऩ → न, ऱ → र, ऴ → ळ)
/// - Removes any remaining standalone nukta (U+093C)
/// </summary>
private static string ApplyHindiNormalization(string value)
{
var result = value;

// Strip the nukta from decomposed sequences: letter + nukta → base letter
// क़ (क + ़) → क
result = result.Replace("\u0915\u093C", "\u0915"); // क़ → क
result = result.Replace("\u0916\u093C", "\u0916"); // ख़ → ख
result = result.Replace("\u0917\u093C", "\u0917"); // ग़ → ग
result = result.Replace("\u091C\u093C", "\u091C"); // ज़ → ज
result = result.Replace("\u0921\u093C", "\u0921"); // ड़ → ड
result = result.Replace("\u0922\u093C", "\u0922"); // ढ़ → ढ
result = result.Replace("\u092B\u093C", "\u092B"); // फ़ → फ
result = result.Replace("\u092F\u093C", "\u092F"); // य़ → य

// Map precomposed nukta consonants to their base letters
result = result.Replace('\u0929', '\u0928'); // ऩ → न
result = result.Replace('\u0931', '\u0930'); // ऱ → र
result = result.Replace('\u0934', '\u0933'); // ऴ → ळ

// Remove nukta (independent)
result = result.Replace("\u093C", "");

return result;
}

/// <summary>
/// Applies Indic normalization.
/// Simplified normalization for Indic scripts (Devanagari, Bengali, etc.).
/// Removes zero-width joiners, zero-width non-joiners, and zero-width spaces only.
/// </summary>
private static string ApplyIndicNormalization(string value)
{
var result = value;

// Remove zero-width joiner and non-joiner
result = result.Replace("\u200D", ""); // Zero-width joiner
result = result.Replace("\u200C", ""); // Zero-width non-joiner

// Remove zero-width space
result = result.Replace("\u200B", "");

return result;
}

/// <summary>
/// Applies Persian normalization.
/// - Normalizes Arabic keh (ك U+0643) to Persian keheh (ک U+06A9)
/// - Normalizes Arabic yeh (ي U+064A) to Persian yeh (ی U+06CC)
/// - Removes Arabic diacritics (tashkeel)
/// - Removes tatweel (kashida)
/// </summary>
private static string ApplyPersianNormalization(string value)
{
var result = value;

// Normalize keh: Arabic keh → Persian keheh
result = result.Replace('\u0643', '\u06A9');

// Normalize yeh: Arabic yeh → Persian yeh
result = result.Replace('\u064A', '\u06CC');

// Remove tatweel (kashida)
result = result.Replace("\u0640", "");

// Remove Arabic diacritics (tashkeel)
result = Regex.Replace(result, "[\u064B-\u065F\u0670]", "");

// Replace heh with yeh above (ۀ U+06C0) with ae (U+06D5)
result = result.Replace('\u06C0', '\u06D5');

return result;
}

/// <summary>
/// Applies Scandinavian folding.
/// Folds Scandinavian-specific characters to simpler forms:
/// - å → a, Å → A
/// - ä, æ → a; Ä, Æ → A
/// - ö, ø → o; Ö, Ø → O
/// </summary>
private static string ApplyScandinavianFolding(string value)
{
var result = value;
result = result.Replace('å', 'a').Replace('Å', 'A');
result = result.Replace('ä', 'a').Replace('Ä', 'A');
result = result.Replace('æ', 'a').Replace('Æ', 'A');
result = result.Replace('ö', 'o').Replace('Ö', 'O');
result = result.Replace('ø', 'o').Replace('Ø', 'O');
return result;
}

/// <summary>
/// Applies Scandinavian normalization.
/// Normalizes interchangeable Scandinavian characters:
/// - æ → ä; Æ → Ä (interchangeable in some Scandinavian contexts)
/// - ø → ö; Ø → Ö (interchangeable in some Scandinavian contexts)
/// </summary>
private static string ApplyScandinavianNormalization(string value)
{
var result = value;
// Normalize æ → ä (both are interchangeable)
result = result.Replace('æ', 'ä').Replace('Æ', 'Ä');
// Normalize ø → ö (both are interchangeable)
result = result.Replace('ø', 'ö').Replace('Ø', 'Ö');
return result;
}

/// <summary>
/// Applies Sorani (Kurdish) normalization.
/// Normalizes Unicode representations for Sorani Kurdish:
/// - ي (Arabic yeh U+064A) and ى (alef maksura U+0649) → ی (Farsi yeh U+06CC)
/// - ك (Arabic keh U+0643) → ک (keheh U+06A9)
/// - ه (heh U+0647) → Kurdish heh (U+06D5)
/// - Removes tatweel (U+0640)
/// </summary>
private static string ApplySoraniNormalization(string value)
{
var result = value;

// Normalize yeh
result = result.Replace('\u064A', '\u06CC'); // Arabic yeh → Farsi yeh
result = result.Replace('\u0649', '\u06CC'); // Alef maksura → Farsi yeh

// Normalize keh
result = result.Replace('\u0643', '\u06A9'); // Arabic keh → keheh

// Normalize heh
result = result.Replace('\u0647', '\u06D5'); // Heh → Kurdish heh

// Remove tatweel
result = result.Replace("\u0640", "");

return result;
}

/// <summary>
/// Removes diacritics (accents) from a string.
/// Converts characters like é→e, ñ→n, ü→u, etc.
@@ -288,11 +542,20 @@ public static string[] GetSupportedTokenFilters()
{
return new[]
{
"lowercase",
"uppercase",
"arabic_normalization",
"asciifolding",
"cjk_width",
"elision",
"german_normalization",
"hindi_normalization",
"indic_normalization",
"lowercase",
"persian_normalization",
"scandinavian_folding",
"scandinavian_normalization",
"sorani_normalization",
"trim",
"elision"
"uppercase"
};
}

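The difference between `scandinavian_folding` and `scandinavian_normalization` is easy to miss, so here is a small side-by-side sketch. It restates the character replacements from `ApplyScandinavianFolding` and `ApplyScandinavianNormalization` in this change. The class `ScandinavianSketch` and its method names are hypothetical and exist only for illustration.

```csharp
// Illustrative restatement of the two Scandinavian filters; names are hypothetical.
public static class ScandinavianSketch
{
    // scandinavian_folding: fold å, ä, æ to a and ö, ø to o (same for capitals).
    public static string Fold(string value) => value
        .Replace('å', 'a').Replace('Å', 'A')
        .Replace('ä', 'a').Replace('Ä', 'A')
        .Replace('æ', 'a').Replace('Æ', 'A')
        .Replace('ö', 'o').Replace('Ö', 'O')
        .Replace('ø', 'o').Replace('Ø', 'O');

    // scandinavian_normalization: only unify the interchangeable letters (æ → ä, ø → ö).
    public static string Normalize(string value) => value
        .Replace('æ', 'ä').Replace('Æ', 'Ä')
        .Replace('ø', 'ö').Replace('Ø', 'Ö');
}
```

With these definitions, `Fold("Malmø")` and `Fold("Malmö")` both give `Malmo`, so folded values also match the plain ASCII spelling. `Normalize("Malmø")` gives `Malmö`, which matches the Swedish spelling but not `Malmo`.
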