Ellerbach · Feb 17, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Added
 
 - **Expanded language analyzer support**: AnalyzerFactory now supports all 27 Lucene-backed language analyzers (Arabic, Armenian, Basque, Brazilian Portuguese, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish) plus 22 Microsoft-only languages that fall back to StandardAnalyzer. Both `.lucene` and `.microsoft` name variants are accepted.
+- **Complete normalizer token filter support**: NormalizerFactory now implements all 14 Azure AI Search token filters for custom normalizers: `arabic_normalization`, `asciifolding`, `cjk_width`, `elision`, `german_normalization`, `hindi_normalization`, `indic_normalization`, `lowercase`, `persian_normalization`, `scandinavian_folding`, `scandinavian_normalization`, `sorani_normalization`, `trim`, `uppercase`.
 
 ### Added
 

diff --git a/docs/LIMITATIONS.md b/docs/LIMITATIONS.md
@@ -74,7 +74,6 @@ The simulator is designed for **development, learning, and testing purposes only
 | **Pre-filtering for vectors** | HNSW does not support native filtering; uses post-filter |
 | **Knowledge stores** | Complex Azure Storage integration |
 | **AI enrichment skills** | OCR, Entity Recognition, etc. require Azure AI Services |
-| **Managed Identity** | Azure-specific security feature |
 | **Private endpoints** | Azure networking feature |
 | **Customer-managed keys** | Azure Key Vault integration |
 | **Debug sessions (skillset)** | Complex debugging infrastructure |
@@ -228,19 +227,20 @@ The following table lists **all skills available in Azure AI Search** and their
 
 | Feature | Status | Notes |
 | ------- | ------ | ----- |
-| Scheduled runs | ✅ | Minimum 5 minutes |
+| Scheduled runs | ✅ | Minimum 5 minutes, ISO 8601 intervals |
 | On-demand runs | ✅ | - |
-| Field mappings | ✅ | Basic functions |
+| Field mappings | ✅ | base64Encode, base64Decode, urlEncode, urlDecode, extractTokenAtPosition |
 | Output field mappings | ✅ | - |
-| Change detection | ⚠️ | File timestamp only |
-| Soft delete | ⚠️ | Metadata-based only |
-| Parallel execution | ⚠️ | Limited |
+| Change detection | ✅ | High Water Mark policy (metadata_storage_last_modified or custom column) |
+| Parsing modes | ⚠️ | `default`, `json`, `jsonArray` supported; `jsonLines` and `delimitedText` not implemented |
+| Soft delete | ❌ | Model accepted but not processed during indexing |
+| Parallel execution | ⚠️ | Semaphore-bounded parallelism within batches |
 | Incremental enrichment | ❌ | Not supported |
 | Enrichment cache | ❌ | Not supported |
 
 ## Normalizer Limitations
 
-Normalizers apply text transformations to keyword fields during filtering, sorting, and faceting. The simulator implements most of the Azure AI Search normalizers.
+Normalizers apply text transformations to keyword fields during filtering, sorting, and faceting. The simulator implements all Azure AI Search normalizers, including all 14 token filters and all 3 character filter types for custom normalizers.
 
 ### Predefined Normalizers
 
@@ -261,15 +261,15 @@ Normalizers apply text transformations to keyword fields during filtering, sorti
 | `asciifolding` | ✅ | ✅ | Removes diacritics |
 | `trim` | ✅ | ✅ | Removes leading/trailing whitespace |
 | `elision` | ✅ | ✅ | English contraction removal |
-| `arabic_normalization` | ✅ | ❌ | Language-specific |
-| `german_normalization` | ✅ | ❌ | Language-specific |
-| `hindi_normalization` | ✅ | ❌ | Language-specific |
-| `indic_normalization` | ✅ | ❌ | Language-specific |
-| `persian_normalization` | ✅ | ❌ | Language-specific |
-| `scandinavian_normalization` | ✅ | ❌ | Language-specific |
-| `scandinavian_folding` | ✅ | ❌ | Language-specific |
-| `sorani_normalization` | ✅ | ❌ | Language-specific |
-| `cjk_width` | ✅ | ❌ | CJK width normalization |
+| `arabic_normalization` | ✅ | ✅ | Normalizes Arabic orthography (alef variants, tatweel, diacritics) |
+| `german_normalization` | ✅ | ✅ | Normalizes umlauts (ä→a, ö→o, ü→u) and ß→ss |
+| `hindi_normalization` | ✅ | ✅ | Normalizes Devanagari nukta composites |
+| `indic_normalization` | ✅ | ✅ | Removes zero-width joiners/non-joiners |
+| `persian_normalization` | ✅ | ✅ | Normalizes Arabic keh/yeh to Persian equivalents |
+| `scandinavian_normalization` | ✅ | ✅ | Normalizes interchangeable Scandinavian chars (æ→ä, ø→ö) |
+| `scandinavian_folding` | ✅ | ✅ | Folds Scandinavian chars to ASCII (å→a, ä/æ→a, ö/ø→o) |
+| `sorani_normalization` | ✅ | ✅ | Normalizes Sorani Kurdish text |
+| `cjk_width` | ✅ | ✅ | Fullwidth→halfwidth ASCII, halfwidth→fullwidth Katakana |
 
 ### Character Filters (for Custom Normalizers)
 
@@ -283,12 +283,10 @@ Normalizers apply text transformations to keyword fields during filtering, sorti
 
 The simulator supports custom normalizers with the following configuration:
 
-- **Token filters**: `lowercase`, `uppercase`, `asciifolding`, `trim`, `elision`
+- **Token filters**: `lowercase`, `uppercase`, `asciifolding`, `trim`, `elision`, `arabic_normalization`, `cjk_width`, `german_normalization`, `hindi_normalization`, `indic_normalization`, `persian_normalization`, `scandinavian_folding`, `scandinavian_normalization`, `sorani_normalization`
 - **Character filters**: `html_strip`, `mapping`, `pattern_replace`
 - Custom normalizers can be defined in the index schema and will be validated
 
-**Note**: Language-specific normalizers (Arabic, German, Hindi, etc.) are not implemented. For these languages, consider pre-processing your data before indexing.
-
 ## Search Query Limitations
 
 ### OData Filter Limitations
@@ -316,9 +314,10 @@ Not supported:
 | Feature | Status |
 | ------- | ------ |
 | API keys | ✅ Supported |
+| Entra ID authentication | ✅ Supported |
+| Managed Identity (data sources) | ✅ Supported (system & user-assigned) |
 | Key rotation | ⚠️ Manual only |
 | RBAC | ❌ Not supported |
-| Managed Identity | ❌ Not supported |
 | IP restrictions | ❌ Not supported |
 | Private endpoints | ❌ Not supported |
 | Document-level security | ❌ Not supported |
@@ -359,4 +358,4 @@ When moving from the simulator to Azure AI Search:
 
 ---
 
-*Last updated: February 13, 2026*
+*Last updated: February 17, 2026*
diff --git a/src/AzureAISearchSimulator.Search/NormalizerFactory.cs b/src/AzureAISearchSimulator.Search/NormalizerFactory.cs
@@ -241,10 +241,264 @@ private static string ApplyTokenFilter(string value, string tokenFilterName)
             "asciifolding" => RemoveDiacritics(value),
             "trim" => value.Trim(),
             "elision" => ApplyElision(value),
+            "arabic_normalization" => ApplyArabicNormalization(value),
+            "cjk_width" => ApplyCjkWidthNormalization(value),
+            "german_normalization" => ApplyGermanNormalization(value),
+            "hindi_normalization" => ApplyHindiNormalization(value),
+            "indic_normalization" => ApplyIndicNormalization(value),
+            "persian_normalization" => ApplyPersianNormalization(value),
+            "scandinavian_folding" => ApplyScandinavianFolding(value),
+            "scandinavian_normalization" => ApplyScandinavianNormalization(value),
+            "sorani_normalization" => ApplySoraniNormalization(value),
             _ => value
         };
     }
 
+    /// <summary>
+    /// Applies Arabic normalization.
+    /// Normalizes orthographic variations in Arabic text:
+    /// - Removes tatweel (kashida), diacritics (tashkeel)
+    /// - Normalizes alef variants (آ أ إ) to bare alef (ا)
+    /// - Normalizes teh marbuta (ة) to heh (ه)
+    /// - Normalizes alef maksura (ى) to yeh (ي)
+    /// </summary>
+    private static string ApplyArabicNormalization(string value)
+    {
+        var result = value;
+
+        // Remove tatweel (kashida) U+0640
+        result = result.Replace("\u0640", "");
+
+        // Remove Arabic diacritics (tashkeel) U+064B-U+065F, U+0670
+        result = Regex.Replace(result, "[\u064B-\u065F\u0670]", "");
+
+        // Normalize alef variants to bare alef
+        result = result.Replace('\u0622', '\u0627'); // آ → ا
+        result = result.Replace('\u0623', '\u0627'); // أ → ا
+        result = result.Replace('\u0625', '\u0627'); // إ → ا
+
+        // Normalize teh marbuta to heh
+        result = result.Replace('\u0629', '\u0647'); // ة → ه
+
+        // Normalize alef maksura to yeh
+        result = result.Replace('\u0649', '\u064A'); // ى → ي
+
+        return result;
+    }
+
+    /// <summary>
+    /// Applies CJK width normalization.
+    /// - Fullwidth ASCII variants (U+FF01-U+FF5E: Ａ-Ｚ, ａ-ｚ, ０-９, punctuation) → halfwidth ASCII (U+0021-U+007E)
+    /// - Halfwidth Katakana → fullwidth Katakana
+    /// </summary>
+    private static string ApplyCjkWidthNormalization(string value)
+    {
+        var builder = new System.Text.StringBuilder(value.Length);
+
+        foreach (var c in value)
+        {
+            // Fullwidth ASCII (U+FF01-U+FF5E) → halfwidth ASCII (U+0021-U+007E)
+            if (c >= '\uFF01' && c <= '\uFF5E')
+            {
+                builder.Append((char)(c - 0xFEE0));
+            }
+            // Ideographic (fullwidth) space U+3000 → ASCII space
+            else if (c == '\u3000')
+            {
+                builder.Append(' ');
+            }
+            // Halfwidth Katakana (U+FF65-U+FF9F) → fullwidth Katakana
+            else if (c >= '\uFF65' && c <= '\uFF9F')
+            {
+                var katakanaOffset = c - '\uFF65';
+                // Map halfwidth katakana to fullwidth equivalents
+                char[] fullwidthKatakana = {
+                    '\u30FB', '\u30F2', '\u30A1', '\u30A3', '\u30A5', '\u30A7', '\u30A9',
+                    '\u30E3', '\u30E5', '\u30E7', '\u30C3', '\u30FC', '\u30A2', '\u30A4',
+                    '\u30A6', '\u30A8', '\u30AA', '\u30AB', '\u30AD', '\u30AF', '\u30B1',
+                    '\u30B3', '\u30B5', '\u30B7', '\u30B9', '\u30BB', '\u30BD', '\u30BF',
+                    '\u30C1', '\u30C4', '\u30C6', '\u30C8', '\u30CA', '\u30CB', '\u30CC',
+                    '\u30CD', '\u30CE', '\u30CF', '\u30D2', '\u30D5', '\u30D8', '\u30DB',
+                    '\u30DE', '\u30DF', '\u30E0', '\u30E1', '\u30E2', '\u30E4', '\u30E6',
+                    '\u30E8', '\u30E9', '\u30EA', '\u30EB', '\u30EC', '\u30ED', '\u30EF',
+                    '\u30F3', '\u3099', '\u309A'
+                };
+                if (katakanaOffset < fullwidthKatakana.Length)
+                {
+                    builder.Append(fullwidthKatakana[katakanaOffset]);
+                }
+                else
+                {
+                    builder.Append(c);
+                }
+            }
+            else
+            {
+                builder.Append(c);
+            }
+        }
+
+        return builder.ToString();
+    }
+
+    /// <summary>
+    /// Applies German normalization.
+    /// - ä → a, ö → o, ü → u
+    /// - Ä → A, Ö → O, Ü → U
+    /// - ß → ss
+    /// </summary>
+    private static string ApplyGermanNormalization(string value)
+    {
+        var result = value;
+        result = result.Replace("ä", "a").Replace("Ä", "A");
+        result = result.Replace("ö", "o").Replace("Ö", "O");
+        result = result.Replace("ü", "u").Replace("Ü", "U");
+        result = result.Replace("ß", "ss");
+        return result;
+    }
+
+    /// <summary>
+    /// Applies Hindi normalization.
+    /// Normalizes Devanagari text by standardizing Unicode representations:
+    /// - Strips the nukta (U+093C) from letter + nukta sequences, leaving the base letter
+    /// - Maps precomposed nukta consonants (ऩ, ऱ, ऴ) to their base letters (न, र, ळ)
+    /// - Leaves visarga and chandrabindu unchanged
+    /// </summary>
+    private static string ApplyHindiNormalization(string value)
+    {
+        var result = value;
+
+        // Strip nukta from letter + nukta sequences, leaving the base letter
+        // e.g. क़ (क + ़) → क
+        result = result.Replace("\u0915\u093C", "\u0915"); // क़ → क
+        result = result.Replace("\u0916\u093C", "\u0916"); // ख़ → ख
+        result = result.Replace("\u0917\u093C", "\u0917"); // ग़ → ग
+        result = result.Replace("\u091C\u093C", "\u091C"); // ज़ → ज
+        result = result.Replace("\u0921\u093C", "\u0921"); // ड़ → ड
+        result = result.Replace("\u0922\u093C", "\u0922"); // ढ़ → ढ
+        result = result.Replace("\u092B\u093C", "\u092B"); // फ़ → फ
+        result = result.Replace("\u092F\u093C", "\u092F"); // य़ → य
+
+        // Map precomposed nukta consonants to their base letters
+        result = result.Replace('\u0929', '\u0928'); // ऩ → न
+        result = result.Replace('\u0931', '\u0930'); // ऱ → र
+        result = result.Replace('\u0934', '\u0933'); // ऴ → ळ
+
+        // Remove any remaining nukta (U+093C) not covered above
+        result = result.Replace("\u093C", "");
+
+        return result;
+    }
+
+    /// <summary>
+    /// Applies Indic normalization.
+    /// Normalizes Unicode representations across Indic scripts (Devanagari, Bengali, etc.).
+    /// Primarily handles zero-width joiners/non-joiners and common normalization.
+    /// </summary>
+    private static string ApplyIndicNormalization(string value)
+    {
+        var result = value;
+
+        // Remove zero-width joiner and non-joiner
+        result = result.Replace("\u200D", ""); // Zero-width joiner
+        result = result.Replace("\u200C", ""); // Zero-width non-joiner
+
+        // Remove zero-width space
+        result = result.Replace("\u200B", "");
+
+        return result;
+    }
+
+    /// <summary>
+    /// Applies Persian normalization.
+    /// - Normalizes Arabic keh (ك U+0643) to Persian keheh (ک U+06A9)
+    /// - Normalizes Arabic yeh (ي U+064A) to Persian yeh (ی U+06CC)
+    /// - Removes Arabic diacritics (tashkeel)
+    /// - Removes tatweel (kashida)
+    /// </summary>
+    private static string ApplyPersianNormalization(string value)
+    {
+        var result = value;
+
+        // Normalize keh: Arabic keh → Persian keheh
+        result = result.Replace('\u0643', '\u06A9');
+
+        // Normalize yeh: Arabic yeh → Persian yeh
+        result = result.Replace('\u064A', '\u06CC');
+
+        // Remove tatweel (kashida)
+        result = result.Replace("\u0640", "");
+
+        // Remove Arabic diacritics (tashkeel)
+        result = Regex.Replace(result, "[\u064B-\u065F\u0670]", "");
+
+        // Normalize heh with yeh above (ۀ U+06C0) to ae (ە U+06D5)
+        result = result.Replace('\u06C0', '\u06D5');
+
+        return result;
+    }
+
+    /// <summary>
+    /// Applies Scandinavian folding.
+    /// Folds Scandinavian-specific characters to simpler forms:
+    /// - å → a, Å → A
+    /// - ä, æ → a; Ä, Æ → A
+    /// - ö, ø → o; Ö, Ø → O
+    /// </summary>
+    private static string ApplyScandinavianFolding(string value)
+    {
+        var result = value;
+        result = result.Replace('å', 'a').Replace('Å', 'A');
+        result = result.Replace('ä', 'a').Replace('Ä', 'A');
+        result = result.Replace('æ', 'a').Replace('Æ', 'A');
+        result = result.Replace('ö', 'o').Replace('Ö', 'O');
+        result = result.Replace('ø', 'o').Replace('Ø', 'O');
+        return result;
+    }
+
+    /// <summary>
+    /// Applies Scandinavian normalization.
+    /// Normalizes interchangeable Scandinavian characters:
+    /// - æ → ä, Æ → Ä (interchangeable in some Scandinavian contexts)
+    /// - ø → ö, Ø → Ö (interchangeable in some Scandinavian contexts)
+    /// </summary>
+    private static string ApplyScandinavianNormalization(string value)
+    {
+        var result = value;
+        // Normalize æ → ä (both are interchangeable)
+        result = result.Replace('æ', 'ä').Replace('Æ', 'Ä');
+        // Normalize ø → ö (both are interchangeable)
+        result = result.Replace('ø', 'ö').Replace('Ø', 'Ö');
+        return result;
+    }
+
+    /// <summary>
+    /// Applies Sorani (Kurdish) normalization.
+    /// Normalizes Unicode representations for Sorani Kurdish:
+    /// - ي (Arabic yeh U+064A) → ی (Farsi yeh U+06CC)
+    /// - ك (Arabic keh U+0643) → ک (keheh U+06A9)
+    /// - ه (heh U+0647) → ە (ae U+06D5); ى (alef maksura U+0649) → ی; removes tatweel (U+0640)
+    /// </summary>
+    private static string ApplySoraniNormalization(string value)
+    {
+        var result = value;
+
+        // Normalize yeh
+        result = result.Replace('\u064A', '\u06CC'); // Arabic yeh → Farsi yeh
+        result = result.Replace('\u0649', '\u06CC'); // Alef maksura → Farsi yeh
+
+        // Normalize keh
+        result = result.Replace('\u0643', '\u06A9'); // Arabic keh → keheh
+
+        // Normalize heh
+        result = result.Replace('\u0647', '\u06D5'); // Heh → Kurdish heh
+
+        // Remove tatweel
+        result = result.Replace("\u0640", "");
+
+        return result;
+    }
+
     /// <summary>
     /// Removes diacritics (accents) from a string.
     /// Converts characters like é→e, ñ→n, ü→u, etc.
@@ -288,11 +542,20 @@ public static string[] GetSupportedTokenFilters()
     {
         return new[]
         {
-            "lowercase",
-            "uppercase",
+            "arabic_normalization",
             "asciifolding",
+            "cjk_width",
+            "elision",
+            "german_normalization",
+            "hindi_normalization",
+            "indic_normalization",
+            "lowercase",
+            "persian_normalization",
+            "scandinavian_folding",
+            "scandinavian_normalization",
+            "sorani_normalization",
             "trim",
-            "elision"
+            "uppercase"
         };
     }