
Determinize Automaton for simple_pattern and simple_pattern_split tokenizer #20350

Merged
Bukhtawar merged 1 commit into opensearch-project:main from mgodwan:tokenize on Jan 6, 2026

Conversation


@mgodwan mgodwan commented Jan 1, 2026

Determinize Automaton for simple_pattern and simple_pattern_split tokenizer.

Description

Instead of passing the regex pattern directly to Lucene tokenizer implementation, OpenSearch now passes a determinized automaton to Lucene.
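To illustrate what determinization buys here: a non-deterministic automaton (NFA) can be in several states at once after reading a character, while the tokenizer needs a deterministic automaton (DFA) with exactly one successor state per input. The following self-contained sketch shows the subset-construction idea on a hand-built NFA for `(a|b)*abb`; it is an illustration of the concept only and does not use Lucene's `Operations.determinize` API.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal subset-construction sketch: each DFA state is a *set* of NFA
// states. The NFA below (for (a|b)*abb) is hand-built for illustration.
public class DeterminizeSketch {

    // NFA transitions: state -> input char -> set of successor states.
    static final Map<Integer, Map<Character, Set<Integer>>> NFA = Map.of(
        0, Map.of('a', Set.of(0, 1), 'b', Set.of(0)), // nondeterministic on 'a'
        1, Map.of('b', Set.of(2)),
        2, Map.of('b', Set.of(3)),
        3, Map.<Character, Set<Integer>>of()
    );
    static final Set<Integer> NFA_ACCEPT = Set.of(3);

    // Track the set of reachable NFA states; that set is exactly the DFA
    // state the subset construction would have precomputed.
    public static boolean matches(String input) {
        Set<Integer> current = Set.of(0);
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int state : current) {
                next.addAll(NFA.getOrDefault(state, Map.of()).getOrDefault(c, Set.of()));
            }
            if (next.isEmpty()) {
                return false; // dead state: no NFA state can consume c
            }
            current = next;
        }
        for (int state : current) {
            if (NFA_ACCEPT.contains(state)) {
                return true;
            }
        }
        return false;
    }
}
```

Because the subset construction can blow up exponentially in the number of states, Lucene's determinization takes a work limit (the `Operations.DEFAULT_DETERMINIZE_WORK_LIMIT` mentioned below) to bound the cost.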

Related Issues

Fixes #20349

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Bug Fixes & Improvements

    • Tokenizers now handle complex regular-expression patterns more reliably by using a deterministic processing approach, improving accuracy and stability of token extraction.
  • Tests

    • Added tests that validate tokenizer behavior with complex regex patterns to ensure consistent tokenization outcomes.



coderabbitai bot commented Jan 1, 2026

📝 Walkthrough

Tokenizer factories now determinize regex patterns into deterministic automatons (DFAs) during factory initialization and pass DFAs to tokenizers; tests added to validate tokenization for complex regexes that require determinization.

Changes

Tokenizer factories
Files: modules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternTokenizerFactory.java, modules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternSplitTokenizerFactory.java
Summary: Added imports for RegExp and Operations. Construction now compiles the configured pattern to a RegExp, determinizes it via Operations.determinize(..., Operations.DEFAULT_DETERMINIZE_WORK_LIMIT), stores the resulting Automaton (DFA), and instantiates tokenizers with the DFA instead of the raw pattern string.

Tests
Files: modules/analysis-common/src/test/java/org/opensearch/analysis/common/SimplePatternTokenizerTests.java
Summary: New test class with testComplexRegexRequiringDeterminization (pattern `(a+|b+)*c`) and a companion split-tokenizer test (pattern `(\s+|,+)*`), covering complex regexes that require determinization.

Sequence Diagram(s)

sequenceDiagram
    participant Config as Index/Analyzer Config
    participant Factory as SimplePatternTokenizerFactory
    participant RegExp as RegExp (toAutomaton)
    participant Ops as Operations.determinize
    participant Tokenizer as SimplePatternTokenizer

    Config->>Factory: provide pattern string
    Factory->>RegExp: new RegExp(pattern).toAutomaton()
    RegExp-->>Factory: NFA (possibly non-deterministic)
    Factory->>Ops: determinize(NFA, DEFAULT_DETERMINIZE_WORK_LIMIT)
    Ops-->>Factory: DFA (determinized automaton)
    Factory->>Tokenizer: new SimplePatternTokenizer(DFA)
    Tokenizer-->>Config: tokenization ready
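In code, the factory-side flow the diagram describes looks roughly like this, using the Lucene classes named in the PR summary (the class name here is hypothetical; in the PR the logic lives in the two factory classes listed above, and exact constructor signatures may differ across Lucene versions):

```java
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.SimplePatternTokenizer;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;

// Hypothetical stand-in for SimplePatternTokenizerFactory: compile the
// configured pattern once, determinize eagerly at factory construction
// time, and hand the DFA to the tokenizer instead of the pattern string.
public class DeterminizedTokenizerFactorySketch {

    private final Automaton dfa;

    public DeterminizedTokenizerFactorySketch(String pattern) {
        // toAutomaton() may return a non-deterministic automaton for
        // patterns such as (a+|b+)*c; determinize it before use, with a
        // work limit to bound the potential state blow-up.
        Automaton automaton = new RegExp(pattern).toAutomaton();
        this.dfa = Operations.determinize(automaton, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
    }

    public Tokenizer create() {
        return new SimplePatternTokenizer(dfa);
    }
}
```

Determinizing once in the factory means the cost is paid at index-settings time rather than per tokenizer instance, and an over-limit pattern fails fast at creation.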

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I stitched a pattern from string to state,

Determinized paths to steady the gate.
Tokens now hop in orderly row,
No more nondet chaos to overthrow —
Hooray! DFA makes token streams great. 🎉

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (4 passed)
  • Title check ✅ Passed: The title accurately describes the main change: determinizing automatons for simple_pattern and simple_pattern_split tokenizers.
  • Description check ✅ Passed: The description follows the template, includes a clear explanation of the change, references the related issue (#20349), and confirms the checklist items.
  • Linked Issues check ✅ Passed: The changes directly address issue #20349 by determinizing automatons in both SimplePatternTokenizerFactory and SimplePatternSplitTokenizerFactory, resolving the non-deterministic automaton error.
  • Out of Scope Changes check ✅ Passed: All changes are directly related to determinizing automatons for the two tokenizer factories and include appropriate tests for the new behavior.
✨ Finishing touches
  • 📝 Generate docstrings

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2607f83 and 5b06a65.

📒 Files selected for processing (3)
  • modules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternSplitTokenizerFactory.java
  • modules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternTokenizerFactory.java
  • modules/analysis-common/src/test/java/org/opensearch/analysis/common/SimplePatternTokenizerTests.java
🚧 Files skipped from review as they are similar to previous changes (3)
  • modules/analysis-common/src/test/java/org/opensearch/analysis/common/SimplePatternTokenizerTests.java
  • modules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternTokenizerFactory.java
  • modules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternSplitTokenizerFactory.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
  • GitHub Check: gradle-check
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: Analyze (java)
  • GitHub Check: detect-breaking-change
  • GitHub Check: Mend Security Check

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added labels bug (Something isn't working), lucene, Other on Jan 1, 2026
@mgodwan mgodwan marked this pull request as ready for review January 1, 2026 11:51
@mgodwan mgodwan requested a review from a team as a code owner January 1, 2026 11:51

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
modules/analysis-common/src/test/java/org/opensearch/analysis/common/SimplePatternTokenizerTests.java (1)

20-49: Consider adding test cases for edge scenarios.

The current tests provide good coverage for complex patterns requiring determinization. Consider adding tests for:

  1. Empty pattern: Test the default empty pattern behavior to ensure it doesn't cause errors.
  2. Pattern from issue #20349: Test the specific pattern [A-Za-z0-9]+|[^\n - ] that was reported to fail before this fix.
  3. Invalid pattern: Test that invalid regex patterns produce clear error messages (if that's the intended behavior).
🔎 Example test for the pattern from issue #20349
// Imports assumed for a test class in modules/analysis-common tests:
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.opensearch.common.settings.Settings;
import org.opensearch.core.index.Index;
import org.opensearch.test.IndexSettingsModule;

public void testPatternFromIssue20349() throws IOException {
    Settings settings = Settings.builder().put("pattern", "[A-Za-z0-9]+|[^\n - ]").build();
    SimplePatternTokenizerFactory factory = new SimplePatternTokenizerFactory(
        IndexSettingsModule.newIndexSettings(new Index("test", "_na_"), Settings.EMPTY),
        null,
        "test",
        settings
    );

    Tokenizer tokenizer = factory.create();
    tokenizer.setReader(new StringReader("test123 abc"));
    assertTokenStreamContents(tokenizer, new String[] { "test123", "abc" });
}
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e9b5bd1 and 2607f83.

📒 Files selected for processing (3)
  • modules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternSplitTokenizerFactory.java
  • modules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternTokenizerFactory.java
  • modules/analysis-common/src/test/java/org/opensearch/analysis/common/SimplePatternTokenizerTests.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: gradle-check
  • GitHub Check: Analyze (java)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (25, ubuntu-latest)
🔇 Additional comments (6)
modules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternSplitTokenizerFactory.java (2)

37-38: LGTM: Necessary imports for determinization.

The imports for Operations and RegExp are required for the automaton determinization logic.


56-58: Determinization is correctly applied; error handling follows established codebase patterns.

The fix correctly determinizes the automaton before passing it to the tokenizer. This addresses the non-deterministic automaton issue and aligns with the approach used in SimplePatternTokenizerFactory and other parts of the codebase (e.g., ConstantKeywordFieldMapper, WildcardFieldMapper).

Regarding the concerns:

  1. Empty pattern behavior: The default empty string pattern is consistent with both SimplePatternTokenizerFactory and SimplePatternSplitTokenizerFactory. While this scenario is not explicitly tested, the empty pattern is accepted as the system default, indicating it is expected to work gracefully.

  2. Invalid pattern handling: Like all other uses of RegExp construction across the codebase, this factory does not wrap the call in a try-catch block. The IllegalArgumentException from invalid regex patterns is allowed to propagate naturally. This is the established pattern used in IncludeExclude.java, ConstantKeywordFieldMapper.java, WildcardFieldMapper.java, and other components in the codebase, indicating this is the intended behavior.

modules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternTokenizerFactory.java (2)

37-38: LGTM: Necessary imports for determinization.

The imports for Operations and RegExp are required for the automaton determinization logic, consistent with the changes in SimplePatternSplitTokenizerFactory.


56-58: Correctly mirrors the fix in SimplePatternSplitTokenizerFactory.

The implementation correctly determinizes the automaton before passing it to the tokenizer, addressing the regression from Lucene's change. The same verification points mentioned in the review of SimplePatternSplitTokenizerFactory apply here: empty pattern behavior and invalid pattern error handling.

modules/analysis-common/src/test/java/org/opensearch/analysis/common/SimplePatternTokenizerTests.java (2)

22-34: LGTM: Solid test for non-deterministic pattern.

The test validates that a complex non-deterministic pattern (a+|b+)*c works correctly after determinization. The test input and expected output are well-chosen.


36-48: LGTM: Good coverage for split tokenizer.

The test validates that the split tokenizer handles a non-deterministic pattern (\\s+|,+)* correctly after determinization.


github-actions bot commented Jan 1, 2026

❌ Gradle check result for 2607f83: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@mgodwan mgodwan closed this Jan 1, 2026
@mgodwan mgodwan reopened this Jan 1, 2026

github-actions bot commented Jan 1, 2026

✅ Gradle check result for 2607f83: SUCCESS


codecov bot commented Jan 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.19%. Comparing base (e9b5bd1) to head (5b06a65).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20350      +/-   ##
============================================
- Coverage     73.27%   73.19%   -0.09%     
+ Complexity    71739    71695      -44     
============================================
  Files          5785     5785              
  Lines        328143   328145       +2     
  Branches      47270    47270              
============================================
- Hits         240445   240181     -264     
- Misses        68397    68749     +352     
+ Partials      19301    19215      -86     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Determinize Automaton for simple_pattern and simple_pattern_split tokenizer

Signed-off-by: Mohit Godwani <mgodwan@amazon.com>

github-actions bot commented Jan 5, 2026

❌ Gradle check result for 5b06a65: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@mgodwan mgodwan closed this Jan 5, 2026
@mgodwan mgodwan reopened this Jan 5, 2026

github-actions bot commented Jan 5, 2026

❌ Gradle check result for 5b06a65: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@mgodwan mgodwan closed this Jan 6, 2026
@mgodwan mgodwan reopened this Jan 6, 2026

github-actions bot commented Jan 6, 2026

✅ Gradle check result for 5b06a65: SUCCESS

tanyabti pushed a commit to tanyabti/OpenSearch that referenced this pull request Feb 24, 2026

Development

Successfully merging this pull request may close these issues.

[BUG] Create Index Fails for index using simple_pattern tokenizer

2 participants