Determinize Automaton for simple_pattern and simple_pattern_split tokenizer#20350
Determinize Automaton for simple_pattern and simple_pattern_split tokenizer#20350Bukhtawar merged 1 commit intoopensearch-project:mainfrom
Conversation
📝 WalkthroughWalkthroughTokenizer factories now determinize regex patterns into deterministic automatons (DFAs) during factory initialization and pass DFAs to tokenizers; tests added to validate tokenization for complex regexes that require determinization. Changes
Sequence Diagram(s)sequenceDiagram
participant Config as Index/Analyzer Config
participant Factory as SimplePatternTokenizerFactory
participant RegExp as RegExp (toAutomaton)
participant Ops as Operations.determinize
participant Tokenizer as SimplePatternTokenizer
Config->>Factory: provide pattern string
Factory->>RegExp: new RegExp(pattern).toAutomaton()
RegExp-->>Factory: NFA (possibly non-deterministic)
Factory->>Ops: determinize(NFA, DEFAULT_DETERMINIZE_WORK_LIMIT)
Ops-->>Factory: DFA (determinized automaton)
Factory->>Tokenizer: new SimplePatternTokenizer(DFA)
Tokenizer-->>Config: tokenization ready
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing touches
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Pro 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
modules/analysis-common/src/test/java/org/opensearch/analysis/common/SimplePatternTokenizerTests.java (1)
20-49: Consider adding test cases for edge scenarios.The current tests provide good coverage for complex patterns requiring determinization. Consider adding tests for:
- Empty pattern: Test the default empty pattern behavior to ensure it doesn't cause errors.
- Pattern from issue #20349: Test the specific pattern
[A-Za-z0-9]+|[^\n - ]that was reported to fail before this fix.- Invalid pattern: Test that invalid regex patterns produce clear error messages (if that's the intended behavior).
🔎 Example test for the pattern from issue #20349
public void testPatternFromIssue20349() throws IOException { Settings settings = Settings.builder().put("pattern", "[A-Za-z0-9]+|[^\n - ]").build(); SimplePatternTokenizerFactory factory = new SimplePatternTokenizerFactory( IndexSettingsModule.newIndexSettings(new Index("test", "_na_"), Settings.EMPTY), null, "test", settings ); Tokenizer tokenizer = factory.create(); tokenizer.setReader(new StringReader("test123 abc")); assertTokenStreamContents(tokenizer, new String[] { "test123", "abc" }); }
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
modules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternSplitTokenizerFactory.javamodules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternTokenizerFactory.javamodules/analysis-common/src/test/java/org/opensearch/analysis/common/SimplePatternTokenizerTests.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: gradle-check
- GitHub Check: Analyze (java)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, ubuntu-latest)
🔇 Additional comments (6)
modules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternSplitTokenizerFactory.java (2)
37-38: LGTM: Necessary imports for determinization.The imports for
OperationsandRegExpare required for the automaton determinization logic.
56-58: Determinization is correctly applied; error handling follows established codebase patterns.The fix correctly determinizes the automaton before passing it to the tokenizer. This addresses the non-deterministic automaton issue and aligns with the approach used in
SimplePatternTokenizerFactoryand other parts of the codebase (e.g.,ConstantKeywordFieldMapper,WildcardFieldMapper).Regarding the concerns:
Empty pattern behavior: The default empty string pattern is consistent with both
SimplePatternTokenizerFactoryandSimplePatternSplitTokenizerFactory. While this scenario is not explicitly tested, the empty pattern is accepted as the system default, indicating it is expected to work gracefully.Invalid pattern handling: Like all other uses of
RegExpconstruction across the codebase, this factory does not wrap the call in a try-catch block. TheIllegalArgumentExceptionfrom invalid regex patterns is allowed to propagate naturally. This is the established pattern used inIncludeExclude.java,ConstantKeywordFieldMapper.java,WildcardFieldMapper.java, and other components in the codebase, indicating this is the intended behavior.modules/analysis-common/src/main/java/org/opensearch/analysis/common/SimplePatternTokenizerFactory.java (2)
37-38: LGTM: Necessary imports for determinization.The imports for
OperationsandRegExpare required for the automaton determinization logic, consistent with the changes inSimplePatternSplitTokenizerFactory.
56-58: Correctly mirrors the fix in SimplePatternSplitTokenizerFactory.The implementation correctly determinizes the automaton before passing it to the tokenizer, addressing the regression from Lucene's change. The same verification points mentioned in the review of
SimplePatternSplitTokenizerFactoryapply here: empty pattern behavior and invalid pattern error handling.modules/analysis-common/src/test/java/org/opensearch/analysis/common/SimplePatternTokenizerTests.java (2)
22-34: LGTM: Solid test for non-deterministic pattern.The test validates that a complex non-deterministic pattern
(a+|b+)*cworks correctly after determinization. The test input and expected output are well-chosen.
36-48: LGTM: Good coverage for split tokenizer.The test validates that the split tokenizer handles a non-deterministic pattern
(\\s+|,+)*correctly after determinization.
...-common/src/main/java/org/opensearch/analysis/common/SimplePatternSplitTokenizerFactory.java
Outdated
Show resolved
Hide resolved
|
❌ Gradle check result for 2607f83: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #20350 +/- ##
============================================
- Coverage 73.27% 73.19% -0.09%
+ Complexity 71739 71695 -44
============================================
Files 5785 5785
Lines 328143 328145 +2
Branches 47270 47270
============================================
- Hits 240445 240181 -264
- Misses 68397 68749 +352
+ Partials 19301 19215 -86 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…enizer Signed-off-by: Mohit Godwani <mgodwan@amazon.com>
|
❌ Gradle check result for 5b06a65: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for 5b06a65: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
…enizer (opensearch-project#20350) Signed-off-by: Mohit Godwani <mgodwan@amazon.com>
Determinize Automaton for simple_pattern and simple_pattern_split tokenizer.
Description
Instead of passing the regex pattern directly to Lucene tokenizer implementation, OpenSearch now passes a determinized automaton to Lucene.
Related Issues
Fixes #20349
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Summary by CodeRabbit
Bug Fixes & Improvements
Tests
✏️ Tip: You can customize this high-level summary in your review settings.