Skip to content

[Derived Source] Support ignore_above/normalizer for keyword/wildcard field and optimise text field lookup#20113

Open
tanik98 wants to merge 5 commits intoopensearch-project:mainfrom
tanik98:derived-source-text-field
Open

[Derived Source] Support ignore_above/normalizer for keyword/wildcard field and optimise text field lookup#20113
tanik98 wants to merge 5 commits intoopensearch-project:mainfrom
tanik98:derived-source-text-field

Conversation

@tanik98
Copy link
Copy Markdown
Contributor

@tanik98 tanik98 commented Nov 28, 2025

  • For keyword and wildcard field handle ignore_above by explicitly storing values exceeding that and for noarmalizer store value in all cases, and use stored field for using it as a fallback
  • For text field, use sub keyword fields for deriving source(using doc_values) instead of explicitly storing the value, this way we can avoid storing duplicates(as a doc_values under sub keyword and stored field under text field).
  • In case of multiple keyword fields, we will be going through each one to find first one, and in case of ingored field, we are choosing keyword with max value of ignore_above to save the dupplication of docValue and fallback stored field. Updation of sub keyword field is append only, we can't remove sub keyword once added, these would get appended in applicable keyword fields for derived source as and when mapping gets updated.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for ignore_above parameter on keyword and wildcard fields.
    • Optimized text field retrieval in derived source scenarios.
  • Bug Fixes

    • Improved handling of values exceeding the ignore_above threshold for keyword and wildcard fields with derived source enabled.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Nov 28, 2025

📝 Walkthrough

Walkthrough

This PR extends support for the ignore_above configuration to keyword and wildcard fields within derived source mappings. It introduces a composite field value fetcher mechanism to retrieve values from multiple sources (stored fields, doc values, and a dedicated ignore field) and updates core mapper classes to store values exceeding ignore_above in a special derived-source field for later retrieval.

Changes

Cohort / File(s) Summary
Changelog
CHANGELOG.md
Adds entry documenting support for ignore_above on keyword/wildcard fields and text field optimization under derived source.
Core mapper infrastructure
server/src/main/java/org/opensearch/index/mapper/CompositeFieldValueFetcher.java, FieldValueFetcher.java
Introduces CompositeFieldValueFetcher class that chains multiple FieldValueFetcher instances, trying each in sequence until a non-empty result is found. Updates FieldValueFetcher.write to handle null value lists.
Field type additions
server/src/main/java/org/opensearch/index/mapper/StringFieldType.java
Adds derivedSourceIgnoreFieldName() method and private constant IGNORED_VALUE_FIELD_SUFFIX to support derived source ignore field naming.
Keyword field mapper
server/src/main/java/org/opensearch/index/mapper/KeywordFieldMapper.java
Relaxes derived-source eligibility checks for ignore_above; introduces DerivedSourceHelper to manage value fetching from primary field and fallback ignore field; stores values exceeding ignore_above in dedicated ignore field when derived source is enabled.
Wildcard field mapper
server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java
Updates parse flow to store values exceeding ignore_above in derived source ignore field; relaxes guard conditions in canDeriveSourceInternal; replaces single SortedSetDocValuesFetcher with composite approach; introduces DerivedSourceHelper.
Text field mapper
server/src/main/java/org/opensearch/index/mapper/TextFieldMapper.java
Extends field type construction to propagate MultiFields; adds public getters/setters for derived-source keyword support and ignore length tracking; injects logic to store ignore field when needed; reworks derived source field value fetching to use CompositeFieldValueFetcher; introduces DerivedSourceHelper to detect and collect keyword-derived fetchers.
Match-only text field mapper
server/src/main/java/org/opensearch/index/mapper/MatchOnlyTextFieldMapper.java
Updates buildFieldType method signature to accept MultiFields parameter; threads multi-field configuration through field type construction.
Integration tests
server/src/internalClusterTest/java/org/opensearch/get/GetActionIT.java, recovery/FullRollingRestartIT.java, search/simple/SimpleSearchIT.java, update/UpdateIT.java
Updates derived source test mappings: extends keyword sub-field definitions with ignore_above configuration; removes store: true from text fields and doc_values: true from wildcard fields; updates document indexing and assertions to reflect new field configurations.
Unit tests
server/src/test/java/org/opensearch/index/mapper/CompositeFieldValueFetcherTests.java
New test class validating CompositeFieldValueFetcher behavior across multiple fetcher scenarios, empty lists, and exception propagation.
Keyword field mapper tests
server/src/test/java/org/opensearch/index/mapper/KeywordFieldMapperTests.java
Adds factory method getMapperServiceForDerivedSource and tests for derived source behavior with ignore_above threshold scenarios; replaces exception expectations with no-exception assertions for canDeriveSource path.
Text field mapper tests
server/src/test/java/org/opensearch/index/mapper/TextFieldMapperTests.java
Renames and refactors stored field test; adds multiple tests for derived source with keyword sub-fields, ignore_above thresholds, normalization, and long text handling; updates IndexAnalyzers construction with "whitespace" analyzer.
Wildcard field mapper tests
server/src/test/java/org/opensearch/index/mapper/WildcardFieldMapperTests.java
Adds tests for derived value fetching and parse behavior around ignore_above threshold with derived source enabled; introduces getMapperServiceForDerivedSource helper and assertDoesNotThrow utility.
Object mapper tests
server/src/test/java/org/opensearch/index/mapper/ObjectMapperTests.java
Removes multiple derive-source validation test cases (ignore_above, nested object fields, copy_to, multi-field scenarios); simplifies remaining multi-type field mapping by removing stored and doc_values flags.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Parser as Field Parser
    participant IgnoreField as Ignore Field Storage
    participant Derived as Derived Source Generator
    participant Composite as CompositeFieldValueFetcher
    participant Primary as Primary Fetcher
    participant Fallback as Fallback Fetcher

    Client->>Parser: Index document with text exceeding ignore_above
    Parser->>Parser: Check if derived source enabled & value > ignore_above
    alt Value exceeds ignore_above threshold
        Parser->>IgnoreField: Store value in derivedSourceIgnore field
        Parser->>Derived: Skip normal indexing, return early
    else Value within threshold
        Parser->>Parser: Normal field indexing
    end
    
    Note over Client,Fallback: Later: Deriving source for document
    Client->>Derived: Request derived source
    Derived->>Composite: Create fetcher chain
    Composite->>Primary: Try primary fetcher (doc values or stored)
    alt Primary has value
        Primary-->>Composite: Return value
        Composite-->>Derived: Return converted value
    else Primary empty
        Composite->>Fallback: Try fallback fetcher (ignore field)
        Fallback->>IgnoreField: Read from derivedSourceIgnore field
        IgnoreField-->>Fallback: Return stored value
        Fallback-->>Composite: Return converted value
        Composite-->>Derived: Return value
    end
    Derived-->>Client: Return derived source with retrieved value
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • CompositeFieldValueFetcher logic and proper ordering of fetcher chains across multiple field mappers
  • Derived source storage and retrieval paths in KeywordFieldMapper, WildcardFieldMapper, and TextFieldMapper — verify null-safety and edge cases around ignore_above thresholds
  • Public API additions to StringFieldType and TextFieldType — confirm backward compatibility and semantic correctness of new getters/setters
  • DerivedSourceHelper implementations across three mapper classes — ensure consistency in keyword detection and fetcher collection logic
  • Multi-field wiring changes in TextFieldMapper and MatchOnlyTextFieldMapper — verify MultiFields parameter threading does not break existing functionality
  • Integration test updates — confirm that removed store and doc_values flags do not inadvertently change test expectations or mask regressions

Suggested labels

Search:Performance

Suggested reviewers

  • cwperks
  • gbbafna
  • saratvemulapalli
  • sachinpkale
  • kotwanikunal

Poem

🐰 Ignoring above, we now derive with care,
Composite fetchers fetch from everywhere,
Keyword and wildcard fields now play their part,
Derived source shines, a clever work of art!
Long text? No problem—we store what's spared!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.12% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately summarizes the main changes: supporting ignore_above/normalizer for keyword/wildcard fields and optimizing text field lookup under derived source, matching the changelog entry and core implementation across multiple field mappers.
Description check ✅ Passed The PR description adequately explains the changes being made, including the derived-source handling modifications for keyword/wildcard fields, text field optimization, and multi-keyword field strategy.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@tanik98 tanik98 changed the title Improve derived source storage footprint for text field and handle ig… Improve storage footprint for text field under derived source and handle ig… Nov 28, 2025
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for e4707a1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 6ad1447: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Dec 8, 2025

❌ Gradle check result for c152793: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@tanik98 tanik98 changed the title Improve storage footprint for text field under derived source and handle ig… Support ignore_above for keyword/wildcard field and optimise text field under derived source Dec 8, 2025
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Dec 8, 2025

❌ Gradle check result for 9a2ff84: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@tanik98 tanik98 force-pushed the derived-source-text-field branch from 9a2ff84 to de290a0 Compare December 8, 2025 19:09
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Dec 8, 2025

❌ Gradle check result for de290a0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@tanik98 tanik98 force-pushed the derived-source-text-field branch from de290a0 to 007aa20 Compare December 9, 2025 03:49
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Dec 9, 2025

❌ Gradle check result for 007aa20: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@tanik98 tanik98 force-pushed the derived-source-text-field branch from 007aa20 to 88ac318 Compare December 9, 2025 04:16
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Dec 9, 2025

✅ Gradle check result for 88ac318: SUCCESS

@codecov
Copy link
Copy Markdown

codecov bot commented Dec 9, 2025

Codecov Report

❌ Patch coverage is 82.06897% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.26%. Comparing base (d47931e) to head (88ac318).
⚠️ Report is 75 commits behind head on main.

Files with missing lines Patch % Lines
...g/opensearch/index/mapper/WildcardFieldMapper.java 71.42% 6 Missing and 4 partials ⚠️
...rg/opensearch/index/mapper/KeywordFieldMapper.java 76.47% 5 Missing and 3 partials ⚠️
...a/org/opensearch/index/mapper/TextFieldMapper.java 86.20% 3 Missing and 5 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20113      +/-   ##
============================================
- Coverage     73.30%   73.26%   -0.04%     
- Complexity    71732    71761      +29     
============================================
  Files          5793     5794       +1     
  Lines        328056   328228     +172     
  Branches      47245    47285      +40     
============================================
+ Hits         240476   240490      +14     
- Misses        68264    68442     +178     
+ Partials      19316    19296      -20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java (1)

232-238: Dead code in else branch.

The condition if (fieldType().hasDocValues() == false) at line 235 is always true within this else branch (since we reach it when fieldType().hasDocValues() returns false at line 232). This makes the inner condition redundant.

     if (fieldType().hasDocValues()) {
         context.doc().add(new SortedSetDocValuesField(fieldType().name(), binaryValue));
     } else {
-        if (fieldType().hasDocValues() == false) {
-            createFieldNamesField(context);
-        }
+        createFieldNamesField(context);
     }
🧹 Nitpick comments (11)
server/src/test/java/org/opensearch/index/mapper/KeywordFieldMapperTests.java (2)

614-647: Consider extracting the helper to a shared test utility.

The getMapperServiceForDerivedSource() method is nearly identical to the one in WildcardFieldMapperTests.java (lines 438-471). This code duplication could be consolidated into a shared test utility class or a common base class method to improve maintainability.


689-695: Custom assertDoesNotThrow helper is duplicated.

This helper is also present in WildcardFieldMapperTests.java. Consider using JUnit's built-in capabilities or consolidating this into a shared test utility. Additionally, the current implementation catches Exception but not Throwable, which means AssertionError and other errors won't be caught.

server/src/main/java/org/opensearch/index/mapper/CompositeFieldValueFetcher.java (1)

27-30: Consider defensive copy of the fetchers list.

The constructor stores a direct reference to the passed list. If the caller modifies the list after construction, it could lead to unexpected behavior.

 public CompositeFieldValueFetcher(String simpleName, List<FieldValueFetcher> fieldValueFetchers) {
     super(simpleName);
-    this.fieldValueFetchers = fieldValueFetchers;
+    this.fieldValueFetchers = List.copyOf(fieldValueFetchers);
 }
server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java (1)

963-1015: DerivedSourceHelper is duplicated across mappers.

The DerivedSourceHelper inner class is nearly identical to the one in KeywordFieldMapper.java (lines 959-1004 in the relevant code snippet). Consider extracting this to a shared utility class to reduce code duplication and improve maintainability.

server/src/test/java/org/opensearch/index/mapper/WildcardFieldMapperTests.java (2)

438-471: Duplicated helper method across test classes.

getMapperServiceForDerivedSource() is nearly identical to the one in KeywordFieldMapperTests.java. This should be consolidated into a shared test utility to avoid maintenance burden.

Additionally, line 456 has the same issue noted in KeywordFieldMapperTests.java - getIndexSettings() may return different settings than the locally constructed indexSettings with derived source enabled.

-        ScriptService scriptService = new ScriptService(getIndexSettings(), scriptModule.engines, scriptModule.contexts);
+        ScriptService scriptService = new ScriptService(indexSettings.getSettings(), scriptModule.engines, scriptModule.contexts);

500-506: Duplicated assertDoesNotThrow helper.

Same as in KeywordFieldMapperTests.java - consider consolidating into a shared test utility.

server/src/test/java/org/opensearch/index/mapper/TextFieldMapperTests.java (1)

1197-1213: Consider adding assertion for sub-keyword doc_values presence.

The test verifies that no stored field is added when the value fits within ignore_above, but it doesn't verify that the keyword sub-field's doc_values would actually contain the value. Consider adding an assertion to confirm the keyword field is indexed properly.

         // Verify stored field is not added
         TextFieldMapper textFieldMapper = (TextFieldMapper) mapperService.documentMapper().mappers().getMapper("field");
         IndexableField[] storedFields = doc.rootDoc().getFields(textFieldMapper.fieldType().derivedSourceIgnoreFieldName());
         assertEquals(0, storedFields.length);
+
+        // Verify keyword sub-field has doc_values
+        IndexableField[] keywordFields = doc.rootDoc().getFields("field.keyword");
+        assertTrue(keywordFields.length > 0);
     }
server/src/main/java/org/opensearch/index/mapper/MatchOnlyTextFieldMapper.java (1)

156-190: Unused multiFields parameter in buildFieldType.

The multiFields parameter is added to the method signature for interface consistency with TextFieldMapper.Builder.buildFieldType(), but it's not used in the method body. This is likely intentional since MatchOnlyTextFieldMapper may not need derived-source keyword support.

Consider adding a brief comment explaining why multiFields is unused, or if derived-source keyword support should be added for match_only_text fields in the future.

     @Override
-    protected MatchOnlyTextFieldType buildFieldType(FieldType fieldType, MultiFields multiFields, BuilderContext context) {
+    protected MatchOnlyTextFieldType buildFieldType(FieldType fieldType, MultiFields multiFields, BuilderContext context) {
+        // Note: multiFields parameter unused - match_only_text does not support derived source keyword optimization
         NamedAnalyzer indexAnalyzer = analyzers.getIndexAnalyzer();
server/src/main/java/org/opensearch/index/mapper/TextFieldMapper.java (3)

1276-1319: Review the anonymous MappedFieldType implementation for ignored field fetching.

The anonymous MappedFieldType at lines 1286-1308 returns null from valueFetcher() and termQuery(). While this works for the current use case (only StoredFieldFetcher uses isStored()), this could cause issues if these methods are called unexpectedly.

Consider using a minimal implementation that throws UnsupportedOperationException for methods that shouldn't be called, making debugging easier if misuse occurs.

             @Override
             public ValueFetcher valueFetcher(QueryShardContext context, SearchLookup searchLookup, String format) {
-                return null;
+                throw new UnsupportedOperationException("valueFetcher not supported for derived source ignore field");
             }

             @Override
             public Query termQuery(Object value, QueryShardContext context) {
-                return null;
+                throw new UnsupportedOperationException("termQuery not supported for derived source ignore field");
             }

1334-1338: Consider clarifying the KEYWORD_ANALYZER check.

The condition Lucene.KEYWORD_ANALYZER.equals(keywordFieldMapper.fieldType().normalizer()) allows keywords with KEYWORD_ANALYZER as normalizer. This might be intentional (treating no-op normalizer as acceptable), but could be confusing.

Consider adding a comment explaining this behavior:

     private static boolean isDerivedSourceSupportedKeyword(KeywordFieldMapper keywordFieldMapper) {
+        // A keyword supports derived source if it has no normalizer (or uses the identity KEYWORD_ANALYZER)
+        // and has either stored field or doc_values enabled
         return (keywordFieldMapper.fieldType().normalizer() == null
             || Lucene.KEYWORD_ANALYZER.equals(keywordFieldMapper.fieldType().normalizer()))
             && (keywordFieldMapper.fieldType().isStored() || keywordFieldMapper.fieldType().hasDocValues());
     }

1340-1350: Default value of -1 for ignoredLength may cause unexpected behavior.

If no keyword sub-fields support derived source, getIgnoredLengthForDerivedSourceSupportedKeyword returns -1. This is later compared with value.length() in parseCreateField. Since any string length is > -1, this would trigger storing in the ignore field, which is correct behavior.

However, the use of -1 as a sentinel value is implicit. Consider documenting this or using a named constant.

+    // Sentinel value indicating no derived-source-supported keyword exists
+    private static final int NO_KEYWORD_IGNORE_LENGTH = -1;
+
     private static int getIgnoredLengthForDerivedSourceSupportedKeyword(MultiFields multiFields) {
-        int ignoredLength = -1;
+        int ignoredLength = NO_KEYWORD_IGNORE_LENGTH;
         for (Mapper mapper : multiFields) {
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9f8d381 and 88ac318.

📒 Files selected for processing (17)
  • CHANGELOG.md (1 hunks)
  • server/src/internalClusterTest/java/org/opensearch/get/GetActionIT.java (3 hunks)
  • server/src/internalClusterTest/java/org/opensearch/recovery/FullRollingRestartIT.java (3 hunks)
  • server/src/internalClusterTest/java/org/opensearch/search/simple/SimpleSearchIT.java (1 hunks)
  • server/src/internalClusterTest/java/org/opensearch/update/UpdateIT.java (1 hunks)
  • server/src/main/java/org/opensearch/index/mapper/CompositeFieldValueFetcher.java (1 hunks)
  • server/src/main/java/org/opensearch/index/mapper/FieldValueFetcher.java (1 hunks)
  • server/src/main/java/org/opensearch/index/mapper/KeywordFieldMapper.java (6 hunks)
  • server/src/main/java/org/opensearch/index/mapper/MatchOnlyTextFieldMapper.java (1 hunks)
  • server/src/main/java/org/opensearch/index/mapper/StringFieldType.java (2 hunks)
  • server/src/main/java/org/opensearch/index/mapper/TextFieldMapper.java (8 hunks)
  • server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java (4 hunks)
  • server/src/test/java/org/opensearch/index/mapper/CompositeFieldValueFetcherTests.java (1 hunks)
  • server/src/test/java/org/opensearch/index/mapper/KeywordFieldMapperTests.java (5 hunks)
  • server/src/test/java/org/opensearch/index/mapper/ObjectMapperTests.java (6 hunks)
  • server/src/test/java/org/opensearch/index/mapper/TextFieldMapperTests.java (3 hunks)
  • server/src/test/java/org/opensearch/index/mapper/WildcardFieldMapperTests.java (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java (3)
server/src/main/java/org/opensearch/common/lucene/Lucene.java (1)
  • Lucene (114-992)
server/src/main/java/org/opensearch/index/mapper/TextFieldMapper.java (1)
  • DerivedSourceHelper (1321-1366)
server/src/main/java/org/opensearch/index/mapper/KeywordFieldMapper.java (1)
  • DerivedSourceHelper (960-1005)
server/src/test/java/org/opensearch/index/mapper/KeywordFieldMapperTests.java (2)
server/src/main/java/org/opensearch/indices/IndicesModule.java (1)
  • IndicesModule (108-315)
server/src/main/java/org/opensearch/script/ScriptModule.java (1)
  • ScriptModule (54-127)
server/src/main/java/org/opensearch/index/mapper/KeywordFieldMapper.java (3)
server/src/main/java/org/opensearch/common/lucene/Lucene.java (1)
  • Lucene (114-992)
server/src/main/java/org/opensearch/index/mapper/TextFieldMapper.java (1)
  • DerivedSourceHelper (1321-1366)
server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java (1)
  • DerivedSourceHelper (963-1015)
server/src/test/java/org/opensearch/index/mapper/WildcardFieldMapperTests.java (2)
server/src/main/java/org/opensearch/indices/IndicesModule.java (1)
  • IndicesModule (108-315)
server/src/main/java/org/opensearch/script/ScriptModule.java (1)
  • ScriptModule (54-127)
🔇 Additional comments (27)
server/src/internalClusterTest/java/org/opensearch/update/UpdateIT.java (1)

934-936: Verify text field behavior and mapping completeness for derived source optimization.

This test's text_field mapping (lines 934-936) removed explicit storage after this change but lacks a sub-keyword field, which the PR objectives indicate should provide doc_values for derived source. The test assertions (lines 969, 983, 1000, 1015, 1039, 1049) continue to expect text_field values to be retrievable.

Clarify:

  1. Whether the implementation provides a fallback mechanism for text fields without sub-keyword fields in derived source mode
  2. Whether this test should include a sub-keyword field to properly validate the optimization (e.g., "fields": {"keyword": {"type": "keyword"}})
server/src/test/java/org/opensearch/index/mapper/KeywordFieldMapperTests.java (1)

558-579: Good test coverage for derived source with ignore_above.

The test properly validates that values exceeding ignore_above are stored in the derived source ignore field and can be correctly retrieved during source derivation.

server/src/main/java/org/opensearch/index/mapper/WildcardFieldMapper.java (2)

216-222: Approve derived source storage for ignored values.

The logic correctly stores values exceeding ignore_above in a special derived source ignore field when the normalizer condition is satisfied and derived source is enabled. This enables proper source reconstruction.


924-929: Good relaxation of canDeriveSourceInternal guards.

The updated condition now permits deriving source when the normalizer is null or is the keyword analyzer, which aligns with the PR objectives. The early return on line 926-927 appropriately throws for unsupported normalizers.

server/src/test/java/org/opensearch/index/mapper/CompositeFieldValueFetcherTests.java (1)

27-132: Good test coverage for CompositeFieldValueFetcher.

The tests comprehensively cover the key behaviors:

  • Priority-based fetching (first non-empty wins)
  • Early termination when values are found
  • Null/empty handling from all fetchers
  • Empty fetcher list edge case
  • IOException propagation

The mock setup with conversion validation ("CONVERTED:<value>") effectively verifies the conversion flow through inner fetchers.

server/src/test/java/org/opensearch/index/mapper/WildcardFieldMapperTests.java (1)

382-436: Good test coverage for wildcard field derived source scenarios.

The tests properly validate:

  • Values exceeding ignore_above are stored in the derived source ignore field
  • Values below threshold produce normal fields without the ignore field
  • Derived source can be reconstructed from the ignore field
server/src/main/java/org/opensearch/index/mapper/CompositeFieldValueFetcher.java (1)

32-48: Verify callers handle null return value.

The fetch() method returns null when no fetcher yields values (line 47). Ensure that all callers of this method handle the null case appropriately to avoid NullPointerException.

server/src/test/java/org/opensearch/index/mapper/TextFieldMapperTests.java (5)

1093-1117: LGTM! Good test coverage for stored field derivation.

The test properly verifies that when store: true is set, the stored field is correctly retrieved via deriveSource(). The test setup, document parsing, and assertions are well-structured.


1119-1131: LGTM! Validates keyword sub-field detection for derived source.

Test correctly verifies that a text field with a keyword sub-field sets hasDerivedSourceSupportedKeyword to true and defaults keywordIgnoredLengthForDerivedSource to Integer.MAX_VALUE when no ignore_above is specified.


1147-1161: LGTM! Tests max ignore_above selection across multiple keywords.

Correctly verifies that when multiple keyword sub-fields exist with different ignore_above values, the maximum value (20) is selected, which aligns with the PR objective to minimize duplication.


1163-1175: LGTM! Tests normalized keyword exclusion.

Correctly verifies that a keyword sub-field with a normalizer is not considered a valid derived-source-supported keyword, which is necessary since normalization would alter the original value.


1177-1195: LGTM! Validates ignore field storage for long text exceeding ignore_above.

Test correctly verifies that when text length exceeds the keyword's ignore_above threshold, a stored field is added to the special derived-source ignore field for fallback retrieval.

server/src/main/java/org/opensearch/index/mapper/MatchOnlyTextFieldMapper.java (1)

140-154: LGTM! Consistent wiring of MultiFields.

The refactoring correctly builds MultiFields once and passes it through to both buildFieldType and the constructor, avoiding redundant builds and maintaining consistency with TextFieldMapper.

server/src/main/java/org/opensearch/index/mapper/TextFieldMapper.java (4)

399-428: LGTM! Proper integration of derived source keyword detection during field type build.

The logic correctly:

  1. Checks if derived source is enabled via index settings
  2. Detects if any keyword sub-field supports derived source
  3. Captures the maximum ignore_above value for threshold comparison

1079-1087: Potential performance concern: stored field added on every parse when conditions met.

The condition fieldType().getKeywordIgnoredLengthForDerivedSource() < value.length() correctly identifies values exceeding the threshold. However, consider caching fieldType().getHasDerivedSourceSupportedKeyword() and fieldType().getKeywordIgnoredLengthForDerivedSource() as local variables if this path is hot, though the current approach is readable and correct.


1321-1366: LGTM! Well-structured helper class for derived source keyword detection.

The DerivedSourceHelper inner class cleanly encapsulates:

  1. Detection of supported keyword sub-fields (no normalizer, has stored or doc_values)
  2. Finding maximum ignore_above across keywords
  3. Creating value fetchers for supported keywords

The logic correctly excludes normalized keywords since normalization would alter the original text value.


1313-1318: Verify FieldValueType.DOC_VALUES is appropriate here.

The getDerivedFieldPreference() returns DOC_VALUES, but the composite fetcher includes stored fields (from the ignore field and potentially from keyword stored fields). Ensure this preference correctly reflects the primary source for value retrieval, as the choice between doc_values and stored fields can impact performance and behavior.

server/src/internalClusterTest/java/org/opensearch/recovery/FullRollingRestartIT.java (1)

419-420: Removing store: true from text_field mappings is consistent with the new derived-source model

These derived-source rolling-restart tests only validate _source contents via getSourceAsMap and do not rely on text_field being a stored field. Dropping "store": true on text_field in all three mappings reduces redundant stored data and matches the direction of deriving text values from other field structures rather than from a stored text field itself.

Also applies to: 461-462, 546-547

server/src/internalClusterTest/java/org/opensearch/search/simple/SimpleSearchIT.java (1)

780-784: Derived-source mapping cleanup for text_field and wildcard_field looks sound

  • Dropping "store": true from text_field avoids redundant stored payloads now that derived source no longer depends on a stored text field here.
  • Relying on the wildcard field’s default doc_values instead of forcing "doc_values": true keeps the test aligned with the mapper’s defaults while preserving the intended derived-source behavior.
server/src/internalClusterTest/java/org/opensearch/get/GetActionIT.java (1)

827-833: Good targeted coverage of ignore_above via text-field keyword sub-field in derived source

The updated mapping, document value, and validateDeriveSource expectations together exercise the new behavior where:

  • text_field is not stored itself but has a keyword_field sub-field with ignore_above: 20.
  • The indexed text exceeds that threshold, ensuring the derived-source pipeline must handle ignored keyword values correctly while still reconstructing the full text_field string.
  • Re-checking before/after refresh and flush verifies the behavior across translog and on-disk retrieval.

This is a solid regression/behavioral test for the new derived-source strategy.

Also applies to: 856-856, 1087-1097

CHANGELOG.md (1)

77-77: Changelog entry accurately reflects the derived-source enhancements

The new bullet concisely documents the ignore_above support for keyword/wildcard and the text-field optimization under derived source, correctly linked to this PR.

server/src/main/java/org/opensearch/index/mapper/FieldValueFetcher.java (1)

53-56: Null-safe guard in FieldValueFetcher.write is appropriate

Changing the early return to values == null || values.isEmpty() makes the writer robust to fetchers that may return null instead of an empty list, which is particularly relevant now that composite fetchers are in play. No behavior change for existing non-null callers.

server/src/main/java/org/opensearch/index/mapper/StringFieldType.java (1)

69-69: Centralized ignore-value field naming for derived source looks good

Defining IGNORED_VALUE_FIELD_SUFFIX and exposing derivedSourceIgnoreFieldName() on StringFieldType standardizes the internal ignore-field naming and is reused by the derived-source helpers. This keeps the pattern consistent across keyword/wildcard (and any other string-like) mappers.

Also applies to: 260-262

server/src/test/java/org/opensearch/index/mapper/ObjectMapperTests.java (1)

695-825: Derived-source mapper validation tests now align well with the new rules

The updated testDeriveSourceMapperValidation covers the key constraints:

  • Nested fields and copy_to correctly rejected under index.derived_source.enabled = true.
  • Mappings with both doc_values and store disabled fail fast.
  • A broad multi-type mapping (including text and wildcard without extra overrides) validates the supported set.
  • Unsupported types like geo_shape are explicitly asserted to fail.

This gives good regression coverage for the refined derived-source eligibility logic.

server/src/main/java/org/opensearch/index/mapper/KeywordFieldMapper.java (3)

274-279: Derived-source eligibility and composite fetcher wiring for keyword fields are coherent

  • canDeriveSourceInternal now rejects fields with a non-default normalizer (anything other than null/KEYWORD_ANALYZER), which matches the requirement that we only derive source when we can faithfully reconstruct the original string.
  • derivedFieldGenerator() builds a CompositeFieldValueFetcher that:
    • Uses doc values when available, falling back to stored fields otherwise.
    • Adds a second fetcher targeting the ignore-value field, so values dropped by ignore_above remain reachable for derived source.
    • Declares a DOC_VALUES preference, which is consistent with the overall design.

The split between eligibility checks and the composite fetcher keeps the responsibilities clear.

Also applies to: 295-312


884-899: parseCreateField correctly persists ignored values for derived source while preserving ignore_above semantics

The updated parseCreateField logic:

  • Early-returns on value == null, avoiding unnecessary work.

  • For value.length() > ignoreAbove, still skips indexing (maintaining the ignore_above contract) but, when:

    • the normalizer is null or KEYWORD_ANALYZER,
    • derived source is enabled on the index, and
    • we’re not inside a multi-field context,

    it writes the raw value to a dedicated stored field named via fieldType().derivedSourceIgnoreFieldName().

This ensures that:

  • Queries and aggregations still ignore over-long values as before.
  • Derived source can reconstruct those values from the dedicated stored field on eligible top-level keyword fields without introducing behavior changes for other cases.

Also applies to: 904-917


960-1005: DerivedSourceHelper cleanly encapsulates primary and fallback fetchers for keyword-derived source

The new DerivedSourceHelper:

  • Produces the primary FieldValueFetcher using doc values when available or stored fields otherwise, and allows the caller to choose the logical “simple name” (e.g., mapping a keyword sub-field back to its parent text field).
  • Builds a dedicated MappedFieldType for the ignore-value field (using derivedSourceIgnoreFieldName()), with:
    • stored = true,
    • no search/query capabilities,
    • valueForDisplay decoding BytesRef to String,
      and wraps it in a StoredFieldFetcher.

This abstraction is shared both by keyword’s own derived-field generation and by text-field helpers that derive text source from keyword sub-fields, keeping the ignore-field wiring consistent.

Settings.EMPTY,
getPlugins().stream().filter(p -> p instanceof ScriptPlugin).map(p -> (ScriptPlugin) p).collect(toList())
);
ScriptService scriptService = new ScriptService(getIndexSettings(), scriptModule.engines, scriptModule.contexts);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Potential mismatch in ScriptService initialization.

getIndexSettings() returns settings from the base MapperTestCase, which may differ from the indexSettings variable being constructed with derived source enabled. This could lead to inconsistent behavior in tests.

Consider using the locally constructed indexSettings instead:

-        ScriptService scriptService = new ScriptService(getIndexSettings(), scriptModule.engines, scriptModule.contexts);
+        ScriptService scriptService = new ScriptService(indexSettings.getSettings(), scriptModule.engines, scriptModule.contexts);

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In server/src/test/java/org/opensearch/index/mapper/KeywordFieldMapperTests.java
around line 632, the ScriptService is being constructed with getIndexSettings()
(from the base test) which can differ from the locally built indexSettings that
has derived source enabled; replace the call to getIndexSettings() with the
locally constructed indexSettings variable so the ScriptService uses the same
settings as the test, i.e., pass indexSettings to the ScriptService constructor
and ensure the indexSettings variable is in scope at that line.

return new Builder(simpleName(), indexAnalyzers).init(this);
}

static final class DerivedSourceHelper {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this separate namespace class

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this from all three mappers.


// Explicitly add value as a stored field if value is getting ignored, to be able to derive the source
if (value.length() > ignoreAbove) {
if ((normalizer == null || Lucene.KEYWORD_ANALYZER.equals(normalizer))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this not work if splitting queries on whitespace is set as well? Should we just check for default?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normalizer is not being controlled by splitting queries on whitespace mapping parameter, if normalizer is not configured explicitly then default keyword(doesn't change anything on the string) one is being used.

}

public String derivedSourceIgnoreFieldName() {
return name() + IGNORED_VALUE_FIELD_SUFFIX;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can cause issues for existing mappings where users may have created field with this name? Can we only support this feature for index created above the new version?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we should block users from creating field with these naming conventions for indices created beyond this version to avoid naming conflicts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added version check as well as, added mapping validation in DocumentMapper to not allow fields matching with internal field name .ingored_value.


@Override
protected MatchOnlyTextFieldType buildFieldType(FieldType fieldType, BuilderContext context) {
protected MatchOnlyTextFieldType buildFieldType(FieldType fieldType, MultiFields multiFields, BuilderContext context) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this break any place where this method is used in plugins?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validated only MatchOnlyTextFieldMapper is extending TextFieldMapper across packages.


@Override
public ValueFetcher valueFetcher(QueryShardContext context, SearchLookup searchLookup, String format) {
return null;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not expected to be called ever? Can we throw UnsupportedOperationException instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

this,
simpleName()
);
final FieldValueFetcher fallbackFieldValueFetcher = KeywordFieldMapper.DerivedSourceHelper.getFallbackFieldValueFetcher(this);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we always need a fallbackValueFetcher? If normalizer/ignore_above is not set, would it only need primary?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated it to not always consider fallbackValueFetcher and only for normalizer/ignore_above case.

@opensearch-trigger-bot
Copy link
Copy Markdown
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Jan 9, 2026
@tanik98 tanik98 force-pushed the derived-source-text-field branch from 88ac318 to 68ae5c7 Compare February 18, 2026 10:20
@tanik98 tanik98 changed the title Support ignore_above for keyword/wildcard field and optimise text field under derived source [Derived Source] Support ignore_above/normalizer for keyword/wildcard field and optimise text field lookup Feb 18, 2026
…nore_above for keyword field

Signed-off-by: Tanik Pansuriya <panbhai@amazon.com>
Signed-off-by: Tanik Pansuriya <panbhai@amazon.com>
…eld mappers

Signed-off-by: Tanik Pansuriya <panbhai@amazon.com>
Signed-off-by: Tanik Pansuriya <panbhai@amazon.com>
Signed-off-by: Tanik Pansuriya <panbhai@amazon.com>
@tanik98 tanik98 force-pushed the derived-source-text-field branch from 68ae5c7 to 659de98 Compare February 18, 2026 11:05
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 659de98: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants