[Feature Request] Allow custom document parsing by skipping OS default mapping framework #19550

@varunbharadwaj

Is your feature request related to a problem? Please describe

Today, an indexing request is parsed in IndexShard to produce a ParsedDocument instance that contains the Lucene field instances along with the _source. This is then handed off to the Engine for indexing. Custom plugins such as the tsdb plugin (#19461) do not need the OpenSearch indexing flow to parse the document. The parsing can instead be handled within the plugin (MetricsEngine) for better throughput and CPU utilization.

Benchmarks show ~18% improvement in ingestion throughput (wps/core) and in P95 CPU utilization when the default parsing is skipped in favor of custom parsing logic tailored to the metrics use case. Under extreme load, the improvement rises to 57% in wps/core and 25% in P95 core utilization.

This feature request is to support such custom use cases by allowing them to bypass the default OpenSearch mapping/parsing framework.

Describe the solution you'd like

Introduce a new setting, skip_default_document_parsing, that skips the default index mapping and document parsing in the indexing flow. When it is enabled, IndexShard would instead do the following:

    // Build the ParsedDocument from the raw request only: id, _source and media type are set,
    // every other argument is null because no mapping-driven parsing has run.
    operation = new Engine.Index(
        new Term(IdFieldMapper.NAME, Uid.encodeId(sourceToParse.id())),
        new ParsedDocument(null, null, sourceToParse.id(), null, null, sourceToParse.source(), sourceToParse.getMediaType(), null),
        seqNo,
        opPrimaryTerm,
        version,
        versionType,
        origin,
        System.nanoTime(),
        autoGeneratedTimeStamp,
        isRetry,
        ifSeqNo,
        ifPrimaryTerm
    );

    return index(engine, operation);

This creates a ParsedDocument that contains only the _source and the other required information (the document id and media type). Custom engine implementations can then handle document parsing themselves by reading the _source.
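To make the proposed split of responsibilities concrete, here is a minimal, self-contained sketch. It does not use the OpenSearch classes: Source, Doc, Sample, prepare and parseInEngine are hypothetical stand-ins, and the "<metric> <value> <epochSeconds>" payload is an assumed format chosen only to keep the example short. The shard-side step passes the raw _source through when the flag is set, and the engine-side step parses it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the OpenSearch classes.
public class SkipParsingSketch {

    /** Raw indexing request: document id plus unparsed _source bytes. */
    static final class Source {
        final String id;
        final byte[] bytes;
        Source(String id, byte[] bytes) { this.id = id; this.bytes = bytes; }
    }

    /** Simplified parsed document; fields is null when default parsing was skipped. */
    static final class Doc {
        final String id;
        final byte[] source;
        final List<String> fields;
        Doc(String id, byte[] source, List<String> fields) {
            this.id = id; this.source = source; this.fields = fields;
        }
    }

    /** One decoded metric sample (assumed shape, not taken from the tsdb plugin). */
    static final class Sample {
        final String metric;
        final double value;
        final long timestamp;
        Sample(String metric, double value, long timestamp) {
            this.metric = metric; this.value = value; this.timestamp = timestamp;
        }
    }

    /** Shard-side step: run a stand-in default parser, or pass _source through untouched. */
    static Doc prepare(Source s, boolean skipDefaultDocumentParsing) {
        if (skipDefaultDocumentParsing) {
            return new Doc(s.id, s.bytes, null); // only id and _source are carried forward
        }
        // Stand-in for mapping-driven parsing: one "field" per whitespace-separated token.
        List<String> fields = new ArrayList<>();
        for (String token : new String(s.bytes, StandardCharsets.UTF_8).trim().split("\\s+")) {
            fields.add(token);
        }
        return new Doc(s.id, s.bytes, fields);
    }

    /** Engine-side step: the custom engine decodes _source itself. */
    static Sample parseInEngine(Doc doc) {
        String[] parts = new String(doc.source, StandardCharsets.UTF_8).trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected '<metric> <value> <epochSeconds>'");
        }
        return new Sample(parts[0], Double.parseDouble(parts[1]), Long.parseLong(parts[2]));
    }
}
```

With the flag enabled, prepare(...) returns a document whose fields are null, and parseInEngine(...) recovers the sample directly from the raw bytes. A real engine would presumably parse whatever media type the request carries rather than this assumed text format.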

Related component

Indexing

Describe alternatives you've considered

No response

Additional context

No response

Labels: Indexing, enhancement
