[Feature Request] Make batch ingestion automatic, not a parameter on _bulk #14283

@andrross

Description

Is your feature request related to a problem? Please describe

A new batch method was added to the o.o.ingest.Processor interface in #12457 that allows ingest processors to operate on multiple documents simultaneously, instead of one by one. For certain processors, this allows for much faster and more efficient processing. However, a new batch_size parameter was also added to the _bulk API with a default value of 1. This means that in order to benefit from batch processing in any of my ingest processors, I have to do at minimum two things: determine how many documents to include in my _bulk request, and determine the optimal value for this batch_size parameter. I must also change all my ingestion tooling to support and specify this new batch_size parameter.
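For illustration, opting into batching today looks something like the request below; the index name, document contents, and batch_size value are made up for the example:

```
POST /_bulk?batch_size=5
{ "index": { "_index": "my-index", "_id": "1" } }
{ "text": "first document" }
{ "index": { "_index": "my-index", "_id": "2" } }
{ "text": "second document" }
```

The point of this issue is that the `?batch_size=5` part should not be the caller's responsibility.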

Describe the solution you'd like

I want the developers of my ingestion processors to determine good defaults for how they want to handle batches of documents, and then I can see increased performance with no change to my ingestion tooling by simply updating to the latest software. I acknowledge I may still have to experiment with finding the optimal number of documents to include in each _bulk request (this is the status quo for this API and not specific to ingest processors). Also, certain ingest processors may define expert-level configuration options to further optimize if necessary, but I expect the defaults to work well most of the time and to almost always be better than the performance I saw before batching was implemented.

Related component

Indexing

Additional context

I believe this can be implemented as follows:

  • [required] Increase the default value of batch_size from 1 to Integer.MAX_VALUE. This means that by default the entire content of my bulk request will be passed to each ingest processor. However, the default implementation of batchExecute just operates on one document at a time, so unless my ingest processor is updated to leverage the new batchExecute method, I will see exactly the same behavior that existed previously.
  • [optional] Emit a deprecation warning if batch_size is specified in the _bulk API
  • [optional] Remove the functionality of the batch_size parameter in the _bulk API on main
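To make the first point concrete, the fallback relationship between execute and batchExecute can be sketched with a simplified, self-contained stand-in for the o.o.ingest.Processor interface. The real interface works with IngestDocumentWrapper objects and asynchronous handlers; the types and signatures here are illustrative assumptions, not the actual OpenSearch API:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchExecuteSketch {
    // Simplified stand-in for o.o.ingest.Processor (illustrative only).
    interface Processor {
        String execute(String doc);

        // Default batch path: falls back to per-document execute, so a
        // processor that never overrides this behaves exactly as it did
        // before batching existed, even if the whole bulk request is
        // passed in as one batch.
        default List<String> batchExecute(List<String> docs) {
            List<String> results = new ArrayList<>();
            for (String doc : docs) {
                results.add(execute(doc));
            }
            return results;
        }
    }

    // A processor that overrides batchExecute to handle the whole batch
    // at once (e.g. a single model-inference call for all documents).
    static class UppercaseBatchProcessor implements Processor {
        @Override
        public String execute(String doc) {
            return doc.toUpperCase();
        }

        @Override
        public List<String> batchExecute(List<String> docs) {
            // One batched operation instead of N single operations.
            List<String> results = new ArrayList<>();
            for (String doc : docs) {
                results.add(doc.toUpperCase());
            }
            return results;
        }
    }

    public static void main(String[] args) {
        Processor batched = new UppercaseBatchProcessor();
        List<String> out = batched.batchExecute(List.of("a", "b"));
        if (!out.equals(List.of("A", "B"))) throw new AssertionError(out);

        // A processor that only implements execute still works on batches
        // via the default method, one document at a time.
        Processor single = doc -> doc + "!";
        List<String> out2 = single.batchExecute(List.of("x", "y"));
        if (!out2.equals(List.of("x!", "y!"))) throw new AssertionError(out2);

        System.out.println("ok");
    }
}
```

With this shape, raising the default batch size is safe: processors that never override batchExecute see identical behavior, while processors that do override it get the full batch with no caller-side tuning.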

Additional discussion exists on the original RFC starting here: #12457 (comment)

Metadata

Labels

Indexing (Bulk Indexing and anything related to indexing), enhancement (Enhancement or improvement to existing feature or request)
