[RFC] Automatic model training during indexing #3225
Overview
This RFC proposes automatic model training during Lucene's natural merge process for k-NN encoders that require a training step. This provides an alternative to the current 6-step manual training workflow. With automatic training, index creation requires only a single PUT /index followed by bulk indexing, matching the workflow of training-free encoders.
This RFC was motivated by LeanVec, a compression technique introduced in the companion RFC "[RFC] Add Intel SVS (Scalable Vector Search) as a new Faiss index type". LeanVec requires a training step to learn its dimensionality-reduction transform. The core mechanism (threshold-triggered training during merge, model persistence as segment files, fallback encoding) is designed to be encoder-agnostic and could extend to other training-required configurations like PQ in the future.
Problem Statement
Training-required configurations like IVF (method) and PQ (encoder) use a 6-step manual workflow. Without automatic training, LeanVec would require the same steps:
1. Create a temporary "training index"
2. Bulk-index representative vectors (e.g., 100K-1M vectors)
3. POST /_plugins/_knn/models/{model_id}/_train -- user must call this
4. Poll GET /_plugins/_knn/models/{model_id} -- user must wait
5. Create the real index with model_id reference
6. Bulk-index ALL vectors into the real index
Every training-free configuration works with a single PUT /index followed by bulk indexing.
Requirements
Functional
- FR-1: Training triggers automatically during Lucene merges when configurable vector count thresholds are reached (one for an initial model, one for the final model)
- FR-2: Search works immediately from the start, using a fallback encoding during the pre-training phase
- FR-3: Trained models persist across node restart, shard relocation, and snapshot/restore
- FR-4: Each shard trains independently. No cross-shard coordination required
- FR-5: Indexing and search remain available during training
- FR-6: Backward compatible. The existing model_id/_train API workflow must remain functional for LeanVec. Automatic training activates when LeanVec is selected without a model_id. If both are specified, model_id takes precedence
- FR-7: Stats API exposes training progress and status
- FR-8: Model quality level persists across restart. A shard in steady state (FINAL model) must not retrain after recovery
Non-functional
- NFR-1: Training overhead is bounded: at most two successful training events per field (initial and final), with a circuit breaker suppressing retries on persistent errors
Proposed Solution
Four-Phase Lifecycle
During indexing, Lucene periodically writes buffered documents to new segments on disk (flush), then combines small segments into larger ones (merge). Automatic training hooks into this process:
- Flush writes buffered vectors to a new segment. The flush path never triggers training. It uses the current model if one exists, or falls back to LVQ.
- Merge combines segments into larger ones. This is where training is triggered, because a merge naturally accumulates enough vectors to produce a useful model.
The lifecycle is controlled by two user-configurable parameters on the encoder:
- initial_training_threshold: number of live vectors in the merge that triggers the first (rough) model. Default: 10,000. Minimum: 1,000.
- training_threshold: number of live vectors in the merge that triggers the final (production) model. Default: 100,000. Minimum: 1,000.
These thresholds compare against vectors in the specific merge operation, not the cumulative shard count.
If initial_training_threshold exceeds training_threshold, the values are swapped automatically.
Phase A: PRE-TRAINING
User indexes vectors normally.
Segments are built with a fallback encoding (LVQ for LeanVec).
Search works immediately against fallback segments.
|
| merge with >= initial_training_threshold vectors
v
Phase B: INITIAL TRAINING (one-time, per shard)
Triggered during a natural Lucene merge.
Trains a rough model from vectors in the merge.
Stores model as a Lucene segment file.
All subsequent segments use the initial model.
Indexing and search remain available during training.
|
| merge with >= training_threshold vectors
v
Phase C: FINAL TRAINING (one-time, per shard)
Retrains from a larger merge for production-quality model.
Replaces initial model in cache. New segments use the final model.
Old segments are replaced with final encoding as Lucene naturally merges them.
|
v
Phase D: STEADY STATE
All future segments use the final trained model.
Model persists with shard data (survives restart, relocation, snapshot/restore).
Setting both thresholds equal skips Phase B and produces single-threshold behavior. The two-threshold approach reduces time spent in fallback encoding: the initial threshold is reached sooner, so segments start using the trained encoding earlier.
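The threshold rules above can be sketched as a small decision function. This is an illustrative Python sketch, not the plugin's actual (Java) code; the names `resolve_thresholds` and `training_action` are invented for this example.

```python
# Illustrative sketch of the two-threshold trigger rules (not actual plugin code).
MINIMUM = 1_000  # minimum allowed value for either threshold

def resolve_thresholds(initial: int, final: int) -> tuple[int, int]:
    """Validate the thresholds; if initial exceeds final, swap them automatically."""
    if initial < MINIMUM or final < MINIMUM:
        raise ValueError(f"thresholds must be >= {MINIMUM}")
    return (final, initial) if initial > final else (initial, final)

def training_action(live_vectors_in_merge: int, quality: str,
                    initial: int, final: int) -> str:
    """Decide what a merge does, given the current model quality (NONE/INITIAL/FINAL)."""
    initial, final = resolve_thresholds(initial, final)
    if quality == "NONE" and live_vectors_in_merge >= final:
        return "TRAIN_FINAL"      # equal thresholds skip Phase B entirely
    if quality == "NONE" and live_vectors_in_merge >= initial:
        return "TRAIN_INITIAL"
    if quality == "INITIAL" and live_vectors_in_merge >= final:
        return "TRAIN_FINAL"
    return "USE_EXISTING_OR_FALLBACK"
```

Note that the decision looks only at the vectors in the current merge, never at the cumulative shard count, matching the semantics described above.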
How flush and merge use the model:
- Flush: checks the cache. If a model exists, encode as LeanVec. Otherwise, encode as LVQ (fallback).
- Merge: checks the cache. If a model exists, use it (may upgrade INITIAL to FINAL if the merge qualifies). If no model exists and the merge is large enough, train one. Otherwise, encode as LVQ (fallback).
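The flush/merge decision table can be condensed into one function. A minimal sketch, assuming a hypothetical `choose_encoding` helper (the real logic lives in the plugin's Java writers):

```python
# Illustrative sketch of encoder selection on the flush and merge paths.
def choose_encoding(operation: str, cached_model, merge_size: int = 0,
                    threshold: int = 10_000) -> str:
    if cached_model is not None:
        return "LEANVEC"                 # a trained model exists: use it
    if operation == "merge" and merge_size >= threshold:
        return "TRAIN_THEN_LEANVEC"      # only a large-enough merge may train
    return "LVQ"                         # fallback encoding (flush never trains)
```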
Implementation Details
Model Storage (.knnlvm segment files)
The existing _train API stores models in a system index (.opensearch-knn-models), referenced by model_id. This works but couples the index to a cluster-wide system index and requires explicit model management.
Automatic training stores models as Lucene segment files (.knnlvm), alongside the existing .faiss and .osknnqstate files that the k-NN plugin already writes per segment. The file format uses Lucene's standard CodecUtil header, footer, and CRC32 verification — the same pattern the plugin uses for quantization state persistence. This removes the dependency on the cluster-wide system index: the model travels with the shard and is automatically replicated, snapshotted, and recovered by Lucene's existing segment management.
Each segment built with a trained model contains a .knnlvm file alongside its .faiss file.
Model Blob
When training runs, the Faiss JNI layer returns an opaque byte array containing the trained model state. The plugin treats this blob as an opaque payload and does not inspect or modify the contents. The blob is everything Faiss needs to build LeanVec-encoded segments going forward.
The blob flows through the system in five stages:
1. TRAIN: merge triggers JNIService.trainIndex() --> byte[] blob returned
2. CACHE: blob + quality stored in per-shard in-memory cache
3. PERSIST: blob written to .knnlvm segment file (survives restart)
4. USE: next flush/merge reads blob from cache, passes to JNI for index building
5. RECOVER: on shard restart, blob loaded from .knnlvm back into cache
Per-Shard In-Memory Cache
A new per-shard in-memory cache holds the trained model so that flush and merge can use it without reading from disk each time. The cache tracks model quality (NONE → INITIAL → FINAL) and only allows upgrades, never downgrades.
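The upgrade-only rule can be sketched as a tiny quality-monotonic cache. Illustrative Python only; class and method names are invented for this example.

```python
from threading import Lock

QUALITY_RANK = {"NONE": 0, "INITIAL": 1, "FINAL": 2}

class ModelCache:
    """Per-shard cache sketch: model quality only upgrades, never downgrades."""
    def __init__(self):
        self._lock = Lock()
        self.blob = None
        self.quality = "NONE"

    def offer(self, blob: bytes, quality: str) -> bool:
        """Accept the blob only if it strictly improves on the cached quality."""
        with self._lock:
            if QUALITY_RANK[quality] <= QUALITY_RANK[self.quality]:
                return False  # reject downgrades and same-level replacements
            self.blob, self.quality = blob, quality
            return True
```

The monotonicity matters during concurrent merges and recovery: a late-arriving INITIAL model can never overwrite a FINAL one.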
Concurrency and Failure Handling
If two merges both qualify to train, only one proceeds. The other uses the existing model (if available) or falls back to LVQ. The merge that triggers training waits for it to complete; other concurrent merges do not block.
If training fails, a circuit breaker suppresses retries and the shard continues with the existing model or LVQ fallback.
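The single-trainer rule and the circuit breaker can be sketched together: a non-blocking permit lets exactly one merge train, and a failure counter stops retries after persistent errors. Hedged sketch; the threshold of three failures is an arbitrary illustration, not the plugin's actual policy.

```python
import threading

class TrainingGate:
    """Sketch: one merge trains at a time; persistent failures trip a breaker."""
    def __init__(self, max_failures: int = 3):
        self._permit = threading.Semaphore(1)
        self._failures = 0
        self._max_failures = max_failures

    def try_train(self, train_fn):
        """Return a trained model, or None if skipped (busy, broken, or failed)."""
        if self._failures >= self._max_failures:
            return None                        # breaker open: stop retrying
        if not self._permit.acquire(blocking=False):
            return None                        # another merge is already training
        try:
            model = train_fn()
            self._failures = 0                 # success resets the breaker
            return model
        except Exception:
            self._failures += 1
            return None
        finally:
            self._permit.release()
```

A merge that gets None simply proceeds with the existing model or the LVQ fallback, so indexing never stalls on training.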
Shard Recovery
On node restart, shard relocation, or snapshot restore, the k-NN plugin scans committed segments for .knnlvm files and restores the highest-quality model into the cache. If recovery fails, the shard starts normally and falls back to LVQ until the next qualifying merge trains a new model.
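The recovery scan reduces to picking the best-quality model among the committed segments. A minimal sketch, assuming the segment scan yields (quality, blob) pairs:

```python
# Illustrative recovery sketch: pick the highest-quality model found on disk.
RANK = {"INITIAL": 1, "FINAL": 2}

def recover_highest_quality(segment_models):
    """segment_models: (quality, blob) pairs from .knnlvm files; None if none found."""
    best = None
    for quality, blob in segment_models:
        if best is None or RANK[quality] > RANK[best[0]]:
            best = (quality, blob)
    return best
```

Returning None corresponds to the failure path described above: the shard starts with the LVQ fallback until the next qualifying merge trains a new model.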
Fallback Encoding
When no model is available, the index writer swaps the LeanVec component in the Faiss index factory string to LVQ:
SVSVamana64,LeanVec4x8_192 → SVSVamana64,LVQ4x8
LVQ and LeanVec segments coexist during search, and old LVQ segments are replaced with LeanVec segments as Lucene naturally merges them.
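The component swap shown above can be sketched as a string rewrite. This follows the example's string shapes only; the real mapping is performed inside the plugin, and the parsing here is a hypothetical simplification.

```python
# Illustrative fallback rewrite: LeanVec4x8_192 -> LVQ4x8 (keep the bit widths,
# drop the target dimensions, which only apply to LeanVec).
def fallback_factory_string(factory: str) -> str:
    out = []
    for component in factory.split(","):
        if component.startswith("LeanVec"):
            bits = component[len("LeanVec"):].split("_")[0]  # e.g. "4x8"
            out.append("LVQ" + bits)
        else:
            out.append(component)
    return ",".join(out)
```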
User Experience
With automatic training, the user creates the index and starts bulk indexing. Training happens during merge.
PUT /my-index
{
"settings": { "index": { "knn": true } },
"mappings": {
"properties": {
"embedding": {
"type": "knn_vector",
"dimension": 768,
"method": {
"name": "svs_vamana",
"engine": "faiss",
"space_type": "l2",
"parameters": {
"degree": 64,
"encoder": {
"name": "leanvec",
"parameters": {
"primary_bits": 4,
"residual_bits": 8,
"dimensions": 192,
"training_threshold": 100000,
"initial_training_threshold": 10000
}
}
}
}
}
}
}
}

The training_threshold and initial_training_threshold parameters enable automatic training. The model_id and _train API are not needed.
Related RFCs
- [RFC] Add Intel SVS (Scalable Vector Search) as a new Faiss index type: Adds the svs_vamana method and its encoders (LVQ, LeanVec, flat, sq). Without automatic training, LeanVec requires the existing model_id/_train workflow.