[RFC] Automatic model training during indexing #3225
Overview
This RFC proposes automatic model training during Lucene's natural merge process for k-NN encoders that require a training step. This provides an alternative to the current 6-step manual training workflow. With automatic training, index creation requires only a single PUT /index followed by bulk indexing, matching the workflow of training-free encoders.
This RFC was motivated by LeanVec, a compression technique introduced in the companion RFC "[RFC] Add Intel SVS (Scalable Vector Search) as a new Faiss index type". LeanVec requires a training step to learn its dimensionality-reduction transform. The core mechanism (threshold-triggered training during merge, model persistence as segment files, fallback encoding) is designed to be encoder-agnostic and could extend to other training-required configurations like PQ in the future.
Problem Statement
Training-required configurations like IVF (method) and PQ (encoder) use a 6-step manual workflow. Without automatic training, LeanVec would require the same steps:
1. Create a temporary "training index"
2. Bulk-index representative vectors (e.g., 100K-1M vectors)
3. POST /_plugins/_knn/models/{model_id}/_train -- user must call this
4. Poll GET /_plugins/_knn/models/{model_id} -- user must wait
5. Create the real index with model_id reference
6. Bulk-index ALL vectors into the real index
Every training-free configuration works with a single PUT /index followed by bulk indexing.
Requirements
Functional
- FR-1: Training triggers automatically during Lucene merges when configurable vector count thresholds are reached (one for an initial model, one for the final model)
- FR-2: Search works immediately from the start, using a fallback encoding during the pre-training phase
- FR-3: Trained models persist across node restart, shard relocation, and snapshot/restore
- FR-4: Each shard trains independently. No cross-shard coordination required
- FR-5: Indexing and search remain available during training
- FR-6: Backward compatible. The existing model_id/_train API workflow must remain functional for LeanVec. Automatic training activates when LeanVec is selected without a model_id. If both are specified, model_id takes precedence
- FR-7: Stats API exposes training progress and status
- FR-8: Model quality level persists across restart. A shard in steady state (FINAL model) must not retrain after recovery
Non-functional
- NFR-1: Training overhead is bounded: at most two successful training events per field (initial and final), with a circuit breaker suppressing retries on persistent errors
Proposed Solution
Four-Phase Lifecycle
During indexing, Lucene periodically writes buffered documents to new segments on disk (flush), then combines small segments into larger ones (merge). Automatic training hooks into this process:
- Flush writes buffered vectors to a new segment. The flush path never triggers training. It uses the current model if one exists, or falls back to LVQ.
- Merge combines segments into larger ones. This is where training is triggered, because a merge naturally accumulates enough vectors to produce a useful model.
The lifecycle is controlled by two user-configurable parameters on the encoder:
- initial_training_threshold: number of live vectors in the merge that triggers the first (rough) model. Default: 10,000. Minimum: 1,000.
- training_threshold: number of live vectors in the merge that triggers the final (production) model. Default: 100,000. Minimum: 1,000.
These thresholds compare against vectors in the specific merge operation, not the cumulative shard count.
If initial_training_threshold exceeds training_threshold, the values are swapped automatically.
Phase A: PRE-TRAINING
User indexes vectors normally.
Segments are built with a fallback encoding (LVQ for LeanVec).
Search works immediately against fallback segments.
|
| merge with >= initial_training_threshold vectors
v
Phase B: INITIAL TRAINING (one-time, per shard)
Triggered during a natural Lucene merge.
Trains a rough model from vectors in the merge.
Stores model as a Lucene segment file.
All subsequent segments use the initial model.
Indexing and search remain available during training.
|
| merge with >= training_threshold vectors
v
Phase C: FINAL TRAINING (one-time, per shard)
Retrains from a larger merge for production-quality model.
Replaces initial model in cache. New segments use the final model.
Old segments are replaced with final encoding as Lucene naturally merges them.
|
v
Phase D: STEADY STATE
All future segments use the final trained model.
Model persists with shard data (survives restart, relocation, snapshot/restore).
Setting both thresholds equal skips Phase B and produces single-threshold behavior. The two-threshold approach reduces time spent in fallback encoding: the initial threshold is reached sooner, so segments start using the trained encoding earlier.
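The threshold rules above can be sketched as a small decision function. This is an illustrative Python sketch, not the plugin's actual (Java) code; the names `resolve_thresholds` and `training_action` are invented for this example.

```python
# Illustrative sketch of the two-threshold trigger rules (not actual plugin code).
MINIMUM = 1_000  # minimum allowed value for either threshold

def resolve_thresholds(initial: int, final: int) -> tuple[int, int]:
    """Validate the thresholds; if initial exceeds final, swap them automatically."""
    if initial < MINIMUM or final < MINIMUM:
        raise ValueError(f"thresholds must be >= {MINIMUM}")
    return (final, initial) if initial > final else (initial, final)

def training_action(live_vectors_in_merge: int, quality: str,
                    initial: int, final: int) -> str:
    """Decide what a merge does, given the current model quality (NONE/INITIAL/FINAL)."""
    initial, final = resolve_thresholds(initial, final)
    if quality == "NONE" and live_vectors_in_merge >= final:
        return "TRAIN_FINAL"      # equal thresholds skip Phase B entirely
    if quality == "NONE" and live_vectors_in_merge >= initial:
        return "TRAIN_INITIAL"
    if quality == "INITIAL" and live_vectors_in_merge >= final:
        return "TRAIN_FINAL"
    return "USE_EXISTING_OR_FALLBACK"
```

Note that the decision looks only at the vectors in the current merge, never at the cumulative shard count, matching the semantics described above.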
How flush and merge use the model:
- Flush: checks the cache. If a model exists, encode as LeanVec. Otherwise, encode as LVQ (fallback).
- Merge: checks the cache. If a model exists, use it (may upgrade INITIAL to FINAL if the merge qualifies). If no model exists and the merge is large enough, train one. Otherwise, encode as LVQ (fallback).
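The flush/merge decision table can be condensed into one function. A minimal sketch, assuming a hypothetical `choose_encoding` helper (the real logic lives in the plugin's Java writers):

```python
# Illustrative sketch of encoder selection on the flush and merge paths.
def choose_encoding(operation: str, cached_model, merge_size: int = 0,
                    threshold: int = 10_000) -> str:
    if cached_model is not None:
        return "LEANVEC"                 # a trained model exists: use it
    if operation == "merge" and merge_size >= threshold:
        return "TRAIN_THEN_LEANVEC"      # only a large-enough merge may train
    return "LVQ"                         # fallback encoding (flush never trains)
```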
Implementation Details
Model Storage (.knnlvm segment files)
The existing _train API stores models in a system index (.opensearch-knn-models), referenced by model_id. This works but couples the index to a cluster-wide system index and requires explicit model management.
Automatic training stores models as Lucene segment files (.knnlvm), alongside the existing .faiss and .osknnqstate files that the k-NN plugin already writes per segment. The file format uses Lucene's standard CodecUtil header, footer, and CRC32 verification — the same pattern the plugin uses for quantization state persistence. This removes the dependency on the cluster-wide system index: the model travels with the shard and is automatically replicated, snapshotted, and recovered by Lucene's existing segment management.
Each segment built with a trained model contains a .knnlvm file alongside its .faiss file.
Model Blob
When training runs, the Faiss JNI layer returns an opaque byte array containing the trained model state. The plugin treats this blob as an opaque payload and does not inspect or modify the contents. The blob is everything Faiss needs to build LeanVec-encoded segments going forward.
The blob flows through the system in five stages:
1. TRAIN: merge triggers JNIService.trainIndex() --> byte[] blob returned
2. CACHE: blob + quality stored in per-shard in-memory cache
3. PERSIST: blob written to .knnlvm segment file (survives restart)
4. USE: next flush/merge reads blob from cache, passes to JNI for index building
5. RECOVER: on shard restart, blob loaded from .knnlvm back into cache
Per-Shard In-Memory Cache
A new per-shard in-memory cache holds the trained model so that flush and merge can use it without reading from disk each time. The cache tracks model quality (NONE → INITIAL → FINAL) and only allows upgrades, never downgrades.
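The upgrade-only rule can be sketched as a tiny quality-monotonic cache. Illustrative Python only; class and method names are invented for this example.

```python
from threading import Lock

QUALITY_RANK = {"NONE": 0, "INITIAL": 1, "FINAL": 2}

class ModelCache:
    """Per-shard cache sketch: model quality only upgrades, never downgrades."""
    def __init__(self):
        self._lock = Lock()
        self.blob = None
        self.quality = "NONE"

    def offer(self, blob: bytes, quality: str) -> bool:
        """Accept the blob only if it strictly improves on the cached quality."""
        with self._lock:
            if QUALITY_RANK[quality] <= QUALITY_RANK[self.quality]:
                return False  # reject downgrades and same-level replacements
            self.blob, self.quality = blob, quality
            return True
```

The monotonicity matters during concurrent merges and recovery: a late-arriving INITIAL model can never overwrite a FINAL one.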
Concurrency and Failure Handling
If two merges both qualify to train, only one proceeds. The other uses the existing model (if available) or falls back to LVQ. The merge that triggers training waits for it to complete; other concurrent merges do not block.
If training fails, a circuit breaker suppresses retries and the shard continues with the existing model or LVQ fallback.
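The single-trainer rule and the circuit breaker can be sketched together: a non-blocking permit lets exactly one merge train, and a failure counter stops retries after persistent errors. Hedged sketch; the threshold of three failures is an arbitrary illustration, not the plugin's actual policy.

```python
import threading

class TrainingGate:
    """Sketch: one merge trains at a time; persistent failures trip a breaker."""
    def __init__(self, max_failures: int = 3):
        self._permit = threading.Semaphore(1)
        self._failures = 0
        self._max_failures = max_failures

    def try_train(self, train_fn):
        """Return a trained model, or None if skipped (busy, broken, or failed)."""
        if self._failures >= self._max_failures:
            return None                        # breaker open: stop retrying
        if not self._permit.acquire(blocking=False):
            return None                        # another merge is already training
        try:
            model = train_fn()
            self._failures = 0                 # success resets the breaker
            return model
        except Exception:
            self._failures += 1
            return None
        finally:
            self._permit.release()
```

A merge that gets None simply proceeds with the existing model or the LVQ fallback, so indexing never stalls on training.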
Shard Recovery
On node restart, shard relocation, or snapshot restore, the k-NN plugin scans committed segments for .knnlvm files and restores the highest-quality model into the cache. If recovery fails, the shard starts normally and falls back to LVQ until the next qualifying merge trains a new model.
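The recovery scan reduces to picking the best-quality model among the committed segments. A minimal sketch, assuming the segment scan yields (quality, blob) pairs:

```python
# Illustrative recovery sketch: pick the highest-quality model found on disk.
RANK = {"INITIAL": 1, "FINAL": 2}

def recover_highest_quality(segment_models):
    """segment_models: (quality, blob) pairs from .knnlvm files; None if none found."""
    best = None
    for quality, blob in segment_models:
        if best is None or RANK[quality] > RANK[best[0]]:
            best = (quality, blob)
    return best
```

Returning None corresponds to the failure path described above: the shard starts with the LVQ fallback until the next qualifying merge trains a new model.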
Fallback Encoding
When no model is available, the index writer swaps the LeanVec component in the Faiss index factory string to LVQ:
SVSVamana64,LeanVec4x8_192 → SVSVamana64,LVQ4x8
LVQ and LeanVec segments coexist during search, and old LVQ segments are replaced with LeanVec segments as Lucene naturally merges them.
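The component swap shown above can be sketched as a string rewrite. This follows the example's string shapes only; the real mapping is performed inside the plugin, and the parsing here is a hypothetical simplification.

```python
# Illustrative fallback rewrite: LeanVec4x8_192 -> LVQ4x8 (keep the bit widths,
# drop the target dimensions, which only apply to LeanVec).
def fallback_factory_string(factory: str) -> str:
    out = []
    for component in factory.split(","):
        if component.startswith("LeanVec"):
            bits = component[len("LeanVec"):].split("_")[0]  # e.g. "4x8"
            out.append("LVQ" + bits)
        else:
            out.append(component)
    return ",".join(out)
```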
User Experience
With automatic training, the user creates the index and starts bulk indexing. Training happens during merge.
PUT /my-index
{
"settings": { "index": { "knn": true } },
"mappings": {
"properties": {
"embedding": {
"type": "knn_vector",
"dimension": 768,
"method": {
"name": "svs_vamana",
"engine": "faiss",
"space_type": "l2",
"parameters": {
"degree": 64,
"encoder": {
"name": "leanvec",
"parameters": {
"primary_bits": 4,
"residual_bits": 8,
"dimensions": 192,
"training_threshold": 100000,
"initial_training_threshold": 10000
}
}
}
}
}
}
}
}

The training_threshold and initial_training_threshold parameters enable automatic training. The model_id and _train API are not needed.
Related RFCs
- [RFC] Add Intel SVS (Scalable Vector Search) as a new Faiss index type: Adds the svs_vamana method and its encoders (LVQ, LeanVec, flat, sq). Without automatic training, LeanVec requires the existing model_id/_train workflow.