Is your feature request related to a problem? Please describe
OpenSearch supports remote-backed storage, which allows users to back up a cluster's data on remote storage, providing a stronger durability guarantee. The data is always hot (present locally on disk) and the remote storage acts as a backup.
We can also leverage the remote store to support shard-level data tiering, where all data is not guaranteed to be present locally but can instead be fetched on demand at runtime (call this a warm shard). This allows storage to scale separately from compute, so users can store more data with the same number of nodes.
Benefits:
- Lower cost to customers as they can store more data per node.
- Can be further leveraged for building data tiering like hot, warm & cold (cold being data archived on durable storage; exact terminology can be decided later).
- Fundamental building block for OpenSearch to support a serverless architecture with writable remote storage.
Describe the solution you'd like
What is Writable warm:
A warm index is an OpenSearch index where the entire index data is not guaranteed to be present locally on disk. The shards are always open & assigned to one of the eligible nodes, the index is active, and its metadata is part of the cluster state.
- With warm index support, the Lucene index (shard) can operate by retaining only the "essential" data on disk and fetching the remaining data from the remote store on demand.
- Since these segment files can be large (max 5 GB in the current configuration), downloading entire files in the critical write/search path would incur higher latencies (depending on file sizes & network). To minimize the impact, we should support downloading smaller blocks of data based on the access pattern (block-level fetch).
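For illustration, here's a minimal sketch of the block arithmetic behind such fetches; the 8 MB block size and the block naming scheme are assumptions for illustration, not the final design:

```java
// Sketch: mapping a logical file offset to a fetchable block.
// Block size and the block naming scheme are illustrative assumptions.
public final class BlockLocator {
    private static final long BLOCK_SIZE = 8L * 1024 * 1024; // assumed 8 MB blocks

    /** Index of the block containing the given logical file offset. */
    public static long blockIndex(long fileOffset) {
        return fileOffset / BLOCK_SIZE;
    }

    /** Start offset of a block within the original file. */
    public static long blockStart(long blockIndex) {
        return blockIndex * BLOCK_SIZE;
    }

    /** Hypothetical cache key for one downloaded block of a segment file. */
    public static String blockFileName(String segmentFile, long blockIndex) {
        return segmentFile + "._block_" + blockIndex;
    }

    public static void main(String[] args) {
        long offset = 20L * 1024 * 1024;        // a read at the 20 MB mark
        System.out.println(blockIndex(offset)); // -> 2
        System.out.println(blockFileName("_0.cfs", blockIndex(offset))); // -> _0.cfs._block_2
    }
}
```

A read at any offset then only requires the single block covering it to be present in the file cache, rather than the whole file.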
Possible configurations:
- Non-dedicated warm node: Hot & warm shards reside on the same nodes.
- Dedicated warm node: All warm shards reside on dedicated warm nodes (nodes with a dedicated warm node role). This should help improve resource isolation between hot and warm shards and thus lower the blast radius.
Scope:
Includes:
- Ability to read/write on a warm index.
- Support warm replicas.
- File cache management.
- Indexing backpressure.
Excludes (will be covered as separate tasks):
- Tier management & APIs
- Shard rebalancing logic.
- Cluster manager optimizations.
- Optimizations on the search path.
Core Components:
Composite Directory:
The Directory abstraction provides a set of common APIs for creating, deleting, and modifying files within the directory, regardless of the underlying storage mechanism. The Composite Directory will abstract out all the data tiering logic and make the Lucene index (aka shard) agnostic of data locality.
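A minimal sketch of the shape this could take, assuming Lucene's FilterDirectory as the base and a stand-in FileCache interface; the actual OpenSearch types and method names will differ:

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;

/** Stand-in for the node-level file cache described in the next section. */
interface FileCache {
    void track(String name, boolean evictable);
    void markEvictable(String name);
    void remove(String name);
    boolean isFullyLocal(String name);
}

/**
 * Sketch: writes go to the local delegate and are tracked by the file
 * cache; reads can be served locally or from the remote store. The shard
 * itself never needs to know where a file's bytes currently live.
 */
public class CompositeDirectory extends FilterDirectory {
    private final FileCache fileCache;
    private final Directory remoteDirectory; // remote-store-backed directory (stand-in)

    public CompositeDirectory(Directory local, Directory remote, FileCache fileCache) {
        super(local); // the local FSDirectory is the delegate
        this.remoteDirectory = remote;
        this.fileCache = fileCache;
    }

    @Override
    public IndexOutput createOutput(String name, IOContext context) throws IOException {
        // New segment files are always written locally first...
        IndexOutput output = super.createOutput(name, context);
        // ...and tracked as non-evictable until they are uploaded.
        fileCache.track(name, /* evictable= */ false);
        return output;
    }

    @Override
    public void deleteFile(String name) throws IOException {
        fileCache.remove(name); // stop tracking
        super.deleteFile(name); // delete the local copy if present
        // Remote copies are cleaned up by remote store garbage collection, not here.
    }

    /** Called after a successful remote upload: the local copy may now be evicted. */
    public void afterUpload(String name) {
        fileCache.markEvictable(name);
    }
}
```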
File Cache:
File Cache will track the Lucene files of all shards (via the composite directory) at the node level. An implementation of FileCache already exists (from the searchable snapshot integration). We'll need to build support for additional features like cache eviction, file pinning, tracking hot indices, etc.
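A minimal sketch of such a cache, assuming LRU eviction plus pinning for files that must not leave disk (e.g. not yet uploaded); names and semantics are illustrative:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch: a node-level LRU file cache with pinning. Pinned files (not yet
 * uploaded, or belonging to hot shards) are never evicted; everything else
 * is evicted in least-recently-used order once capacity is exceeded.
 * Assumes each file is tracked at most once.
 */
public class NodeFileCache {
    private static final class Entry {
        final long sizeBytes;
        boolean pinned;
        Entry(long sizeBytes, boolean pinned) { this.sizeBytes = sizeBytes; this.pinned = pinned; }
    }

    private final long capacityBytes;
    private long usedBytes;
    // Access-ordered map gives LRU iteration order (oldest first) for free.
    private final LinkedHashMap<String, Entry> entries = new LinkedHashMap<>(16, 0.75f, true);

    public NodeFileCache(long capacityBytes) { this.capacityBytes = capacityBytes; }

    public synchronized void track(String file, long sizeBytes, boolean pinned) {
        entries.put(file, new Entry(sizeBytes, pinned));
        usedBytes += sizeBytes;
        evictIfNeeded();
    }

    /** Unpin once the file is safely persisted in the remote store. */
    public synchronized void markEvictable(String file) {
        Entry e = entries.get(file);
        if (e != null) { e.pinned = false; }
        evictIfNeeded();
    }

    private void evictIfNeeded() {
        Iterator<Map.Entry<String, Entry>> it = entries.entrySet().iterator();
        while (usedBytes > capacityBytes && it.hasNext()) {
            Map.Entry<String, Entry> candidate = it.next();
            if (candidate.getValue().pinned) continue; // never evict pinned files
            usedBytes -= candidate.getValue().sizeBytes;
            it.remove(); // hook for deleting the local copy goes here
        }
    }
}
```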
Index Inputs:
For all read operations, the composite directory should be able to create and return different index inputs. For example:
- IndexInput for local files.
- Blocked IndexInput for remote files.
- Non-block IndexInput for remote files.
- IndexInput associated with a single downloaded block in FileCache.
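A standalone sketch of the dispatch this implies. Lucene's IOContext does mark read-once usage, but the exact policy below (and the input kinds) is illustrative, not the final design:

```java
/**
 * Sketch: the decision behind CompositeDirectory#openInput, modeled
 * standalone to show just the dispatch between the IndexInput flavors
 * listed above. All names are illustrative.
 */
public final class IndexInputDispatch {
    public enum InputKind { LOCAL, REMOTE_NON_BLOCK, REMOTE_BLOCKED }

    public static InputKind choose(boolean fullyLocal, boolean readOnce) {
        if (fullyLocal) {
            return InputKind.LOCAL; // plain IndexInput over the local file
        }
        if (readOnce) {
            // One-shot sequential reads (e.g. checksum verification) can
            // stream the whole file from remote without block bookkeeping.
            return InputKind.REMOTE_NON_BLOCK;
        }
        // Random-access reads fetch fixed-size blocks on demand; each
        // downloaded block is served via its own FileCache-backed IndexInput.
        return InputKind.REMOTE_BLOCKED;
    }
}
```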
Prefetcher:
To maintain optimal read & write performance, we want to cache some of the data to speed up various shard-level operations (read, write, flush, merges, etc.). This component should be able to trigger prefetching of certain files/blocks when a shard opens or a replica refreshes.
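A minimal sketch of the trigger points, assuming a hypothetical BlockFetcher for the actual downloads and a dedicated executor so prefetching never blocks the write path:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;

/**
 * Sketch: warms the cache when a shard opens or a replica refreshes.
 * What exactly to prefetch (here: header/footer blocks of every segment
 * file) is a policy decision; the trigger points are the important part.
 */
public class Prefetcher {
    /** Stand-in for the component that downloads blocks into the FileCache. */
    public interface BlockFetcher { void fetchHeaderAndFooter(String file); }

    private final ExecutorService executor; // dedicated pool, isolated from write threads
    private final BlockFetcher fetcher;

    public Prefetcher(ExecutorService executor, BlockFetcher fetcher) {
        this.executor = executor;
        this.fetcher = fetcher;
    }

    /** Called when a warm shard opens, or after a replica applies a new checkpoint. */
    public void onShardOpenOrRefresh(List<String> segmentFiles) {
        for (String file : segmentFiles) {
            // Best-effort: a failed prefetch only costs latency later, never correctness.
            executor.submit(() -> fetcher.fetchHeaderAndFooter(file));
        }
    }
}
```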
User flows:
- Indexing flow:
- Write requests are written to indexing buffer, segment files are created during refresh which are then synced to disk.
- FileCache will start tracking the locally created files.
- Files are synced to remote store with each refresh.
- After files are uploaded, Composite Directory updates the file state. Files are now marked as eligible for eviction.
- Eventually we'd want the file sync logic itself to be abstracted behind the composite directory, so that data transfer in both directions is encapsulated in the directory.
- Replication & Recovery flow:
- During recovery from the remote store, we should only download the segment info and open the reader. All other required files are fetched at runtime (and can be prefetched as an optimization).
- Search flow:
- For any read of a file that is not present locally, the file cache will download the required block at runtime.
High level changes (more details to be added in individual issues):
- Create warm index and integrate with FileCache & composite directory (link)
- Add replication & recovery flow for warm indices (replication, shard relocation, primary promotion).
- File Cache eviction strategy.
- Exposing cache stats & eviction metrics.
- Supporting encryption for block based downloads.
- Indexing flow optimizations (based on performance benchmarking).
- Prefetch optimizations.
- Introducing indexing backpressure for warm shards.
- Separate read/write threadpool for warm indices (a must-have for the non-dedicated setup).
Performance optimizations:
- Cache header & footer blocks for each segment file
- Requires special handling for compound files.
- The block size (e.g. 8 MB) would be much bigger than the header/footer (let's say 32 bytes). If we always retained full header & footer blocks for each file in the FileCache, we'd be caching much more data than we actually need. We can handle the header & footer downloads separately with smaller sizes (see the sketch after this list).
- Tune the TTL for newly created segment files to minimize redownloading segments during merges (on a best-effort basis).
- Tuning for merges:
- Tune merge parameters separately for warm indices behind dedicated settings.
- Based on performance benchmarks, we can optimize toward (1) lowering the number of segments for optimal search performance, and (2) increasing the max segment sizes.
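On the header & footer point above, here's a minimal sketch of fetching exact byte ranges instead of whole blocks. The 16-byte Lucene codec footer length is real (CodecUtil.footerLength()), while the header length used here is a simplifying assumption, since real header lengths vary per codec and file:

```java
/**
 * Sketch: compute the header/footer byte ranges of a segment file so they
 * can be downloaded and pinned separately from regular 8 MB blocks.
 */
public final class HeaderFooterRanges {
    public record ByteRange(long offset, long length) {}

    // Lucene codec footer: magic (4) + algorithm ID (4) + checksum (8) = 16 bytes.
    private static final long FOOTER_LEN = 16;
    // Illustrative upper bound; real header lengths are codec-dependent.
    private static final long ASSUMED_HEADER_LEN = 64;

    /** Ranges to download and pin separately from regular blocks. */
    public static ByteRange[] ranges(long fileLength) {
        if (fileLength <= ASSUMED_HEADER_LEN + FOOTER_LEN) {
            return new ByteRange[] { new ByteRange(0, fileLength) }; // tiny file: take it all
        }
        return new ByteRange[] {
            new ByteRange(0, ASSUMED_HEADER_LEN),               // header
            new ByteRange(fileLength - FOOTER_LEN, FOOTER_LEN), // footer
        };
    }
}
```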
Other considerations:
Hot/warm migration strategy:
To provide the best user experience for data tiering (hot/warm/cold), we should be able to seamlessly migrate a cluster from hot to warm. To support this, here are a few key decisions:
- For a seamless transition from hot to warm, we can use the CompositeDirectory & FileCache for hot files as well. From the Composite Directory's perspective, it should operate with a dynamic index-level setting that defines the percentage of data to retain on disk (100% being hot).
- This is useful when we're hosting warm & hot indices on the same node. For the dedicated warm node setup, tracking hot files in the FileCache doesn't bring any advantage.
- Would require additional effort to ensure that the FileCache doesn't become a bottleneck for hot shards.
- Similarly, we would be using the same engines as a hot index (InternalEngine for the primary & NRTReplicationEngine for replicas). We might still need to change a few characteristics for warm shards (e.g. refresh interval, translog rollover duration, merge policy thresholds, etc.), as sketched below.
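A sketch of the kind of per-index overrides this implies. The setting keys below are existing index settings, but the values are purely illustrative and would be determined by benchmarking:

```java
import org.opensearch.common.settings.Settings;

/**
 * Sketch: warm-specific overrides layered on top of the hot defaults.
 * Values are placeholders, not recommendations.
 */
public final class WarmIndexDefaults {
    public static Settings overrides() {
        return Settings.builder()
            .put("index.refresh_interval", "60s")                 // slower refresh: fewer remote syncs
            .put("index.translog.flush_threshold_size", "1gb")    // roll the translog less often
            .put("index.merge.policy.max_merged_segment", "10gb") // favor fewer, larger segments
            .build();
    }
}
```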
Note: Migration strategy to be covered in a separate issue.
Supporting custom codecs
One caveat with prefetching is that, to prefetch certain blocks, you might need to read files using methods which are not really exposed by the respective readers in Lucene. A few examples:
- To be able to fetch header & footer blocks for compound files, we'll need to read the entry file and identify all the index files & their offsets.
- For optimizing the search path, we might want to prefetch the StoredFields for certain documents, which requires getting the file offset for the document ID.
These use cases require methods which aren't exposed, forcing us to maintain Lucene-version-specific logic to support the prefetch operation.
This restriction also breaks the compatibility of the prefetch logic with custom codec plugins. To get around this, we can disable prefetch for custom codecs, but that essentially restricts users from using other codecs. Alternatively (preferred), we can ask codec owners to support the prefetch logic: we can expose the required methods behind a new plugin interface, to be implemented by codec owners if they want to support warm indices.
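A sketch of what such a plugin interface could look like; nothing below exists in OpenSearch today, it only illustrates the proposal. A codec that doesn't implement it would simply get no prefetching:

```java
import java.io.IOException;
import java.util.List;

/**
 * Hypothetical extension point that codec owners could implement to keep
 * prefetch working with custom codecs.
 */
public interface PrefetchablePlugin {
    /** A contiguous byte range of one file that is worth prefetching. */
    record PrefetchRange(String fileName, long offset, long length) {}

    /**
     * Ranges to warm for a segment file, e.g. the per-entry offsets inside
     * a compound (.cfs) file, or stored-fields offsets for hot documents.
     */
    List<PrefetchRange> prefetchRanges(String segmentFileName) throws IOException;
}
```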
Future Enhancements
Here are some enhancements to explore after the performance benchmarking.
- Exploring the optimal merge policy for warm shards. Either we can tune the existing TieredMergePolicy with separate merge settings for warm shards, or, if required, we can explore a different policy that minimizes the number of segments on the remote store.
- Exploring other caching strategies instead of LRU. LRU isn't resistant to overscan (e.g. segment merges) and can potentially impact cache effectiveness. We can explore alternatives like ARC and LIRS (thanks @Bukhtawar for the recommendation).
- We can track the working set for each shard and persist that information in the remote store metadata. The working set for a shard can be the set of its most-used file blocks. During shard initialization/relocation, the working set can be downloaded before marking the shard as active.
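A minimal sketch of the working-set tracking this describes, with all names illustrative:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/**
 * Sketch: track a shard's most-accessed blocks so the set can be persisted
 * to remote store metadata and re-downloaded on recovery/relocation.
 */
public class WorkingSetTracker {
    private final Map<String, LongAdder> accessCounts = new ConcurrentHashMap<>();

    /** Called by the FileCache on every block read. */
    public void recordAccess(String blockKey) {
        accessCounts.computeIfAbsent(blockKey, k -> new LongAdder()).increment();
    }

    /** Top-N blocks to persist alongside the shard's remote metadata. */
    public List<String> workingSet(int n) {
        return accessCounts.entrySet().stream()
            .sorted(Comparator.comparingLong(
                (Map.Entry<String, LongAdder> e) -> e.getValue().sum()).reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .toList();
    }
}
```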
FAQs:
Q: How are files evicted from disk?
Files can be evicted from disk due to:
- Cache overflow: Apart from node-level usage, we can also track shard-level usage and evict older files.
- TTL: Depending on the use case, users might want either of the options below:
- Reduce the shard footprint over time with TTL based evictions.
- Retain the files locally as long as the FileCache has available space.
Q: How would merges work for warm shard?
Merges can require downloading files from the remote store. Eventually we'd want support for offline merges on a dedicated node (related issue).
Q: How to determine the essential data to be fetched for a shard?
We can actively track the most-used blocks based on FileCache statistics. This information can be retained in the remote store and forms the set of essential data to download while initializing a shard.