[Draft][RFC] Multi-tiered file cache #8891
Description
Introduction
As an extension to the writeable warm tier and the fetch-on-demand composite directory feature in OpenSearch, we are proposing a new abstraction that lets the composite directory read files seamlessly, without worrying about where a file is located.
Today in OpenSearch (with remote store), a shard must download its full data from the remote store to local disk before it can accept reads or writes.
With the file cache (and the fetch-on-demand composite directory), shards can accept reads and writes without keeping all of their data on local storage.
Proposed Solution
We are proposing a new FileCache abstraction that manages the lifecycle of all committed (segment) files for the shards present on a node.

FileCache provides three storage tiers for the files.
- Tier1: Entries in tier1 are memory-mapped files.
- Tier2: Entries in tier2 are pointers to files on local storage.
- Tier3: Entries in tier3 are pointers to files in remote storage.
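As an illustrative sketch (the names here are hypothetical, not the proposed API), an entry in the cache would need to track at least the file's current tier, its size, and a reference count:

```java
// Hypothetical sketch of a file cache entry; names are illustrative only.
enum FileCacheTier { TIER_1, TIER_2, TIER_3 }

final class CachedFileEntry {
    final String fileName;    // committed segment file, e.g. "_0.cfs"
    final long sizeBytes;     // counts toward the tier's capacity
    FileCacheTier tier;       // where the file currently lives
    int refCount;             // 0 => eligible for demotion/eviction
    long lastAccessMillis;    // reset on every access; drives TTL policies

    CachedFileEntry(String fileName, long sizeBytes, FileCacheTier tier) {
        this.fileName = fileName;
        this.sizeBytes = sizeBytes;
        this.tier = tier;
    }
}
```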
When a node boots up, the file cache is initialized with a maximum capacity for tier1 and tier2, where capacity is defined as the total size in bytes of the files associated with the entries in that tier.
Each shard's primary persists its working set as part of the commit; the working set contains the names of the segment files and the file cache tiers those files reside in. During shard recovery, a shard's working set is used to repopulate the file cache with segment files across the different tiers.
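For illustration, the per-commit working set could be as simple as a mapping from segment file name to tier (a sketch only, not the proposed on-disk format):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical working-set snapshot persisted with each commit: it maps each
// committed segment file to the tier it currently resides in, so that shard
// recovery can repopulate the file cache across tiers.
final class WorkingSet {
    private final Map<String, String> fileToTier = new LinkedHashMap<>();

    void record(String segmentFile, String tier) {
        fileToTier.put(segmentFile, tier);
    }

    Map<String, String> snapshot() {
        return Map.copyOf(fileToTier);
    }
}
```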
After every upload to the remote store, entries are created in tier2 and tier3 for the uploaded files. Newly written files that have not yet been committed are not managed by FileCache.
The file cache internally runs a periodic file lifecycle manager that evaluates the lifecycle policies for each tier and takes action accordingly.
Lifecycle policies for a file cache can be set through cluster settings. We are defining the following major cluster settings:
- Capacity of tier1 and tier2
- TTLs for tier1 and tier2.
- Tier3 stale commit retention [TBD]
- Configuring tier2 to stay in sync with tier3 (this keeps all data hot)
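To make these settings concrete, here is a sketch with hypothetical setting keys and sample values; the actual names and defaults would be decided during implementation:

```java
import java.util.Map;

// Hypothetical cluster setting keys for the policies above; illustrative only.
final class FileCacheSettings {
    static final Map<String, String> DEFAULTS = Map.of(
        "file_cache.tier1.capacity", "4gb",
        "file_cache.tier2.capacity", "100gb",
        "file_cache.tier1.ttl", "5m",
        "file_cache.tier2.ttl", "1h",
        "file_cache.tier2.sync_with_tier3", "false"  // "true" keeps all data hot
    );
}
```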
Tier1 lifecycle policies
Files that are not actively used in tier1 (zero reference count) are eligible for movement from tier1 to tier2 based on the following policies.
- Every entry in tier1 has a TTL associated with it (which resets on every access). Entries that have been idle longer than tier1_ttl are marked for movement from tier1 to tier2.
- If tier1 usage reaches a threshold (x% of the maximum tier1 capacity) and no entries have been marked for movement by the TTL policy, entries are moved to tier2 in LRU fashion.
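The two policies above can be sketched as a two-pass selection (a minimal illustration, assuming hypothetical entry fields; the real manager would also handle concurrency and partial demotion):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the tier1 demotion pass. Pass 1 demotes unreferenced
// entries idle longer than tier1_ttl; pass 2 runs only if tier1 is still above
// its capacity threshold, demoting unreferenced entries in LRU order.
final class Tier1Policy {
    record Entry(String name, long sizeBytes, int refCount, long idleMillis) {}

    static List<String> selectForDemotion(List<Entry> entries, long ttlMillis,
                                          long usedBytes, long thresholdBytes) {
        List<String> demote = new ArrayList<>();
        long remaining = usedBytes;
        for (Entry e : entries) {                       // pass 1: TTL expiry
            if (e.refCount() == 0 && e.idleMillis() > ttlMillis) {
                demote.add(e.name());
                remaining -= e.sizeBytes();
            }
        }
        if (remaining > thresholdBytes) {               // pass 2: LRU fallback
            List<Entry> lru = entries.stream()
                .filter(e -> e.refCount() == 0 && !demote.contains(e.name()))
                .sorted(Comparator.comparingLong(Entry::idleMillis).reversed())
                .toList();
            for (Entry e : lru) {
                if (remaining <= thresholdBytes) break;
                demote.add(e.name());
                remaining -= e.sizeBytes();
            }
        }
        return demote;
    }
}
```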
Tier2 lifecycle policies
Files that are not actively used in tier2 (zero reference count) are eligible for movement from tier2 to tier3 based on the following policies.
- Every entry in tier2 has a TTL associated with it. Entries that have been idle longer than tier2_ttl are marked for movement from tier2 to tier3.
- If tier2 usage reaches a threshold (x% of the maximum tier2 capacity) and no entries have been marked for movement by the TTL policy, entries are removed from tier2 in LFU fashion.
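The tier2 pass mirrors the tier1 sketch, with the capacity fallback ordered by access frequency instead of recency (again a hypothetical illustration, not the proposed implementation):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the tier2 eviction pass. TTL-expired, unreferenced
// entries move to tier3 first; if tier2 is still over its capacity threshold,
// unreferenced entries are evicted in least-frequently-used (LFU) order.
final class Tier2Policy {
    record Entry(String name, long sizeBytes, int refCount,
                 long idleMillis, long accessCount) {}

    static List<String> selectForEviction(List<Entry> entries, long ttlMillis,
                                          long usedBytes, long thresholdBytes) {
        List<String> evict = new ArrayList<>();
        long remaining = usedBytes;
        for (Entry e : entries) {                       // TTL expiry -> tier3
            if (e.refCount() == 0 && e.idleMillis() > ttlMillis) {
                evict.add(e.name());
                remaining -= e.sizeBytes();
            }
        }
        if (remaining > thresholdBytes) {               // LFU fallback
            List<Entry> lfu = entries.stream()
                .filter(e -> e.refCount() == 0 && !evict.contains(e.name()))
                .sorted(Comparator.comparingLong(Entry::accessCount))
                .toList();
            for (Entry e : lfu) {
                if (remaining <= thresholdBytes) break;
                evict.add(e.name());
                remaining -= e.sizeBytes();
            }
        }
        return evict;
    }
}
```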
Tier3 lifecycle policies
Files belonging to stale commits older than X days are eligible for eviction from tier3. This can be done as later work.
File Cache Stats
The cache also provides metrics around its usage, evictions, hit rate, miss rate, etc., at both shard and node granularity.
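A per-shard stats snapshot might look like the following (hypothetical field names; node-level stats would aggregate these across all shards on the node):

```java
// Hypothetical per-shard file cache stats snapshot; illustrative only.
record FileCacheStats(long hits, long misses, long evictions, long usedBytes) {
    double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```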
File Cache Admission Control and throttling
We are also going to add admission control on the file cache, based on its current capacity, the maximum configured capacity, and the file cache watermark settings; breaching these can result in a read/write block on the node.
We are also proposing to track the total bytes downloaded from the remote store and the total bytes removed from local storage by the lifecycle manager, and to throttle the download of new files from the remote store based on the total bytes downloaded to disk, measured against the disk watermark settings.
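The core admission check reduces to comparing current usage against a watermark fraction of the configured maximum, as in this sketch (hypothetical names; the real check would tie into cluster blocks):

```java
// Hypothetical admission-control check: once file cache usage crosses the
// configured watermark fraction of its maximum capacity, reads/writes on the
// node would be blocked until usage drops back below the watermark.
final class FileCacheAdmissionController {
    static boolean shouldBlock(long usedBytes, long maxBytes, double watermark) {
        return usedBytes >= (long) (maxBytes * watermark);
    }
}
```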
Potential Issues
- We may see higher latencies on updates.
- Segment merging can have more impact on read/write latencies.
Future work
- Marking stale commits for deletion in tier3.
  - The file cache lifecycle manager can potentially mark stale commits in tier3 for deletion.
- Tier3 capacity management.
  - The file cache lifecycle manager can define bounds on the number of days a stale commit is kept in remote store.
- Offloading segment merges from the primary writer.
  - Segment merges tend to pollute the working set of a shard and the file cache. We can potentially offload merges away from the primary writer.
- File cache lifecycle based on IOContext.
  - The file cache API can potentially take an IO context as input, specifying the operation for which the file is accessed (e.g. merge, read). This context can be used to define lifecycle policies based on operations.