[Discuss] Design for Tiered File Cache & Block Level fetch #9987
Description
Following on from the RFC for the Multi-tiered File Cache, this issue captures the low-level components and flows.
The Multi-tiered File Cache works along with the Directory abstraction (for storing Lucene segment files), which allows Lucene indices (a.k.a. shards) to be agnostic of data locality (memory, local disk, remote store). Internally it implements a multi-tiered cache (Tier 1: memory-mapped, Tier 2: disk, Tier 3: remote store) and manages the data across all tiers as well as its movement between them.
Use case:
Ideally, all types of indices (local, warm, remote-backed hot) can be managed with the File Cache, where the tiers and their promotion criteria can differ depending on the shard type. For the current scope, however, we intend to enable this support for remote-based warm indices, as they gain the maximum benefit from this abstraction.
This component will allow us to lazy-load a shard's data for reads and writes on demand. This is helpful because it allows us to write data to a warm shard without loading the entire dataset, thereby improving time to recovery. Similarly, we can also explore enabling support for warm shards with 0 replicas (for non-critical indices). Another benefit (with block-level fetch) is a lower memory footprint for the shard on the node.
This component will also allow us to introduce a working-set model for every shard, where the working set of a shard is defined as the set of files (or blocks of files) the shard is currently using (for reads, writes, merges, etc.). With this model, shards gain the ability to lazy-load the data they need into their working set. Since the working set of a shard is a function of time, files (or blocks of files) in the working set are expected to be evicted from and added to the file cache on demand.
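The working-set model above can be sketched as an LRU-bounded set of file blocks per shard. This is an illustrative sketch only: the class and method names (`WorkingSet`, `BlockKey`, `touch`) are hypothetical, not the actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a per-shard working set: the file blocks a shard
// has touched recently. An access-ordered LinkedHashMap gives LRU order;
// the oldest entry is dropped once the working set exceeds a cap.
class WorkingSet {
    record BlockKey(String fileName, long blockIndex) {}

    private final int maxBlocks;
    private final LinkedHashMap<BlockKey, Long> blocks; // key -> last access time

    WorkingSet(int maxBlocks) {
        this.maxBlocks = maxBlocks;
        this.blocks = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<BlockKey, Long> eldest) {
                return size() > WorkingSet.this.maxBlocks; // evict the LRU block
            }
        };
    }

    void touch(String file, long block) {
        blocks.put(new BlockKey(file, block), System.nanoTime());
    }

    boolean contains(String file, long block) {
        return blocks.containsKey(new BlockKey(file, block));
    }

    int size() {
        return blocks.size();
    }
}
```

With a cap of two blocks, touching a third block silently evicts the least-recently-used one, mirroring how working-set entries come and go over time.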
Caveats:
- The Tier 1 boundary is not well defined: data can be memory-mapped, but the system can still page it out to disk.
- Tier 2 to Tier 3 promotion is not handled internally. In the current implementation, the hooks and logic to migrate data to the remote store live outside this abstraction. We can revisit this in later iterations.
- The Tier 3 lifecycle is not managed by the File Cache (at least not yet), e.g. deleting stale data or deleting the complete data on the remote store.

Tier 3 is the source of truth for the shard. As soon as new data is uploaded to the remote store, the File Cache starts tracking the new files.
How it works
- The File Cache will track and maintain the list of files for all shards (via the composite directory) at the node level.
- Internally it will maintain a TierMap to track files and their respective tiers.
- For Tier 1 and Tier 2, it will maintain separate LRU caches (for memory and disk), each with its own cache lifecycle manager and policies.
  - The Cache Lifecycle Manager is responsible for moving files across tiers.
  - We will also support disabling eviction from Tier 2 if we want to keep all data hot.
- In Tier 1, each entry is an open, ref-counted IndexInput, ensuring the data stays in memory. For Tier 2, we only track the files and their access patterns (frequency of use, last used/TTL, etc.).
- Both Tier 1 and Tier 2 will also maintain separate stats (usage, evictions, hit rate, miss rate, etc.) at the shard and node level.
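A minimal sketch of the TierMap idea described above, assuming a plain map from file name to current tier. All type and method names here are illustrative assumptions, not the actual OpenSearch classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the TierMap: a node-level map from file name to
// the tier the file currently resides in.
class TierMap {
    enum Tier { MEMORY, DISK, REMOTE } // Tier 1, Tier 2, Tier 3

    private final Map<String, Tier> fileToTier = new ConcurrentHashMap<>();

    // Called e.g. from CompositeDirectory.afterUpload once a file is durable
    // in the remote store; the remote tier is the source of truth.
    void track(String fileName) {
        fileToTier.putIfAbsent(fileName, Tier.REMOTE);
    }

    // Record a tier change after a promotion (or demotion on eviction).
    void moveTo(String fileName, Tier tier) {
        fileToTier.put(fileName, tier);
    }

    Tier tierOf(String fileName) {
        return fileToTier.get(fileName);
    }
}
```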
User Flows:
- Write flow / start tracking a file:
  - A new file is created in the local Store.
  - Files are uploaded to the Remote Store.
  - Upon upload, we invoke CompositeDirectory.afterUpload to start tracking the new files in the TierMap.
- Read a file:
  - CompositeDirectory.openInput is invoked to get an IndexInput for the file.
  - Look up the file in the TierMap:
    - If the file is present in Tier 1, return the IndexInput from the FileCache (internally, this also rotates the file to the back of the LRU).
    - If present in Tier 2, move it to Tier 1 by opening the file (as a block) memory-mapped in Tier 1.
      - We can explore not memory-mapping files in certain contexts (e.g. segment merges) to avoid disruptions in Tier 1.
    - If present in Tier 3, move it to Tier 2 and subsequently to Tier 1.
      - Tier 3 to Tier 2 transfers will always fetch block files instead of complete files.
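The Tier 1 entries in the read flow above are ref-counted so that an open IndexInput cannot be evicted while a reader still holds it. A minimal sketch of that invariant, with hypothetical names (`RefCountedHandle`, `tryEvict`) rather than the actual Lucene/OpenSearch types:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a ref-counted Tier 1 entry: eviction is only
// permitted once the reference count has dropped back to zero.
class RefCountedHandle {
    private final AtomicInteger refCount = new AtomicInteger(0);
    private volatile boolean closed = false;

    void incRef() {
        if (closed) throw new IllegalStateException("handle already closed");
        refCount.incrementAndGet();
    }

    void decRef() {
        if (refCount.decrementAndGet() < 0) {
            throw new IllegalStateException("decRef without matching incRef");
        }
    }

    // The Tier 1 cache may call this when the entry reaches the LRU end;
    // it succeeds only when no reader holds a reference.
    boolean tryEvict() {
        if (refCount.get() == 0) {
            closed = true;
            return true;
        }
        return false;
    }
}
```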
Movement across Tiers:
- Tier promotion is on demand.
- We also want to maintain additional metadata (i.e. the working set of the minimum files required locally to open the shard). These can be fetched before a shard is started.
- Tier 1 to Tier 2 (assuming refCount = 0):
  - LRU-based eviction.
  - TTL-expiry based.
  - Eviction based on available memory (e.g. at 80% of total available space).
- Tier 2 to Tier 3:
  - TTL-expiry based.
  - Eviction based on available disk space (e.g. at 80% of total available space).
  - An optional setting to never evict files from Tier 2.
- Tier 3 lifecycle policies (not in scope):
  - Remove stale commits (older than X days / only the last X commits).
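The capacity-based eviction above (evict least-recently-used files until usage falls back under a high watermark such as 80%) can be sketched as follows. The class name, watermark handling, and byte accounting are illustrative assumptions, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of watermark-driven eviction for a tier: once used
// bytes exceed highWatermark * capacity, evict LRU files until back under.
class CapacityEvictor {
    private final long capacityBytes;
    private final double highWatermark; // e.g. 0.8 => evict above 80% usage
    private long usedBytes = 0;
    // access-ordered map: iteration order is least-recently-used first
    private final LinkedHashMap<String, Long> files = new LinkedHashMap<>(16, 0.75f, true);

    CapacityEvictor(long capacityBytes, double highWatermark) {
        this.capacityBytes = capacityBytes;
        this.highWatermark = highWatermark;
    }

    void add(String file, long sizeBytes) {
        files.put(file, sizeBytes);
        usedBytes += sizeBytes;
    }

    // Returns the files evicted (oldest first) to get back under the watermark.
    List<String> maybeEvict() {
        List<String> evicted = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = files.entrySet().iterator();
        while (usedBytes > (long) (capacityBytes * highWatermark) && it.hasNext()) {
            Map.Entry<String, Long> eldest = it.next();
            usedBytes -= eldest.getValue();
            evicted.add(eldest.getKey());
            it.remove();
        }
        return evicted;
    }
}
```

For example, with a 100-byte capacity and a 0.8 watermark, adding files of 50 and 40 bytes (90 used, above the 80-byte threshold) evicts only the older file, leaving 40 bytes in use.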
Future Enhancements:
- Integration with remote-backed hot indices.
- Tier 3 lifecycle management (promotion, stale-data cleanup, index deletion).
- Working set enhancements/optimizations.
- Block prefetching for performance improvements, e.g.:
  - Always prefetch header and footer blocks.
  - Prefetch the next block (spatial locality).
- File cache lifecycle based on IOContext.
- DirectIO for merges to avoid disruptions in the File Cache.