
[Feature] Hybrid Compression #13110

@sarthakaggarwal97

Description


Is your feature request related to a problem? Please describe

In Lucene, we have Stored Fields. The text in such fields is stored in the index literally, in a non-inverted form.

By default, OpenSearch stores the _source field of an index as a stored field, and users have the option to store other document fields as well. These fields are compressed and written to the segment's .fdt file. The index codec determines the compression algorithm used to compress and decompress these stored fields.
Whether these fields are compressed on the write path depends on two conditions: the chunk size and the number of buffered documents.
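As a rough illustration, the two write-path triggers can be sketched as follows. This is a minimal sketch, not Lucene's actual API; the class and method names are assumed for illustration (Lucene's real logic lives in its compressing stored fields writer):

```java
// Minimal sketch of the two conditions that gate a stored-fields chunk
// flush (and hence compression) on the write path. Names are illustrative.
final class ChunkFlushPolicy {
    private final int chunkSizeBytes;   // flush once this many bytes are buffered...
    private final int maxDocsPerChunk;  // ...or once this many documents are buffered

    ChunkFlushPolicy(int chunkSizeBytes, int maxDocsPerChunk) {
        this.chunkSizeBytes = chunkSizeBytes;
        this.maxDocsPerChunk = maxDocsPerChunk;
    }

    boolean shouldFlush(int bufferedBytes, int bufferedDocs) {
        // Compression runs whenever either condition is met.
        return bufferedBytes >= chunkSizeBytes || bufferedDocs >= maxDocsPerChunk;
    }
}
```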

With Hybrid Compression, we would take compression off the write path and store the data as-is in the segments. During merges, once a segment's size has reached a certain threshold, we would compress it.

The goal is to save the compute spent on compression during writes, improving latency and throughput at the cost of additional disk usage.

Describe the solution you'd like

How do we decide when to perform compression?
When a merge between segments is initialized, Lucene estimates the size of each segment to be merged as estimatedMergeBytes. We will expose this value through the SegmentInfo and compare it against our own thresholds.
In OpenSearch, we would initiate compression once a segment has breached these thresholds. The thresholds would be dynamically configurable with an index setting.
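The decision above can be sketched as a small policy class. This is purely illustrative and not the actual POC code; the class, method names, and the volatile-field approach to dynamic updates are assumptions:

```java
// Hypothetical policy (not the actual OpenSearch implementation): decide
// whether to compress a merged segment's stored fields by comparing
// Lucene's estimatedMergeBytes against a dynamically updatable threshold.
final class HybridCompressionPolicy {
    // volatile so a dynamic index-setting update is visible to merge threads
    private volatile long thresholdBytes;

    HybridCompressionPolicy(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    // Called when the (assumed) dynamic index setting changes.
    void updateThreshold(long newThresholdBytes) {
        this.thresholdBytes = newThresholdBytes;
    }

    boolean shouldCompress(long estimatedMergeBytes) {
        return estimatedMergeBytes >= thresholdBytes;
    }
}
```

With a 64 MB threshold, a merge estimated at 32 MB would skip compression and one estimated at 64 MB or more would compress.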

This is a POC implementation of the change in OpenSearch. Since we would be required to create new index codecs, we could direct this change to custom-codecs as well.
Since we do not have estimatedMergeBytes in the SegmentInfos, we would need a change in Lucene as well.
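One possible shape for that plumbing is to record the estimate in a segment's attribute map so the stored-fields format can read it at merge time. This is only a sketch: the attribute key and the mechanism are assumptions, not the actual Lucene change:

```java
import java.util.Map;

// Sketch only: carry estimatedMergeBytes alongside a segment via an
// attribute map. The key name and this plumbing are hypothetical.
final class MergeSizeAttribute {
    static final String KEY = "estimated_merge_bytes"; // hypothetical key

    static void record(Map<String, String> segmentAttributes, long estimatedMergeBytes) {
        segmentAttributes.put(KEY, Long.toString(estimatedMergeBytes));
    }

    static long read(Map<String, String> segmentAttributes) {
        String value = segmentAttributes.get(KEY);
        return value == null ? -1L : Long.parseLong(value); // -1 when absent
    }
}
```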

Which use cases would benefit from Hybrid Compression the most?
Hybrid compression is expected to be useful for search and update use cases, especially when users access or update recently indexed data, since we save the compression/decompression compute.

Benchmarks:

Workload: NYC Taxis
We tested three hybrid compression size thresholds: 16 MB, 32 MB, and 64 MB.
The results below are for 64 MB (which performed best).

| Operation | Codec | Refresh Interval | Disk Throughput Configured | Throughput | Latency | Write IOPS |
|---|---|---|---|---|---|---|
| Index | default | 1s | 593 MB/s | 4.50% | 5.50% | -11% |
| | default | 30s | 593 MB/s | 4.20% | 5% | -220% |
| | default | 30s | 250 MB/s | 6% | 14% | -220% |
| Operation | Codec | Refresh Interval | Disk Throughput Configured | Throughput | Latency | Read IOPS |
|---|---|---|---|---|---|---|
| Update | default | 1s | 593 MB/s | 3.50% | 5.50% | -12% |
| | default | 30s | 593 MB/s | 6% | 8.50% | -300% |
| | default | 30s | 250 MB/s | 15% | 16% | -300% |

Note: +ve means improvement, -ve means degradation relative to the current behaviour.

Variance in storage during the indexing of the NYC Taxis workload

There are steeper dips in disk storage, but it quickly recovers as segments reach the 64 MB size threshold.

Hybrid Compression:
(screenshot: disk storage over time, hybrid compression)

Default Compression:
(screenshot: disk storage over time, default compression)

Benchmarking Setup:

1. 3 dedicated master nodes: r6x.xlarge
2. 1 data node: r6x.2xlarge
3. EBS configuration:
   1. Storage: 1000 GB
   2. IOPS: 3000
   3. Throughput: 593 MB/s and 250 MB/s
