OpenSearch Segment Replication [RFC] #1694

@CEHENKLE

RFC - OpenSearch Segment Replication

Overview

The purpose of this document is to propose a new shard replication strategy (segment replication) and outline its tradeoffs with the existing approach (document replication). This is not a design document; it is a high level proposal to garner feedback from the OpenSearch community. It is also pertinent to note that the aim of this change is not to replace document replication with segment replication.

This RFC is part of ongoing changes to revamp OpenSearch’s storage and data retrieval architecture, and as such it focuses only on replication strategies. We would subsequently like to cover updates to the transaction log (translog) and the storage abstraction. As always, if you have ideas on how to improve these components or any others, we welcome your RFCs.

Document Replication

All operations that affect an OpenSearch index (i.e., adding, updating, or removing documents) are routed to one of the index’s primary shards. The primary shard is then responsible for validating the operation and subsequently executing it locally. Once the operation has been completed successfully on the primary shard, the operation is forwarded to each of its replica shards in parallel. Each replica shard executes the operation, duplicating the processing performed on the primary shard. We refer to this replication strategy as Document Replication.

When the operation has completed (either successfully or with a failure) on every replica and a response has been received by the primary shard, the primary shard then responds to the client. This response includes information about replication success/failure to replica shards.
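The write path above can be sketched as a toy model. This is a hypothetical illustration, not OpenSearch internals: the `Shard` class and `replicate_document` function are invented here to show the shape of the flow, in which replicas repeat the exact indexing work the primary already performed.

```python
class Shard:
    """Toy stand-in for a primary or replica shard."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def validate(self, op):
        if "id" not in op:
            raise ValueError("operation must carry a document id")

    def index(self, op):
        # The full indexing work; in Document Replication every
        # replica repeats this step.
        self.docs[op["id"]] = op["doc"]


def replicate_document(op, primary, replicas):
    """Primary validates and indexes first; replicas then repeat the work.

    Returns per-replica success/failure, which the primary includes in
    its response to the client.
    """
    primary.validate(op)
    primary.index(op)
    results = {}
    for replica in replicas:  # forwarded in parallel in practice
        try:
            replica.index(op)
            results[replica.name] = "ok"
        except Exception as exc:
            results[replica.name] = f"failed: {exc}"
    return results
```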

Segment Replication

In segment replication, we copy Lucene segment files from the primary shard to its replicas instead of having replicas re-execute the operation. Since Lucene uses a write-once segmented architecture (see FAQ # 1), only new segment files will need to be copied, since the existing ones will never change. This mechanism will adapt the NRT replication capabilities introduced in Lucene 6.0. We can also ensure that segment merging (see FAQ # 2) is only performed by the primary shard. Merged segments can then be copied over to replicas through Lucene’s warming APIs.
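Because segments are write-once, deciding what to ship to a replica reduces to a set difference over file names: any file the replica already has will never change, so only the new files are copied. A minimal sketch (the file names are illustrative, and real Lucene NRT replication also handles commit points and file verification, which this omits):

```python
def files_to_copy(primary_files, replica_files):
    """Write-once segments: existing files never change, so the set
    difference of file names is exactly what must be shipped over
    the wire to bring the replica up to date."""
    return sorted(set(primary_files) - set(replica_files))
```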

Note that this does not remove the need for a transaction log (see FAQ # 3) since segment replication does not force a commit to disk. We will retain the existing translog implementation, though this may present some caveats (see FAQ # 4). As noted in the overview, updates to the translog architecture will be covered by an upcoming RFC.

Feedback Requested (examples):

  1. Are there use cases for which segment replication cannot work? Or alternatively, do you have use cases where segment replication is the ideal solution?
  2. What consistency and reliability properties would you like to see?
  3. How would you prefer to configure the replication strategy: per index, or for the cluster as a whole?
  4. Does point-in-time consistency matter to you? What use cases benefit from it? What are the tradeoffs of using segment replication but having each shard perform segment merging independently?
  5. Are there any areas that are not captured by the tradeoffs discussed below?

Tradeoffs

Performance

In Document Replication, all replicas must perform the same redundant processing as the primary shard. This can become a bottleneck, especially when searches run concurrently with intensive background operations such as segment merging. The problem is exacerbated on clusters that need many replicas to support high query rates.

Segment Replication avoids this redundant processing: replicas do not re-execute the operation, which reduces CPU and memory utilization and improves performance. As an illustrative example, Yelp reported a 30% improvement in throughput after moving to a segment-replication-based approach.

That said, this improvement in throughput may come at the cost of increased network bandwidth usage. In Segment Replication, segment files must be copied from the primary shard to all replicas over the wire, in addition to managing the translog. Merged segments must also be copied from the primary to all replicas. This is an increase in data sent over the network between shards compared to Document Replication, where only the original request is forwarded from the primary to replicas; each replica is then responsible for its own operation processing and segment merging, with no further interaction between shards across the network.

In Segment Replication, refresh latency (the delay before documents become visible for search) may also increase, because segment files must first be copied to replicas before the replicas can open new searchers.

As part of the design, we will validate these hypotheses and explore ways to offset any additional increase in bandwidth.

Recovery

Recovery using Document Replication can be slow. When a replica goes down for a while and then comes back up, it must either replay (reindex) every operation that arrived while it was down, which can easily be a very large number and take a long time, or fall back to a costly full index copy. In comparison, Segment Replication typically only needs to copy the new index files.

Failover

Failover (promotion of a replica to a primary) in Segment Replication can be slower than in Document Replication in the worst case, i.e., when the primary shard fails before segments have been copied over to replicas. In this situation, the replicas will need to fall back to replaying operations from the translog.

Shard Allocation

With Segment Replication, the primary shard performs more work than the replicas, so the shard allocation algorithm would need to change to ensure the cluster does not become imbalanced. On failover, a promoted replica may suddenly do much more work (for example, if the translog must be replayed), causing hotspots. In contrast, Document Replication does not have such complications, since all shards perform an identical, though redundant, amount of work.

Consistency

Both replication strategies prioritize consistency over availability. Document Replication is more susceptible to availability drops due to transient errors, which occur when a replica operation throws an exception after the primary shard has successfully processed the operation. However, Segment Replication will require more guardrails to ensure file validity when segment files are copied over: because only the primary shard performs indexing and merge operations, a broken file written by the primary would be copied to all replicas. Further, replicas will need mechanisms to verify the consistency of segment files transferred over the network.

Segment Replication also offers point-in-time consistency, since it ensures that Lucene indexes (i.e., OpenSearch shards) are identical throughout a replication group (a primary shard and all its replicas). This ensures that performance across shards remains consistent. In Document Replication, the primary and replicas operate on and merge segment files independently, so they may be searching slightly different and incomparable views of the index (though the data contained within is identical).

Next steps

Based on the feedback from this RFC, we’ll be drafting a design (link to be added). We look forward to collaborating with the community.

FAQs

1. How is OpenSearch Indexing structured?
An OpenSearch index is composed of multiple shards, each of which is a Lucene index. Each Lucene index is, in turn, made up of one or more immutable index segments, and each segment is essentially a "mini-index" in itself. When you perform a search on OpenSearch, each shard uses Lucene to search across its segments, filter out any deletions, and merge the results from all segments.
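The per-shard search can be modeled with a toy sketch, where each segment is just a mapping of document IDs to text. This is an invented illustration of the mechanics, not Lucene's actual data structures:

```python
def search_shard(segments, deleted_ids, term):
    """Toy per-shard search: visit every immutable segment, skip
    documents that are only marked deleted, and merge the hits."""
    hits = []
    for segment in segments:  # Lucene searches segments in sequence
        for doc_id, text in segment.items():
            if doc_id in deleted_ids:
                continue  # deletions are filtered out, not yet erased
            if term in text:
                hits.append(doc_id)
    return sorted(hits)
```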

2. What is segment merging?
Due to Lucene's "write-once" semantics, segment files accumulate as data is indexed. Also, deleting a document only marks it as deleted and does not immediately free the space it occupies. Searching through a large number of segments is inefficient because Lucene must search them in sequence, not in parallel. Thus, from time to time, Lucene automatically merges smaller segments into larger ones to keep search efficient and to permanently remove deleted documents. Merging is a performance-intensive operation.
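Continuing the toy segment model, a merge can be sketched as combining several small segments into one larger segment while permanently dropping documents that were only marked deleted. This is an invented simplification: real Lucene merges rewrite postings, stored fields, and other index structures, which is why merging is expensive.

```python
def merge_segments(segments, deleted_ids):
    """Combine small segments into one larger segment, permanently
    dropping documents that were only marked deleted. The input
    segments are discarded once the merged segment is in place."""
    merged = {}
    for segment in segments:
        for doc_id, doc in segment.items():
            if doc_id not in deleted_ids:
                merged[doc_id] = doc
    return merged
```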

3. What is the translog (transaction log)?
OpenSearch uses a transaction log (a.k.a. translog) to store all operations that have been performed on the shard but have not yet been committed to the file system. Each shard, whether primary or replica, has its own translog, which is committed to disk every time an operation is recorded.

An OpenSearch flush is the process of performing a Lucene commit and starting a new translog generation. In the event of a crash, recent operations that are not yet included in the last Lucene commit are replayed from the translog when the shard recovers to bring it up to date.
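The commit-and-replay relationship can be sketched as follows. The `Translog` class and `recover` function are hypothetical simplifications (real translogs track generations, checkpoints, and fsync policies); the point is only that operations with a sequence number beyond the last Lucene commit are replayed on recovery.

```python
class Translog:
    """Minimal model: every operation is appended before it is
    acknowledged, so it survives a crash of in-memory state."""

    def __init__(self):
        self.entries = []  # (seq_no, doc_id, doc) tuples

    def append(self, seq_no, doc_id, doc):
        self.entries.append((seq_no, doc_id, doc))


def recover(committed_docs, translog, last_committed_seq_no):
    """Rebuild shard state: start from the last Lucene commit, then
    replay only the translog operations newer than that commit."""
    docs = dict(committed_docs)
    for seq_no, doc_id, doc in translog.entries:
        if seq_no > last_committed_seq_no:
            docs[doc_id] = doc
    return docs
```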

4. Is there a caveat of keeping the existing translog implementation when Segment Replication is in use?
Yes. Since we now perform two separate operations on replicas (segment replication and translog updates), we cannot guarantee the ordering of these operations together. Thus, there is a possibility that the translog and the segment files on a replica will be out of sync with each other.

Note that with segment replication, there is no need for a translog on replica shards since they receive segment files directly from their primary shards. However, there is still a need for a translog per replication group in case the primary shard becomes unavailable. Strategies to address this will be covered by an upcoming RFC.

5. Will it be possible to change an existing index’s replication strategy?
No, it is not possible to switch an existing index’s replication strategy between Document Replication and Segment Replication. Segment Replication assumes that all shards in a replication group contain identical segment files. This is not a requirement for Document Replication, since each shard creates and merges segments independently.
