
[FEATURE] Use Pluggable translog for fetching the operations from leader  #375

@saikaranam-amazon

What are you proposing?
We're working on using segment replication for CCR and plan to make it the default choice for replication. However, we don't plan to deprecate logical CCR for now.

To support logical replication in the long run, we propose relying on the pluggable translog to fetch the operations for logical CCR. More details here.
Here are the key points:

  • We'll rely on the Translog Manager to provide the changes to be replayed on the follower.
  • We'll fetch the changes only from the primary shard of the leader (except when the leader is using logical replication).
  • If a user opts for no durability, we won't support replication.
  • If the leader cluster has a remote translog configured, we'll fetch the translog directly from the remote store.

Why support logical replication?

  • Customers may not want to use segment-based replication.
  • Segment replication can have high network usage in cross-region cases.
  • We may want to support active-active replication.
  • It allows replication across different OpenSearch versions.
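
Local replication on leader | Translogs | Plans to be supported in near future | Source
----------------------------|-----------|--------------------------------------|----------------------
Logical                     | Local     | yes                                  | primary and replica
Logical                     | Remote    | no                                   | primary
Logical                     | No-op     | no                                   | can fetch from Lucene
Segment                     | Local     | yes                                  | primary
Segment                     | Remote    | yes                                  | primary's remote
Segment                     | No-op     | yes                                  | can't fetch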
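
The table above can be read as a lookup keyed by the leader's local replication type and its translog type. The following is only an illustrative sketch of that lookup: the class, enum, and record names are hypothetical and are not part of OpenSearch or the CCR plugin, and the cell values are copied from the table as written.

```java
import java.util.Map;

// Illustrative only: every name here is hypothetical, not an OpenSearch or CCR API.
class ChangeSourceMatrix {
    enum LeaderReplication { LOGICAL, SEGMENT }
    enum TranslogKind { LOCAL, REMOTE, NO_OP }

    /** One table row: whether near-term support is planned, and the listed source. */
    record Row(boolean plannedNearTerm, String source) {}

    // Cell values are taken verbatim from the table in this issue.
    private static final Map<LeaderReplication, Map<TranslogKind, Row>> MATRIX = Map.of(
        LeaderReplication.LOGICAL, Map.of(
            TranslogKind.LOCAL, new Row(true, "primary and replica"),
            TranslogKind.REMOTE, new Row(false, "primary"),
            TranslogKind.NO_OP, new Row(false, "can fetch from Lucene")),
        LeaderReplication.SEGMENT, Map.of(
            TranslogKind.LOCAL, new Row(true, "primary"),
            TranslogKind.REMOTE, new Row(true, "primary's remote"),
            TranslogKind.NO_OP, new Row(true, "can't fetch")));

    /** Returns the table row for a given leader configuration. */
    static Row lookup(LeaderReplication replication, TranslogKind translog) {
        return MATRIX.get(replication).get(translog);
    }

    public static void main(String[] args) {
        // Print every combination in table order.
        for (LeaderReplication r : LeaderReplication.values()) {
            for (TranslogKind t : TranslogKind.values()) {
                Row row = lookup(r, t);
                System.out.printf("%-8s %-7s planned=%-5s source=%s%n",
                    r, t, row.plannedNearTerm(), row.source());
            }
        }
    }
}
```

Keeping the matrix as data rather than nested conditionals would make it easy to update a row as support plans change.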

How did you come up with this proposal?
Follow up from opensearch-project/OpenSearch#1100.


What is the user experience going to be?
Cross-cluster replication (CCR) simplifies the process of copying data between multiple clusters. Users can use CCR to enable a remote cluster for the purpose of Disaster Recovery or for data proximity.

Currently, CCR uses logical replication to copy data from the leader index to the follower index. To do so, it fetches the operations performed on the leader index from the translog.

Benefits of doing this change
The proposed feature does not fundamentally change the CCR experience, but it follows better design principles and practices that help ensure compatibility with future engine changes. Moving the fetch operation to the pluggable translog also opens the door to active-active replication, replication between incompatible OpenSearch versions, and upgrades of the leader or follower index without breaking ongoing replication.

The ability to fetch operations directly from the translog manager lets us make CCR agnostic to the replication mechanism used inside the cluster, which matters as we also add support for segment replication.
We'll also build support for the remote translog on top of it.
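
To make the "agnostic" point concrete, here is a minimal sketch of a follower-side fetch loop that depends only on an abstract operation source. It assumes operations are addressed by sequence number ranges. `OperationSource`, `Op`, `InMemoryTranslogSource`, and `replay` are hypothetical names made up for illustration; they are not the translog manager's actual interface.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: all types here are hypothetical, not OpenSearch or CCR APIs.
class AgnosticFetchSketch {
    /** A leader operation identified by its sequence number. */
    record Op(long seqNo, String payload) {}

    /** Anything that can return leader operations for an inclusive seqNo range. */
    interface OperationSource {
        List<Op> read(long fromSeqNo, long toSeqNo);
    }

    /** In-memory stand-in for a local translog; a remote-store source could implement the same interface. */
    static final class InMemoryTranslogSource implements OperationSource {
        private final List<Op> ops;

        InMemoryTranslogSource(List<Op> ops) {
            this.ops = List.copyOf(ops);
        }

        @Override
        public List<Op> read(long fromSeqNo, long toSeqNo) {
            List<Op> out = new ArrayList<>();
            for (Op op : ops) {
                if (op.seqNo() >= fromSeqNo && op.seqNo() <= toSeqNo) {
                    out.add(op);
                }
            }
            return out;
        }
    }

    /**
     * Fetches operations after the follower checkpoint in batches and appends them to applied.
     * Returns the new follower checkpoint. The loop never inspects how the source stores data.
     */
    static long replay(OperationSource source, long checkpoint, long leaderMaxSeqNo,
                       int batchSize, List<Op> applied) {
        long next = checkpoint + 1;
        while (next <= leaderMaxSeqNo) {
            long to = Math.min(next + batchSize - 1, leaderMaxSeqNo);
            applied.addAll(source.read(next, to));
            next = to + 1;
        }
        return next - 1;
    }

    public static void main(String[] args) {
        List<Op> leaderOps = new ArrayList<>();
        for (long i = 0; i < 10; i++) {
            leaderOps.add(new Op(i, "doc-" + i));
        }
        List<Op> applied = new ArrayList<>();
        long checkpoint = replay(new InMemoryTranslogSource(leaderOps), 3, 9, 4, applied);
        System.out.println("applied=" + applied.size() + " checkpoint=" + checkpoint);
    }
}
```

Under this shape, switching the leader between a local and a remote translog would mean supplying a different `OperationSource` implementation while the replay loop stays unchanged.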


Why should it be built? Any reason not to?
This needs to be built so that we can keep supporting logical CCR from 3.0 onwards.
The only reason not to build it would be if we wanted to rely solely on segment replication for CCR.

What will it take to execute?
The changes in the PR should solve the problem for now.
In the future, when we start relying on remote translogs, we'll need to add support for fetching the operations directly from the leader shard's remote store.

What are remaining open questions?
N/A


Is your feature request related to a problem?
Provide an extension point for translog fetch operations under the OpenSearch engine:

  • CCR needs these operations to address the performance issues detailed in "Translog pruning based on retention leases" (OpenSearch#1100). From 2.x, peer recovery moved to Lucene based on soft deletes, and translog fetch operations are deprecated. This issue tracks the tasks to expose the extension point in core and use it in CCR.
