[RFC] [Remote Store] /_remotestore/stats API and _nodes/stats API enhancements for observability on Remote Translog Store upload operations #8311

@BhumikaSaini-Amazon

Note: This RFC will be updated to incorporate feedback as received in the community discussion below

Context

Is your feature request related to a problem? Please describe.

Aligning with #6789, we should be able to query statistics for Remote Translog Store (RTS)-related upload operations.

Describe the solution you'd like
This RFC proposes new statistics for observability on the upload flow of RTS operations. To support this, changes to the existing /_remotestore/stats API contract are also proposed.


Changes in the existing /_remotestore/stats API contract

  1. The stats related to Remote Segment Store (RSS) and Remote Translog Store (RTS) would be tracked under distinct, new keys named segments and translog.
  2. New keys named upload and download will be introduced under both the segments and the translog keys. These will track the stats related to the upload and download flows respectively. Flow-agnostic stats, if any, pertaining to RSS and RTS would be introduced directly under the segments and translog keys respectively.
  3. As a consequence of point 1 and point 2 above, the existing stats for RSS upload flow will be moved under the segments.upload level. New stats for the RSS download flow would be introduced under the segments.download level.
  4. As a consequence of point 1 and point 2 above, the new stats for RTS upload flow will be introduced under the translog.upload level. New stats for the RTS download flow would be introduced under the translog.download level.
  5. The RTS stats for upload flow, download flow, as well as any flow-agnostic stats would inherently be only for the primary copy of a given shard.
  6. The RSS upload stats would inherently be only for the primary copy of a given shard.
  7. The RSS download stats would have a breakdown of download stats per replica shard copy.
  8. As a consequence of point 5, point 6, and point 7 above, when queried at the index level on a given node, stats related to the following will only be returned for the shards for which the node is the primary:
    a. RSS upload flow
    b. All RTS stats
  9. As a consequence of point 5, point 6, and point 7 above, when queried at the index level on a given node, stats related to the following will be returned for all shards of the index on the node:
    a. RSS download flow
  10. If the queried index is not RTS-enabled, the translog object will not be returned. Only the segments object and the relevant metadata (i.e. the shard_id) will be returned.
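The visibility rules in points 5 to 10 can be sketched as a small decision function. This is only an illustration of one reading of the rules above: the function name visible_stat_groups and its parameters are hypothetical, and the assumption that the translog key is omitted entirely for replica copies is an interpretation of point 8, not something the RFC states explicitly.

```python
from typing import Dict, List


def visible_stat_groups(is_primary: bool, rts_enabled: bool) -> Dict[str, List[str]]:
    """Return the flow groups expected under each top-level key for one shard copy.

    Hypothetical helper: mirrors points 5-10 of the proposed contract.
    """
    # RSS download stats are returned for every shard copy on the node (point 9).
    groups: Dict[str, List[str]] = {"segments": ["download"]}

    # RSS upload stats exist only for the primary copy (points 6 and 8).
    if is_primary:
        groups["segments"].insert(0, "upload")

    # All RTS stats exist only for the primary copy (points 5 and 8), and the
    # translog object is absent when the index is not RTS-enabled (point 10).
    # Assumption: for replicas the translog key is omitted rather than empty.
    if rts_enabled and is_primary:
        groups["translog"] = ["upload", "download"]

    return groups
```

For example, a replica copy of an RTS-enabled index would, under this reading, expose only segments.download.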

Statistics to be introduced for RTS uploads

Visibility on local vs. RTS diff

  1. lag
    Represents the number of translog operations not persisted to RTS. This would be relevant for async translog durability.

  2. last_upload_timestamp
    Represents the epoch timestamp of the last successful RTS upload. If an upload fails, this value is not updated and continues to reflect the previous successful upload.

Totals

  1. total_uploads
    Represents the total number of RTS uploads. Eligible sub-fields (based on operation status): started, succeeded, failed.

  2. total_uploads_in_bytes
    Represents the total number of bytes uploaded to the RTS. Eligible sub-fields (based on operation status): started, succeeded, failed.

  3. total_upload_time_in_millis
    Represents the total time spent on RTS uploads.

Performance

  1. upload_size_in_bytes
    Represents the size of data to be uploaded to RTS. Eligible sub-fields: moving_avg.

  2. upload_speed_in_bytes_per_sec
    Represents the speed of RTS uploads in bytes per second. Eligible sub-fields: moving_avg.

  3. upload_latency_in_millis
    Represents the time taken by RTS upload. Eligible sub-fields: moving_avg.
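To make the semantics of the proposed fields concrete, the following is a rough sketch of how a per-shard tracker could maintain them. It is not the OpenSearch implementation. The class and method names, the window-based moving average, and the choice to count only successful uploads toward total_upload_time_in_millis are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, Optional


class MovingAverage:
    """Simple average over the last `window` recorded values (assumed semantics)."""

    def __init__(self, window: int) -> None:
        self._values: Deque[float] = deque(maxlen=window)

    def record(self, value: float) -> None:
        self._values.append(value)

    def average(self) -> float:
        return sum(self._values) / len(self._values) if self._values else 0.0


class TranslogUploadStats:
    """Hypothetical tracker for the translog.upload fields of one primary shard."""

    def __init__(self, window: int = 20) -> None:
        self.uploads: Dict[str, int] = {"started": 0, "succeeded": 0, "failed": 0}
        self.upload_bytes: Dict[str, int] = {"started": 0, "succeeded": 0, "failed": 0}
        self.total_upload_time_in_millis = 0
        self.last_upload_timestamp: Optional[int] = None
        self.size_avg = MovingAverage(window)
        self.speed_avg = MovingAverage(window)
        self.latency_avg = MovingAverage(window)

    def on_started(self, size_in_bytes: int) -> None:
        self.uploads["started"] += 1
        self.upload_bytes["started"] += size_in_bytes
        # upload_size_in_bytes tracks the size of data *to be* uploaded.
        self.size_avg.record(size_in_bytes)

    def on_succeeded(self, size_in_bytes: int, took_millis: int, now_epoch: int) -> None:
        self.uploads["succeeded"] += 1
        self.upload_bytes["succeeded"] += size_in_bytes
        # Assumption: only successful uploads contribute to the total time.
        self.total_upload_time_in_millis += took_millis
        # Only a successful upload moves last_upload_timestamp forward.
        self.last_upload_timestamp = now_epoch
        self.latency_avg.record(took_millis)
        if took_millis > 0:
            self.speed_avg.record(size_in_bytes * 1000 / took_millis)

    def on_failed(self, size_in_bytes: int) -> None:
        self.uploads["failed"] += 1
        self.upload_bytes["failed"] += size_in_bytes
        # last_upload_timestamp is deliberately left untouched on failure.
```

Whether the moving averages are window-based or exponentially weighted, and what window size applies, is left open by the text above.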


API design

Base Path

GET /_remotestore/stats

Supported path parameters

  1. Name of RTS-enabled index (required)
  2. Shard ID for RTS-enabled index (optional)

Supported query parameters

  1. local - Retrieves stats only for the shards on the coordinating node.
  2. Default (no parameters) - Retrieves stats for all the shards of the index across the participating nodes.
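As a quick illustration of how the path and query parameters combine, the request path could be assembled as below. The helper name remote_store_stats_path is hypothetical, and passing the local flag as local=true is an assumption about the query-string form.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def remote_store_stats_path(index: str,
                            shard_id: Optional[int] = None,
                            local: bool = False) -> str:
    """Build the request path for GET /_remotestore/stats (hypothetical helper)."""
    if not index:
        # The index name is a required path parameter.
        raise ValueError("index name is required")

    path = "/_remotestore/stats/" + quote(index, safe="")
    if shard_id is not None:
        # The shard ID is an optional second path parameter.
        path += f"/{shard_id}"
    if local:
        # Assumption: the local flag is sent as ?local=true.
        path += "?" + urlencode({"local": "true"})
    return path
```

Calling it with only an index name yields the index-level path, and adding a shard ID yields the shard-level path used in the examples that follow.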

Shard-level stats for RTS-enabled index

Path:

GET /_remotestore/stats/<index>/<shardId>

Response:

{
    "shard_id" : "[my-index-1][0]",
    "segments": {
       <RSS flow-agnostic stats here>
       "upload" : {
            "refresh_time_lag_in_millis": 5727,
            "refresh_lag": 1,
            "bytes_lag": 0,
            "backpressure_rejection_count": 0,
            "consecutive_failure_count": 0,
            "total_remote_refresh": {
                "started": 57,
                "succeeded": 56,
                "failed": 0
            },
            "total_uploads_in_bytes": {
                "started": 1568138701,
                "succeeded": 1568138701,
                "failed": 0
            },
            "remote_refresh_size_in_bytes": {
                "last_successful": 12705142,
                "moving_avg": 32766119.75
            },
            "upload_latency_in_bytes_per_sec": {
                "moving_avg": 25523682.95
            },
            "remote_refresh_latency_in_millis": {
                "moving_avg": 990.55
            }
        },
       "download" : [
            <new RSS download flow stats here>
        ]
    },
    "translog": {
       <RTS flow-agnostic stats here>
       "upload" : {
            "lag": 2,
            "last_upload_timestamp": 1687941312,
            "total_uploads": {
                "started": 98,
                "succeeded": 96,
                "failed": 0
            },
            "total_uploads_in_bytes": {
                "started": 246465,
                "succeeded": 236647,
                "failed": 0
            },
            "total_upload_time_in_millis": 900,
            "upload_size_in_bytes": {
                "moving_avg": 236.75
            },
            "upload_speed_in_bytes_per_sec": {
                "moving_avg": 211.95
            },
            "upload_latency_in_millis": {
                "moving_avg": 70.55
            }
        },
       "download" : [
            <new RTS download flow stats here>
        ]
    }
}
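A consumer of the shard-level response might derive a few quantities from the translog.upload object, as in the sketch below. The function name is hypothetical, and the derived values rest on the assumption that every started upload eventually ends up as either succeeded or failed, so that the difference represents uploads still in progress.

```python
from typing import Any, Dict, Optional


def summarize_translog_upload(shard_stats: Dict[str, Any]) -> Optional[Dict[str, int]]:
    """Derive in-progress counts from one shard entry (hypothetical helper)."""
    translog = shard_stats.get("translog")
    if translog is None:
        # The translog object is absent when the index is not RTS-enabled.
        return None

    upload = translog["upload"]
    uploads = upload["total_uploads"]
    sizes = upload["total_uploads_in_bytes"]
    return {
        # Assumption: started = succeeded + failed + still in progress.
        "uploads_in_progress": uploads["started"] - uploads["succeeded"] - uploads["failed"],
        "bytes_in_progress": sizes["started"] - sizes["succeeded"] - sizes["failed"],
        "lag": upload["lag"],
    }


# Values copied from the translog.upload object in the sample response above.
sample = {
    "shard_id": "[my-index-1][0]",
    "translog": {
        "upload": {
            "lag": 2,
            "total_uploads": {"started": 98, "succeeded": 96, "failed": 0},
            "total_uploads_in_bytes": {"started": 246465, "succeeded": 236647, "failed": 0},
        }
    },
}
print(summarize_translog_upload(sample))
```

With the sample values this reports 2 uploads and 9818 bytes still in progress, alongside a lag of 2 operations.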

Index-level stats for RTS-enabled index

Path:

GET /_remotestore/stats/<index>

Response:

{
    {
        "shard_id" : "[my-index-1][0]",
        "segments": {
            <RSS flow-agnostic stats here>
            "upload" : {
                "refresh_time_lag_in_millis": 5727,
                "refresh_lag": 1,
                "bytes_lag": 0,
                "backpressure_rejection_count": 0,
                "consecutive_failure_count": 0,
                "total_remote_refresh": {
                    "started": 57,
                    "succeeded": 56,
                    "failed": 0
                },
                "total_uploads_in_bytes": {
                    "started": 1568138701,
                    "succeeded": 1568138701,
                    "failed": 0
                },
                "remote_refresh_size_in_bytes": {
                    "last_successful": 12705142,
                    "moving_avg": 32766119.75
                },
                "upload_latency_in_bytes_per_sec": {
                    "moving_avg": 25523682.95
                },
                "remote_refresh_latency_in_millis": {
                    "moving_avg": 990.55
                }
            },
            "download" : [
                <new RSS download flow stats here>
            ]
        },
        "translog": {
            <RTS flow-agnostic stats here>
            "upload" : {
                "lag": 2,
                "last_upload_timestamp": 1687941312,
                "total_uploads": {
                    "started": 98,
                    "succeeded": 96,
                    "failed": 0
                },
                "total_uploads_in_bytes": {
                    "started": 246465,
                    "succeeded": 236647,
                    "failed": 0
                },
                "total_upload_time_in_millis": 900,
                "upload_size_in_bytes": {
                    "moving_avg": 236.75
                },
                "upload_speed_in_bytes_per_sec": {
                    "moving_avg": 211.95
                },
                "upload_latency_in_millis": {
                    "moving_avg": 70.55
                }
            },
            "download" : [
                <new RTS download flow stats here>
            ]
        }
    },
    
    ...,
    
    {
        "shard_id" : "[my-index-1][N]",
        "segments": {
            <RSS flow-agnostic stats here>
            "upload" : {
                "refresh_time_lag_in_millis": 5727,
                "refresh_lag": 1,
                "bytes_lag": 0,
                "backpressure_rejection_count": 0,
                "consecutive_failure_count": 0,
                "total_remote_refresh": {
                    "started": 57,
                    "succeeded": 56,
                    "failed": 0
                },
                "total_uploads_in_bytes": {
                    "started": 1568138701,
                    "succeeded": 1568138701,
                    "failed": 0
                },
                "remote_refresh_size_in_bytes": {
                    "last_successful": 12705142,
                    "moving_avg": 32766119.75
                },
                "upload_latency_in_bytes_per_sec": {
                    "moving_avg": 25523682.95
                },
                "remote_refresh_latency_in_millis": {
                    "moving_avg": 990.55
                }
            },
            "download" : [
                <new RSS download flow stats here>
            ]
        },
        "translog": {
            <RTS flow-agnostic stats here>
            "upload" : {
                "lag": 2,
                "last_upload_timestamp": 1687941312,
                "total_uploads": {
                    "started": 98,
                    "succeeded": 96,
                    "failed": 0
                },
                "total_uploads_in_bytes": {
                    "started": 246465,
                    "succeeded": 236647,
                    "failed": 0
                },
                "total_upload_time_in_millis": 900,
                "upload_size_in_bytes": {
                    "moving_avg": 236.75
                },
                "upload_speed_in_bytes_per_sec": {
                    "moving_avg": 211.95
                },
                "upload_latency_in_millis": {
                    "moving_avg": 70.55
                }
            },
            "download" : [
                <new RTS download flow stats here>
            ]
        }
    }
}
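At the index level, a typical use of these stats would be to spot shards whose translog is falling behind. The sketch below assumes the per-shard entries of the response have already been parsed into a Python list of dictionaries. The outer container shape, the function name, and the threshold parameter are assumptions for illustration only.

```python
from typing import Any, Dict, List


def shards_with_translog_lag(shards: List[Dict[str, Any]], max_lag: int) -> List[str]:
    """Return shard_id values whose translog.upload.lag exceeds max_lag (hypothetical helper)."""
    lagging: List[str] = []
    for entry in shards:
        translog = entry.get("translog")
        if translog is None:
            # Entries without a translog object (RTS disabled) are skipped.
            continue
        if translog["upload"]["lag"] > max_lag:
            lagging.append(entry["shard_id"])
    return lagging
```

Because the translog object is only present for primaries of RTS-enabled indices, the helper skips any entry that lacks it rather than treating it as an error.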

Shard-level stats for RTS-disabled but RSS-enabled index

Path:

GET /_remotestore/stats/<index>/<shardId>

Response:

{
    "shard_id" : "[my-index-1][0]",
    "segments": {
       <RSS flow-agnostic stats here>
       "upload" : {
            "refresh_time_lag_in_millis": 5727,
            "refresh_lag": 1,
            "bytes_lag": 0,
            "backpressure_rejection_count": 0,
            "consecutive_failure_count": 0,
            "total_remote_refresh": {
                "started": 57,
                "succeeded": 56,
                "failed": 0
            },
            "total_uploads_in_bytes": {
                "started": 1568138701,
                "succeeded": 1568138701,
                "failed": 0
            },
            "remote_refresh_size_in_bytes": {
                "last_successful": 12705142,
                "moving_avg": 32766119.75
            },
            "upload_latency_in_bytes_per_sec": {
                "moving_avg": 25523682.95
            },
            "remote_refresh_latency_in_millis": {
                "moving_avg": 990.55
            }
        },
       "download" : [
            <new RSS download flow stats here>
        ]
    }
}

Index-level stats for RTS-disabled but RSS-enabled index

Path:

GET /_remotestore/stats/<index>

Response:

{
    {
        "shard_id" : "[my-index-1][0]",
        "segments": {
            <RSS flow-agnostic stats here>
            "upload" : {
                "refresh_time_lag_in_millis": 5727,
                "refresh_lag": 1,
                "bytes_lag": 0,
                "backpressure_rejection_count": 0,
                "consecutive_failure_count": 0,
                "total_remote_refresh": {
                    "started": 57,
                    "succeeded": 56,
                    "failed": 0
                },
                "total_uploads_in_bytes": {
                    "started": 1568138701,
                    "succeeded": 1568138701,
                    "failed": 0
                },
                "remote_refresh_size_in_bytes": {
                    "last_successful": 12705142,
                    "moving_avg": 32766119.75
                },
                "upload_latency_in_bytes_per_sec": {
                    "moving_avg": 25523682.95
                },
                "remote_refresh_latency_in_millis": {
                    "moving_avg": 990.55
                }
            },
            "download" : [
                <new RSS download flow stats here>
            ]
        }
    },
    
    ...,
    
    {
        "shard_id" : "[my-index-1][N]",
        "segments": {
            <RSS flow-agnostic stats here>
            "upload" : {
                "refresh_time_lag_in_millis": 5727,
                "refresh_lag": 1,
                "bytes_lag": 0,
                "backpressure_rejection_count": 0,
                "consecutive_failure_count": 0,
                "total_remote_refresh": {
                    "started": 57,
                    "succeeded": 56,
                    "failed": 0
                },
                "total_uploads_in_bytes": {
                    "started": 1568138701,
                    "succeeded": 1568138701,
                    "failed": 0
                },
                "remote_refresh_size_in_bytes": {
                    "last_successful": 12705142,
                    "moving_avg": 32766119.75
                },
                "upload_latency_in_bytes_per_sec": {
                    "moving_avg": 25523682.95
                },
                "remote_refresh_latency_in_millis": {
                    "moving_avg": 990.55
                }
            },
            "download" : [
                <new RSS download flow stats here>
            ]
        }
    }
}

Related information

  1. [Draft] Identify stats for remote store feature #6789
  2. [RFC] [Remote Store] Remote Store Stats API #7153
  3. https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/remote-store/remote-store-stats-api/
