[RFC] [Remote Store] /_remotestore/stats API and _nodes/stats API enhancements for observability on Remote Translog Store upload operations #8311
Description
Note: This RFC will be updated to incorporate feedback as received in the community discussion below
Table of Contents
- Context
- Changes in the existing /_remotestore/stats API contract
- Statistics to be introduced for RTS uploads
- API design
- Related information
Context
Is your feature request related to a problem? Please describe.
Aligning with #6789, we should be able to query statistics for Remote Translog Store (RTS)-related upload operations.
Describe the solution you'd like
This RFC proposes new statistics for observability on the upload flow of RTS operations. To support them, it also proposes changes to the existing /_remotestore/stats API contract.
Changes in the existing /_remotestore/stats API contract
1. The stats related to Remote Segment Store (RSS) and Remote Translog Store (RTS) would be tracked under distinct, new keys named segments and translog.
2. New keys named upload and download will be introduced under the segments and the translog keys. These will track the stats related to the upload and download flows respectively. Flow-agnostic stats, if any, pertaining to RSS and RTS would be introduced directly under the segments and translog keys respectively.
3. As a consequence of point 1 and point 2 above, the existing stats for the RSS upload flow will be moved under the segments.upload level. New stats for the RSS download flow would be introduced under the segments.download level.
4. As a consequence of point 1 and point 2 above, the new stats for the RTS upload flow will be introduced under the translog.upload level. New stats for the RTS download flow would be introduced under the translog.download level.
5. The RTS stats for the upload flow, the download flow, and any flow-agnostic stats would inherently be only for the primary copy of a given shard.
6. The RSS upload stats would inherently be only for the primary copy of a given shard.
7. The RSS download stats would have a breakdown of download stats per replica shard copy.
8. As a consequence of point 5, point 6, and point 7 above, when queried at the index level on a given node, stats related to the following will only be returned for the shards for which the node is the primary:
   a. RSS upload flow
   b. All RTS stats
9. As a consequence of point 5, point 6, and point 7 above, when queried at the index level on a given node, stats related to the following will be returned for all shards of the index on the node:
   a. RSS download flow
10. If the queried index is not RTS-enabled, the translog object will not be returned. Only the segments object and the relevant metadata (i.e. the shard_id) will be returned.
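One consequence of the contract above is that a consumer has to read flow-specific stats from nested keys and tolerate a missing translog object. A minimal Python sketch of that lookup, assuming the response has already been parsed into a dict; the helper name get_flow_stats is hypothetical and not part of the proposal:

```python
from typing import Any, Dict, Optional


def get_flow_stats(shard_stats: Dict[str, Any], store: str, flow: str) -> Optional[Any]:
    """Return the stats under <store>.<flow>, or None if that level is absent.

    store is "segments" or "translog"; flow is "upload" or "download".
    """
    store_stats = shard_stats.get(store)
    if store_stats is None:
        # e.g. the translog object is omitted when the index is not RTS-enabled
        return None
    return store_stats.get(flow)


# Trimmed-down shard entry for an RSS-enabled, RTS-disabled index.
rss_only_shard = {
    "shard_id": "[my-index-1][0]",
    "segments": {"upload": {"refresh_lag": 1, "bytes_lag": 0}, "download": []},
}

print(get_flow_stats(rss_only_shard, "segments", "upload"))
print(get_flow_stats(rss_only_shard, "translog", "upload"))
```

Returning None instead of raising keeps callers from special-casing RTS-disabled indices before every lookup.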
Statistics to be introduced for RTS uploads
Visibility on local vs. RTS diff
- lag
  Represents the number of translog operations not persisted to RTS. This would be relevant for async translog durability.
- last_upload_timestamp
  Represents the epoch timestamp of the last successful RTS upload. If an upload fails, this value is not updated to the timestamp of the failed upload.
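To illustrate how these two stats are expected to behave on success and failure, here is a small Python sketch. The class and method names are hypothetical, and the way lag is decremented is an assumption made for illustration; the RFC only defines what the values represent.

```python
from dataclasses import dataclass


@dataclass
class UploadVisibility:
    """Hypothetical holder for the two visibility stats described above."""

    lag: int = 0                    # translog operations not yet persisted to RTS
    last_upload_timestamp: int = 0  # epoch time of the last successful upload

    def on_operations_added(self, count: int) -> None:
        # New local translog operations increase the lag.
        self.lag += count

    def on_upload_finished(self, uploaded_ops: int, finished_at: int, succeeded: bool) -> None:
        if not succeeded:
            # A failed upload leaves both the lag and the timestamp untouched.
            return
        self.lag = max(0, self.lag - uploaded_ops)
        self.last_upload_timestamp = finished_at


stats = UploadVisibility()
stats.on_operations_added(5)
stats.on_upload_finished(uploaded_ops=3, finished_at=1687941312, succeeded=True)
stats.on_upload_finished(uploaded_ops=2, finished_at=1687941400, succeeded=False)
print(stats.lag, stats.last_upload_timestamp)  # 2 1687941312
```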
Totals
- total_uploads
  Represents the total number of RTS uploads. Eligible sub-fields (based on operation status): started, succeeded, failed.
- total_uploads_in_bytes
  Represents the total number of bytes uploaded to the RTS. Eligible sub-fields (based on operation status): started, succeeded, failed.
- total_upload_time_in_millis
  Represents the total time spent on RTS uploads.
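The started, succeeded, and failed sub-fields can be combined on the client side into derived values. The Python sketch below assumes that every started upload is eventually counted as either succeeded or failed, which the RFC does not state explicitly; the function names are hypothetical.

```python
from typing import Dict


def in_flight(counter: Dict[str, int]) -> int:
    """Uploads (or bytes) started but not yet counted as succeeded or failed."""
    return counter["started"] - counter["succeeded"] - counter["failed"]


def failure_ratio(counter: Dict[str, int]) -> float:
    """Share of finished uploads that failed; 0.0 when nothing has finished."""
    finished = counter["succeeded"] + counter["failed"]
    return counter["failed"] / finished if finished else 0.0


# Sample values taken from the example responses in this RFC.
total_uploads = {"started": 98, "succeeded": 96, "failed": 0}
total_uploads_in_bytes = {"started": 246465, "succeeded": 236647, "failed": 0}

print(in_flight(total_uploads))           # 2
print(in_flight(total_uploads_in_bytes))  # 9818
print(failure_ratio(total_uploads))       # 0.0
```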
Performance
- upload_size_in_bytes
  Represents the size of data to be uploaded to RTS. Eligible sub-fields: moving_avg.
- upload_speed_in_bytes_per_sec
  Represents the speed of RTS uploads in bytes per second. Eligible sub-fields: moving_avg.
- upload_latency_in_millis
  Represents the time taken by RTS upload. Eligible sub-fields: moving_avg.
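The RFC does not specify how moving_avg is computed. As one possible interpretation, the Python sketch below keeps a simple average over the most recent N samples; the class name and the window size are assumptions chosen for illustration.

```python
from collections import deque


class MovingAverage:
    """Simple moving average over the last `window_size` recorded values."""

    def __init__(self, window_size: int) -> None:
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self._window = deque(maxlen=window_size)
        self._sum = 0.0

    def record(self, value: float) -> None:
        # When the window is full, the oldest value is about to be evicted.
        if len(self._window) == self._window.maxlen:
            self._sum -= self._window[0]
        self._window.append(value)
        self._sum += value

    def average(self) -> float:
        return self._sum / len(self._window) if self._window else 0.0


upload_latency_in_millis = MovingAverage(window_size=3)
for latency in (60.0, 70.0, 80.0, 90.0):
    upload_latency_in_millis.record(latency)
print(upload_latency_in_millis.average())  # 80.0
```

Keeping a running sum makes each update constant-time, which matters if the stat is refreshed on every upload.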
API design
Base Path
GET /_remotestore/stats
Supported path parameters
- Name of RTS-enabled index (required)
- Shard ID for RTS-enabled index (optional)
Supported query parameters
- local - Retrieves stats only for the shards on the coordinating node.
- Default (no parameters) - Retrieves stats for all the shards of the index across the participating nodes.
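As a rough illustration of how the path and query parameters combine, the Python sketch below builds the request path. The helper name build_stats_path is hypothetical, and passing the flag as local=true is an assumption about the query-string form, which the RFC does not spell out.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def build_stats_path(index: str, shard_id: Optional[int] = None, local: bool = False) -> str:
    """Build the request path for the stats API from the parameters listed above."""
    if not index:
        raise ValueError("index is required")
    path = "/_remotestore/stats/" + quote(index, safe="")
    if shard_id is not None:
        path += f"/{shard_id}"
    if local:
        # Assumed query-string form for the `local` parameter.
        path += "?" + urlencode({"local": "true"})
    return path


print(build_stats_path("my-index-1"))              # /_remotestore/stats/my-index-1
print(build_stats_path("my-index-1", shard_id=0))  # /_remotestore/stats/my-index-1/0
print(build_stats_path("my-index-1", local=True))  # /_remotestore/stats/my-index-1?local=true
```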
Shard-level stats for RTS-enabled index
Path:
GET /_remotestore/stats/<index>/<shardId>
Response:
{
"shard_id" : "[my-index-1][0]",
"segments": {
<RSS flow-agnostic stats here>
"upload" : {
"refresh_time_lag_in_millis": 5727,
"refresh_lag": 1,
"bytes_lag": 0,
"backpressure_rejection_count": 0,
"consecutive_failure_count": 0,
"total_remote_refresh": {
"started": 57,
"succeeded": 56,
"failed": 0
},
"total_uploads_in_bytes": {
"started": 1568138701,
"succeeded": 1568138701,
"failed": 0
},
"remote_refresh_size_in_bytes": {
"last_successful": 12705142,
"moving_avg": 32766119.75
},
"upload_latency_in_bytes_per_sec": {
"moving_avg": 25523682.95
},
"remote_refresh_latency_in_millis": {
"moving_avg": 990.55
}
},
"download" : [
<new RSS download flow stats here>
]
},
"translog": {
<RTS flow-agnostic stats here>
"upload" : {
"lag": 2,
"last_upload_timestamp": 1687941312,
"total_uploads": {
"started": 98,
"succeeded": 96,
"failed": 0
},
"total_uploads_in_bytes": {
"started": 246465,
"succeeded": 236647,
"failed": 0
},
"total_upload_time_in_millis": 900,
"upload_size_in_bytes": {
"moving_avg": 236.75
},
"upload_speed_in_bytes_per_sec": {
"moving_avg": 211.95
},
"upload_latency_in_millis": {
"moving_avg": 70.55
}
},
"download" : [
<new RTS download flow stats here>
]
}
}
Index-level stats for RTS-enabled index
Path:
GET /_remotestore/stats/<index>
Response:
{
{
"shard_id" : "[my-index-1][0]",
"segments": {
<RSS flow-agnostic stats here>
"upload" : {
"refresh_time_lag_in_millis": 5727,
"refresh_lag": 1,
"bytes_lag": 0,
"backpressure_rejection_count": 0,
"consecutive_failure_count": 0,
"total_remote_refresh": {
"started": 57,
"succeeded": 56,
"failed": 0
},
"total_uploads_in_bytes": {
"started": 1568138701,
"succeeded": 1568138701,
"failed": 0
},
"remote_refresh_size_in_bytes": {
"last_successful": 12705142,
"moving_avg": 32766119.75
},
"upload_latency_in_bytes_per_sec": {
"moving_avg": 25523682.95
},
"remote_refresh_latency_in_millis": {
"moving_avg": 990.55
}
},
"download" : [
<new RSS download flow stats here>
]
},
"translog": {
<RTS flow-agnostic stats here>
"upload" : {
"lag": 2,
"last_upload_timestamp": 1687941312,
"total_uploads": {
"started": 98,
"succeeded": 96,
"failed": 0
},
"total_uploads_in_bytes": {
"started": 246465,
"succeeded": 236647,
"failed": 0
},
"total_upload_time_in_millis": 900,
"upload_size_in_bytes": {
"moving_avg": 236.75
},
"upload_speed_in_bytes_per_sec": {
"moving_avg": 211.95
},
"upload_latency_in_millis": {
"moving_avg": 70.55
}
},
"download" : [
<new RTS download flow stats here>
]
}
},
...,
{
"shard_id" : "[my-index-1][N]",
"segments": {
<RSS flow-agnostic stats here>
"upload" : {
"refresh_time_lag_in_millis": 5727,
"refresh_lag": 1,
"bytes_lag": 0,
"backpressure_rejection_count": 0,
"consecutive_failure_count": 0,
"total_remote_refresh": {
"started": 57,
"succeeded": 56,
"failed": 0
},
"total_uploads_in_bytes": {
"started": 1568138701,
"succeeded": 1568138701,
"failed": 0
},
"remote_refresh_size_in_bytes": {
"last_successful": 12705142,
"moving_avg": 32766119.75
},
"upload_latency_in_bytes_per_sec": {
"moving_avg": 25523682.95
},
"remote_refresh_latency_in_millis": {
"moving_avg": 990.55
}
},
"download" : [
<new RSS download flow stats here>
]
},
"translog": {
<RTS flow-agnostic stats here>
"upload" : {
"lag": 2,
"last_upload_timestamp": 1687941312,
"total_uploads": {
"started": 98,
"succeeded": 96,
"failed": 0
},
"total_uploads_in_bytes": {
"started": 246465,
"succeeded": 236647,
"failed": 0
},
"total_upload_time_in_millis": 900,
"upload_size_in_bytes": {
"moving_avg": 236.75
},
"upload_speed_in_bytes_per_sec": {
"moving_avg": 211.95
},
"upload_latency_in_millis": {
"moving_avg": 70.55
}
},
"download" : [
<new RTS download flow stats here>
]
}
}
}
Shard-level stats for RTS-disabled but RSS-enabled index
Path:
GET /_remotestore/stats/<index>/<shardId>
Response:
{
"shard_id" : "[my-index-1][0]",
"segments": {
<RSS flow-agnostic stats here>
"upload" : {
"refresh_time_lag_in_millis": 5727,
"refresh_lag": 1,
"bytes_lag": 0,
"backpressure_rejection_count": 0,
"consecutive_failure_count": 0,
"total_remote_refresh": {
"started": 57,
"succeeded": 56,
"failed": 0
},
"total_uploads_in_bytes": {
"started": 1568138701,
"succeeded": 1568138701,
"failed": 0
},
"remote_refresh_size_in_bytes": {
"last_successful": 12705142,
"moving_avg": 32766119.75
},
"upload_latency_in_bytes_per_sec": {
"moving_avg": 25523682.95
},
"remote_refresh_latency_in_millis": {
"moving_avg": 990.55
}
},
"download" : [
<new RSS download flow stats here>
]
}
}
Index-level stats for RTS-disabled but RSS-enabled index
Path:
GET /_remotestore/stats/<index>
Response:
{
{
"shard_id" : "[my-index-1][0]",
"segments": {
<RSS flow-agnostic stats here>
"upload" : {
"refresh_time_lag_in_millis": 5727,
"refresh_lag": 1,
"bytes_lag": 0,
"backpressure_rejection_count": 0,
"consecutive_failure_count": 0,
"total_remote_refresh": {
"started": 57,
"succeeded": 56,
"failed": 0
},
"total_uploads_in_bytes": {
"started": 1568138701,
"succeeded": 1568138701,
"failed": 0
},
"remote_refresh_size_in_bytes": {
"last_successful": 12705142,
"moving_avg": 32766119.75
},
"upload_latency_in_bytes_per_sec": {
"moving_avg": 25523682.95
},
"remote_refresh_latency_in_millis": {
"moving_avg": 990.55
}
},
"download" : [
<new RSS download flow stats here>
]
}
},
...,
{
"shard_id" : "[my-index-1][N]",
"segments": {
<RSS flow-agnostic stats here>
"upload" : {
"refresh_time_lag_in_millis": 5727,
"refresh_lag": 1,
"bytes_lag": 0,
"backpressure_rejection_count": 0,
"consecutive_failure_count": 0,
"total_remote_refresh": {
"started": 57,
"succeeded": 56,
"failed": 0
},
"total_uploads_in_bytes": {
"started": 1568138701,
"succeeded": 1568138701,
"failed": 0
},
"remote_refresh_size_in_bytes": {
"last_successful": 12705142,
"moving_avg": 32766119.75
},
"upload_latency_in_bytes_per_sec": {
"moving_avg": 25523682.95
},
"remote_refresh_latency_in_millis": {
"moving_avg": 990.55
}
},
"download" : [
<new RSS download flow stats here>
]
}
}
}
Related information
- [Draft] Identify stats for remote store feature #6789
- [RFC] [Remote Store] Remote Store Stats API #7153
- https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/remote-store/remote-store-stats-api/