
[Draft] Identify stats for remote store feature #6789

@sachinpkale

This is a work in progress; we will keep adding stats and metrics for remote store as we identify them.

Goal

Get visibility into remote store related operations. These stats would help in debugging issues and in monitoring the cluster for potential problems. As a user ingesting data into a remote store backed index, I would like to know whether the segments and translog files are being uploaded successfully to the configured remote store, whether there are any failures, and whether the remote store is lagging behind the local store.

Changes to existing APIs

  • Index Stats API response should provide remote_store and remote_translog stats similar to store and translog stats
  • Cat Segments API should take a query parameter to provide details of segments in remote store
  • Index Segments API should take a query parameter to provide details of segments in remote store
  • Cat Recovery API should provide details on the recovery from remote store and remote translog

New APIs

Cat Remote Store

  • Query Parameters
    • Index Name - required
    • Shard ID - optional
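
As a rough illustration of the required and optional parameters listed above, the request could be assembled as in the sketch below. The `/_cat/remote_store` path and the `index` and `shard` parameter names are hypothetical placeholders; the issue only states that the index name is required and the shard ID is optional.

```python
from typing import Optional
from urllib.parse import urlencode


def cat_remote_store_path(index_name: str, shard_id: Optional[int] = None) -> str:
    """Build a request path for the proposed Cat Remote Store API.

    The endpoint path and parameter names are assumptions made for this
    sketch; they are not defined by the proposal.
    """
    if not index_name:
        # Index name is listed as required.
        raise ValueError("index name is required")
    params = {"index": index_name}
    if shard_id is not None:
        # Shard ID is listed as optional; omit it to cover all shards.
        if shard_id < 0:
            raise ValueError("shard ID must be non-negative")
        params["shard"] = str(shard_id)
    return "/_cat/remote_store?" + urlencode(params)
```

Calling `cat_remote_store_path("my-index", 0)` would return a path scoped to a single shard, while leaving out the shard ID returns one scoped to the whole index.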
Remote Segment Store Stats
  1. number of segment files that are uploaded to remote segment store

    • Provides number of uploaded segments at the time of the API call
    • This metric will not consider inactive segments
  2. remote segment store lag with respect to local store

    • number of segments
      • Difference between the number of segments on local and on remote
      • Indicates whether the remote store is in sync with the local store
    • size in bytes
      • Difference between the size of segments on local and the size of those uploaded to remote
    • time in millis
      • Difference between the creation time of the most recent file on local and the maximum creation time among files uploaded to the remote store
    • number of refresh checkpoints since the last successful upload
  3. timestamp of last successful file upload

  4. time taken to upload a segment file (total, avg, max, min, P90)

  5. time taken to delete a segment file (total, avg, max, min, P90)

  6. size of a segment file in bytes (avg, max, min, P90)

  7. total upload failures

  8. live/current upload failures

  9. total delete failures

  10. live/current delete failures

  11. total successful uploads

  12. total successful deletes

  13. time spent in remote store uploads during refresh (total, avg, max, min, P90)
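
The lag metrics in item 2 could be derived along the lines of the sketch below. It is one possible interpretation, not the proposed implementation: it assumes lag is measured over files that exist locally but have not yet been uploaded, and that each file's creation time is recorded in milliseconds. The `SegmentFile`, `RemoteStoreLag`, and `compute_lag` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SegmentFile:
    size_bytes: int
    created_millis: int  # creation time of the file on the local store


@dataclass(frozen=True)
class RemoteStoreLag:
    segments: int                # files on local that are not yet on remote
    size_bytes: int              # total size of those pending files
    time_millis: Optional[int]   # None when nothing has been uploaded yet


def compute_lag(local: Dict[str, SegmentFile],
                remote: Dict[str, SegmentFile]) -> RemoteStoreLag:
    """Compare local and remote file listings, keyed by file name."""
    pending = [f for name, f in local.items() if name not in remote]
    time_lag = None
    if local and remote:
        latest_local = max(f.created_millis for f in local.values())
        latest_remote = max(f.created_millis for f in remote.values())
        # Clamp at zero so a fully synced store never reports negative lag.
        time_lag = max(0, latest_local - latest_remote)
    return RemoteStoreLag(
        segments=len(pending),
        size_bytes=sum(f.size_bytes for f in pending),
        time_millis=time_lag,
    )


local_files = {
    "_0.cfs": SegmentFile(size_bytes=100, created_millis=1000),
    "_1.cfs": SegmentFile(size_bytes=250, created_millis=2000),
    "_2.cfs": SegmentFile(size_bytes=50, created_millis=3500),
}
# Only the first two files have been uploaded so far.
remote_files = {name: local_files[name] for name in ("_0.cfs", "_1.cfs")}
lag = compute_lag(local_files, remote_files)
```

With these sample listings, one 50-byte file is pending and the newest local file is 1500 ms newer than the newest uploaded file. A plain difference of counts and total sizes would give the same result here, but would drift if stale files remain in the remote store.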

Remote Translog Stats
  • Mostly the same as above (translog-specific stats will be added below)
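
The (total, avg, max, min, P90) summaries requested for upload and delete times, which the translog stats would mostly share, could be computed as in the sketch below. It assumes the individual durations are available as a list of milliseconds and uses the nearest-rank definition of P90; the proposal does not specify a percentile method, and the `DurationSummary` and `summarize` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DurationSummary:
    total: float
    avg: float
    maximum: float
    minimum: float
    p90: float


def summarize(durations_millis: Sequence[float]) -> DurationSummary:
    """Summarize recorded durations (for example, per-file upload times)."""
    if not durations_millis:
        raise ValueError("at least one duration is required")
    ordered = sorted(durations_millis)
    count = len(ordered)
    # Nearest-rank P90: the smallest value with at least 90% of samples
    # at or below it. Integer arithmetic computes ceil(0.9 * count).
    rank = (9 * count + 9) // 10
    total = sum(ordered)
    return DurationSummary(
        total=total,
        avg=total / count,
        maximum=ordered[-1],
        minimum=ordered[0],
        p90=ordered[rank - 1],
    )


summary = summarize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Keeping every sample in memory is only practical for a sketch; a long-running node would more likely use a bounded window or a streaming estimator, which is a separate design choice.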

Labels

  • Storage:Durability (Issues and PRs related to the durability framework)
  • enhancement (Enhancement or improvement to existing feature or request)
  • v2.8.0 (Issues and PRs related to version v2.8.0)
