☂ [DEP-05] Implement EtcdOpsTask #1047

@seshachalam-yv

How to categorize this issue?

/area quality
/area robustness
/area ops-productivity
/area backup
/area documentation
/area testing
/area monitoring
/kind enhancement
/kind api-change
/kind epic
/kind task

What would you like to be added:

This ☂ umbrella issue tracks the implementation of the EtcdOpsTask custom resource as described in DEP-05. The goal is to provide a declarative, extensible mechanism to manage operational tasks for etcd clusters in a structured and automated manner, reducing manual intervention and improving operational efficiency.


Sub-Tasks

  • Define EtcdOpsTask CRD

    • Create the EtcdOpsTask CRD with OnDemandSnapshot configuration.
    • Add CEL validation rules for immutability and required fields.
    • Update the codebase to use EtcdOpsTask consistently, removing references to EtcdOperatorTask.
    • Provide an example OnDemandSnapshot manifest in examples/.
  • Implement EtcdOpsTask Controller

    • Develop a controller to reconcile EtcdOpsTask resources, handling task admission, execution, and cleanup.
    • Implement the OnDemandSnapshot task handler to trigger snapshots on referenced etcd clusters.
    • Add unit tests for the OnDemandSnapshot task handler.
    • Add integration tests to verify controller behavior for OnDemandSnapshot.
    • Document the controller’s architecture and usage in docs/.
  • Implement E2E Tests

    • Create E2E tests to validate EtcdOpsTask workflows in a cluster environment.
    • Test scenarios: task creation, execution, status updates, error handling, and interactions with etcd clusters and backup stores.
  • Add Metrics

    • Implement Prometheus metrics for EtcdOpsTask to track task execution, status, and errors.
    • Document metrics in docs/.
  • Enhance etcd-backup-restore Server

    • Design and implement HTTP REST endpoints for etcd maintenance operations (e.g., /maintenance/defrag) in the etcd-backup-restore server. Like /snapshot/full, the endpoints should support both synchronous and asynchronous operations.
    • Ensure endpoints follow an extensible pattern for future operations.
    • Add unit and integration tests for new endpoints.
  • Documentation

    • Update docs/ with:
      • Guide on creating and managing EtcdOpsTask resources.
      • Examples for each supported task type.
      • Best practices for task configuration and monitoring.
      • Instructions for integrating with etcd-backup-restore endpoints.
  • Support Additional Task Types

    • Implement task types from DEP-05:
      • QuorumLossRecoveryTask: Automate recovery from quorum loss.
      • OnDemandMaintenanceTask: Support operations like defragmentation.
      • EtcdCopyBackupsTask: Enable copying backups between stores.
      • OnDemandSnapshotCompaction: Trigger snapshot compaction.
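
To make the first sub-task more concrete, here is a minimal Go sketch of what the spec types and the two validation rules (required fields, immutability) could look like. It uses only the standard library. All type names, field names, and enum values below (`EtcdOpsTaskSpec`, `EtcdName`, `SnapshotTypeFull`, and so on) are assumptions for illustration and are not taken from DEP-05. In the real CRD these checks would presumably be expressed as CEL rules, for example a transition rule such as `self == oldSelf` for immutability, rather than as Go functions.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// SnapshotType is a hypothetical enum for the kind of snapshot requested.
type SnapshotType string

const (
	SnapshotTypeFull  SnapshotType = "full"
	SnapshotTypeDelta SnapshotType = "delta"
)

// OnDemandSnapshotConfig is an illustrative config block for the
// OnDemandSnapshot task; the real schema is defined by DEP-05.
type OnDemandSnapshotConfig struct {
	Type SnapshotType
}

// EtcdOpsTaskSpec is an assumed shape of the task spec, not the actual API.
type EtcdOpsTaskSpec struct {
	EtcdName         string                  // referenced etcd cluster (hypothetical field)
	OnDemandSnapshot *OnDemandSnapshotConfig // task-specific configuration
}

// validateCreate mirrors the "required fields" rules applied on creation.
func validateCreate(spec EtcdOpsTaskSpec) error {
	if spec.EtcdName == "" {
		return errors.New("spec.etcdName is required")
	}
	if spec.OnDemandSnapshot == nil {
		return errors.New("spec.onDemandSnapshot is required")
	}
	switch spec.OnDemandSnapshot.Type {
	case SnapshotTypeFull, SnapshotTypeDelta:
		return nil
	default:
		return fmt.Errorf("unsupported snapshot type %q", spec.OnDemandSnapshot.Type)
	}
}

// validateUpdate mirrors an immutability rule: once created, the spec
// must not change. reflect.DeepEqual follows the config pointer, so two
// specs with equal contents compare as equal.
func validateUpdate(oldSpec, newSpec EtcdOpsTaskSpec) error {
	if !reflect.DeepEqual(oldSpec, newSpec) {
		return errors.New("spec is immutable")
	}
	return nil
}

func main() {
	spec := EtcdOpsTaskSpec{
		EtcdName:         "etcd-main",
		OnDemandSnapshot: &OnDemandSnapshotConfig{Type: SnapshotTypeFull},
	}
	fmt.Println("create:", validateCreate(spec))

	// Attempt to switch the snapshot type after creation.
	changed := spec
	changed.OnDemandSnapshot = &OnDemandSnapshotConfig{Type: SnapshotTypeDelta}
	fmt.Println("update:", validateUpdate(spec, changed))
}
```

Treating the whole spec as immutable keeps a task a one-shot record of a single operation: to run a different operation, a new EtcdOpsTask is created instead of editing an existing one. Whether DEP-05 makes the entire spec or only selected fields immutable is not stated here, so this is only one possible reading.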

Why is this needed:

EtcdOpsTask provides a declarative and extensible way to manage out-of-band operations on etcd clusters. These operations cover scenarios that today require manual operator intervention. By introducing this abstraction, we aim to:

  • Improve operational productivity by automating repetitive and error-prone processes
  • Provide better observability of operator tasks
  • Enable future extensibility for additional task types

Labels

area/ops-productivity: Operator productivity related (how to improve operations)
kind/enhancement: Enhancement, improvement, extension
