Description
How to categorize this issue?
/area quality
/area robustness
/area ops-productivity
/area backup
/area documentation
/area testing
/area monitoring
/kind enhancement
/kind api-change
/kind epic
/kind task
What would you like to be added:
This ☂ umbrella issue tracks the implementation of the EtcdOpsTask custom resource as described in DEP-05. The goal is to provide a declarative, extensible mechanism to manage operational tasks for etcd clusters in a structured and automated manner, reducing manual intervention and improving operational efficiency.
Sub-Tasks
- Define `EtcdOpsTask` CRD
  - Create the `EtcdOpsTask` CRD with `OnDemandSnapshot` configuration.
  - Add CEL validation rules for immutability and required fields.
  - Update the codebase to use `EtcdOpsTask` consistently, removing references to `EtcdOperatorTask`.
  - Provide an example `OnDemandSnapshot` manifest in `examples/`.
- Implement `EtcdOpsTask` Controller
  - Develop a controller to reconcile `EtcdOpsTask` resources, handling task admission, execution, and cleanup.
  - Implement the `OnDemandSnapshot` task handler to trigger snapshots on referenced etcd clusters.
  - Add unit tests for the `OnDemandSnapshot` task handler.
  - Add integration tests to verify controller behavior for `OnDemandSnapshot`.
  - Document the controller's architecture and usage in `docs/`.
- Implement E2E Tests
  - Create E2E tests to validate `EtcdOpsTask` workflows in a cluster environment.
  - Test scenarios: task creation, execution, status updates, error handling, and interactions with etcd clusters and backup stores.
- Add Metrics
  - Implement Prometheus metrics for `EtcdOpsTask` to track task execution, status, and errors.
  - Document metrics in `docs/`.
- Enhance `etcd-backup-restore` Server
  - Design and implement HTTP REST endpoints for etcd maintenance operations (e.g., `/maintenance/defrag`) in the etcd-backup-restore server, supporting synchronous and asynchronous operations, similar to `/snapshot/full`.
  - Ensure endpoints follow an extensible pattern for future operations.
  - Add unit and integration tests for new endpoints.
- Documentation
  - Update `docs/` with:
    - Guide on creating and managing `EtcdOpsTask` resources.
    - Examples for each supported task type.
    - Best practices for task configuration and monitoring.
    - Instructions for integrating with etcd-backup-restore endpoints.
- Support Additional Task Types
  - Implement task types from DEP-05:
    - `QuorumLossRecoveryTask`: Automate recovery from quorum loss.
    - `OnDemandMaintenanceTask`: Support operations like defragmentation.
    - `EtcdCopyBackupsTask`: Enable copying backups between stores.
    - `OnDemandSnapshotCompaction`: Trigger snapshot compaction.
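To make the etcd-backup-restore sub-task more concrete, here is a rough sketch of how a single handler could serve every `/maintenance/<operation>` path from a map of operations, so that adding an operation means adding a map entry. It is not the actual etcd-backup-restore code: `MaintenanceOp`, `maintenanceHandler`, `newMaintenanceMux`, `noopOp`, `do`, and the `async=true` query parameter are all hypothetical names chosen for illustration, and only the `/maintenance/defrag` path comes from this issue.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// MaintenanceOp is a hypothetical signature for one maintenance operation.
type MaintenanceOp func(ctx context.Context) error

// maintenanceHandler dispatches /maintenance/<name> to a registered operation.
type maintenanceHandler struct {
	ops map[string]MaintenanceOp
	wg  sync.WaitGroup // tracks in-flight asynchronous operations
}

func (h *maintenanceHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	name := strings.TrimPrefix(r.URL.Path, "/maintenance/")
	op, ok := h.ops[name]
	if !ok {
		http.Error(w, "unknown operation", http.StatusNotFound)
		return
	}
	// Assumption: callers opt into asynchronous execution with ?async=true.
	if r.URL.Query().Get("async") == "true" {
		h.wg.Add(1)
		go func() {
			defer h.wg.Done()
			// A real server would record the outcome for later status queries.
			_ = op(context.Background())
		}()
		w.WriteHeader(http.StatusAccepted)
		return
	}
	// Synchronous path: respond only after the operation has finished.
	if err := op(r.Context()); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// newMaintenanceMux mounts the handler on the /maintenance/ subtree.
func newMaintenanceMux(ops map[string]MaintenanceOp) (*http.ServeMux, *maintenanceHandler) {
	h := &maintenanceHandler{ops: ops}
	mux := http.NewServeMux()
	mux.Handle("/maintenance/", h)
	return mux, h
}

// noopOp is a placeholder; a real operation would call etcd here.
func noopOp(context.Context) error { return nil }

// do sends an in-memory request and returns the response status code.
func do(h http.Handler, method, target string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, target, nil))
	return rec.Code
}

func main() {
	mux, h := newMaintenanceMux(map[string]MaintenanceOp{"defrag": noopOp})
	fmt.Println(do(mux, http.MethodPost, "/maintenance/defrag"))            // 200
	fmt.Println(do(mux, http.MethodPost, "/maintenance/defrag?async=true")) // 202
	h.wg.Wait() // let the background run finish before exiting
	fmt.Println(do(mux, http.MethodPost, "/maintenance/unknown")) // 404
}
```

Answering `202 Accepted` for asynchronous requests lets a caller distinguish "started" from "completed"; how the caller later learns the outcome is left open here and would need to be settled in the endpoint design.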
Why is this needed:
The introduction of EtcdOpsTask provides a declarative and extensible way to manage out-of-band operations on etcd clusters. These tasks are critical for scenarios that require manual operator intervention today. By introducing this abstraction, we aim to:
- Improve operational productivity by automating repetitive and error-prone processes
- Provide better observability of operator tasks
- Enable future extensibility for additional task types
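One possible way to get the extensibility mentioned above is a registry that maps each task type to a handler, with every handler going through the admission, execution, and cleanup phases named in the controller sub-task. The following is a minimal sketch under that assumption, not the design from DEP-05: `TaskType`, `Handler`, `Registry`, `traceHandler`, and `runDemo` are hypothetical names, and only the `OnDemandSnapshot` task name is taken from this issue.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// TaskType identifies a kind of operational task.
type TaskType string

const OnDemandSnapshot TaskType = "OnDemandSnapshot"

var ErrUnknownTaskType = errors.New("unknown task type")

// Handler is a hypothetical per-task-type interface with one method per phase.
type Handler interface {
	Admit(ctx context.Context) error
	Run(ctx context.Context) error
	Cleanup(ctx context.Context) error
}

// Registry maps task types to handler factories.
type Registry struct {
	factories map[TaskType]func() Handler
}

func NewRegistry() *Registry {
	return &Registry{factories: make(map[TaskType]func() Handler)}
}

// Register adds a new task type; registering the same type twice is an error.
func (r *Registry) Register(t TaskType, f func() Handler) error {
	if _, exists := r.factories[t]; exists {
		return fmt.Errorf("task type %q already registered", t)
	}
	r.factories[t] = f
	return nil
}

// Execute runs one task through admission, execution, and cleanup.
// Cleanup runs whenever admission succeeded, even if Run failed.
func (r *Registry) Execute(ctx context.Context, t TaskType) error {
	newHandler, ok := r.factories[t]
	if !ok {
		return fmt.Errorf("%w: %s", ErrUnknownTaskType, t)
	}
	h := newHandler()
	if err := h.Admit(ctx); err != nil {
		return fmt.Errorf("admit: %w", err)
	}
	if err := h.Run(ctx); err != nil {
		_ = h.Cleanup(ctx) // best-effort cleanup; the run error takes priority
		return fmt.Errorf("run: %w", err)
	}
	return h.Cleanup(ctx)
}

// traceHandler records which phases ran; it stands in for a real handler.
type traceHandler struct{ trace *[]string }

func (h traceHandler) Admit(context.Context) error {
	*h.trace = append(*h.trace, "admit")
	return nil
}

func (h traceHandler) Run(context.Context) error {
	*h.trace = append(*h.trace, "run")
	return nil
}

func (h traceHandler) Cleanup(context.Context) error {
	*h.trace = append(*h.trace, "cleanup")
	return nil
}

// runDemo registers one task type and executes it once.
func runDemo() ([]string, error) {
	var trace []string
	reg := NewRegistry()
	factory := func() Handler { return traceHandler{trace: &trace} }
	if err := reg.Register(OnDemandSnapshot, factory); err != nil {
		return nil, err
	}
	err := reg.Execute(context.Background(), OnDemandSnapshot)
	return trace, err
}

func main() {
	trace, err := runDemo()
	fmt.Println(trace, err) // prints: [admit run cleanup] <nil>
}
```

With this shape, supporting another task type would amount to one more `Register` call plus a new `Handler` implementation, leaving the dispatch logic untouched.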