Skip to content

MON-4501: Migrate Prometheus targets discovering from Endpoints to EndpointSlices#1357

Open
machine424 wants to merge 2 commits intoopenshift:masterfrom
machine424:endpoinmig
Open

MON-4501: Migrate Prometheus targets discovering from Endpoints to EndpointSlices#1357
machine424 wants to merge 2 commits intoopenshift:masterfrom
machine424:endpoinmig

Conversation

@machine424
Copy link

This PR migrates Prometheus service discovery from the deprecated Endpoints API to the EndpointSlices API, by:

  • Setting serviceDiscoveryRole: EndpointSlice on ServiceMonitors.
  • Granting Prometheus endpointslices permissions.

We're taking a conservative approach by keeping the existing endpoints permissions alongside the new endpointslices ones. This provides a safety net in case any ServiceMonitors, whether deployed from this repo or from another source, still rely on the same Role and were missed during the migration.

That said, since both resources provide essentially the same data, keeping both isn't meaningfully more permissive from a security standpoint.

These changes target OpenShift 4.22+ and should not be backported to earlier releases.

Due to the scope of changes across multiple repositories, these modifications were generated with Claude assistance.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2026
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 9, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@machine424 machine424 marked this pull request as ready for review February 9, 2026 12:18
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2026
@machine424 machine424 changed the title Migrate Prometheus targets discovering from Endpoints to EndpointSlices MON-4501: Migrate Prometheus targets discovering from Endpoints to EndpointSlices Feb 9, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 9, 2026
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Feb 9, 2026

@machine424: This pull request references MON-4501 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

This PR migrates Prometheus service discovery from the deprecated Endpoints API to the EndpointSlices API, by:

  • Setting serviceDiscoveryRole: EndpointSlice on ServiceMonitors.
  • Granting Prometheus endpointslices permissions.

We're taking a conservative approach by keeping the existing endpoints permissions alongside the new endpointslices ones. This provides a safety net in case any ServiceMonitors, whether deployed from this repo or from another source, still rely on the same Role and were missed during the migration.

That said, since both resources provide essentially the same data, keeping both isn't meaningfully more permissive from a security standpoint.

These changes target OpenShift 4.22+ and should not be backported to earlier releases.

Due to the scope of changes across multiple repositories, these modifications were generated with Claude assistance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jcmoraisjr
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 10, 2026
@machine424
Copy link
Author

/retest-required

@Miciah
Copy link
Contributor

Miciah commented Mar 11, 2026

I imagine you want to set serviceDiscoveryRole in desiredServiceMonitor as well:

// desiredServiceMonitor returns the desired servicemonitor for the given
// ingresscontroller and service.
func desiredServiceMonitor(ic *operatorv1.IngressController, svc *corev1.Service, deploymentRef metav1.OwnerReference) *unstructured.Unstructured {
name := controller.IngressControllerServiceMonitorName(ic)
sm := &unstructured.Unstructured{
Object: map[string]interface{}{
"metadata": map[string]interface{}{
"namespace": name.Namespace,
"name": name.Name,
},
"spec": map[string]interface{}{
"namespaceSelector": map[string]interface{}{
"matchNames": []interface{}{
operatorcontroller.DefaultOperandNamespace,
},
},
"selector": map[string]interface{}{
"matchLabels": map[string]interface{}{
manifests.OwningIngressControllerLabel: ic.Name,
},
},
// It is important to use the type []interface{}
// for the "endpoints" field. Using
// []map[string]interface{} causes at least two
// problems: first, DeepCopy will fail with
// "cannot deep copy []map[string]interface {}";
// second, the API returns an object that uses
// type []interface{} for this field, so
// DeepEqual against an API object will always
// return false.
"endpoints": []interface{}{
map[string]interface{}{
"bearerTokenFile": "/var/run/secrets/kubernetes.io/serviceaccount/token",
"interval": "30s",
"port": "metrics",
"scheme": "https",
"path": "/metrics",
"tlsConfig": map[string]interface{}{
"caFile": "/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt",
"serverName": fmt.Sprintf("%s.%s.svc", svc.Name, svc.Namespace),
},
},
},
},
},
}
sm.SetGroupVersionKind(schema.GroupVersionKind{
Group: "monitoring.coreos.com",
Kind: "ServiceMonitor",
Version: "v1",
})
sm.SetOwnerReferences([]metav1.OwnerReference{deploymentRef})
return sm
}

@Miciah
Copy link
Contributor

Miciah commented Mar 11, 2026

By the way, if you don't mind, please copy the description and Jira link into the commit message.

@rikatz
Copy link
Member

rikatz commented Mar 11, 2026

/hold

There are some more changes to be done:
https://github.com/openshift/cluster-ingress-operator/tree/master/pkg/manifests/assets/router/metrics needs to be fixed, and ensured the change during upgrades (same comments from cluster-dns-operator)

func desiredServiceMonitor(ic *operatorv1.IngressController, svc *corev1.Service, deploymentRef metav1.OwnerReference) *unstructured.Unstructured {
you probably want to set the same config for endpointSlices, and fix the ensure function if required as well

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed lgtm Indicates that a PR is ready to be merged. labels Mar 11, 2026
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 18, 2026

New changes are detected. LGTM label has been removed.

@coderabbitai
Copy link

coderabbitai bot commented Mar 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 06ce3040-60b3-4d36-babf-1df8a9faf276

📥 Commits

Reviewing files that changed from the base of the PR and between a5951a5 and 028874d.

📒 Files selected for processing (6)
  • manifests/0000_90_ingress-operator_00_prometheusrole.yaml
  • manifests/0000_90_ingress-operator_02_servicemonitor.yaml
  • pkg/manifests/assets/router/metrics/role.yaml
  • pkg/operator/controller/ingress/metrics.go
  • pkg/operator/controller/ingress/monitoring.go
  • pkg/operator/controller/ingress/monitoring_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/manifests/assets/router/metrics/role.yaml
  • manifests/0000_90_ingress-operator_02_servicemonitor.yaml

Walkthrough

The changes add support for EndpointSlice-based service discovery by granting RBAC access to endpointslices in discovery.k8s.io, configuring the ServiceMonitor to use EndpointSlice discovery, and adding controller logic and tests to ensure and synchronize the metrics Role resource.

Changes

Cohort / File(s) Summary
RBAC EndpointSlices Access
manifests/0000_90_ingress-operator_00_prometheusrole.yaml, pkg/manifests/assets/router/metrics/role.yaml
Adds a PolicyRule permitting get, list, and watch on endpointslices in the discovery.k8s.io API group to the prometheus-k8s Role.
ServiceMonitor Configuration
manifests/0000_90_ingress-operator_02_servicemonitor.yaml
Adds serviceDiscoveryRole: EndpointSlice to the ServiceMonitor spec to opt into EndpointSlice-based service discovery.
Metrics Role Controller Logic
pkg/operator/controller/ingress/monitoring.go
Introduces ensureMetricsRole, currentMetricsRole, updateMetricsRole, and metricsRoleChanged to create/compare/update the Role resource using manifests.MetricsRole() and diff-driven updates.
Metrics Role Integration
pkg/operator/controller/ingress/metrics.go
Replaces previous get-or-create flow with a single ensureMetricsRole() call and propagates errors; retains existing RoleBinding and ServiceMonitor flows.
Metrics Role Testing
pkg/operator/controller/ingress/monitoring_test.go
Adds Test_metricsRoleChanged and extends Test_serviceMonitorChanged to cover metrics role change detection and ServiceMonitor serviceDiscoveryRole mutation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 18, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jcmoraisjr. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…rviceDiscoveryRole: EndpointSlice in ServiceMonitors

This PR migrates Prometheus service discovery from the deprecated Endpoints API to the EndpointSlices API, by:

    Setting serviceDiscoveryRole: EndpointSlice on ServiceMonitors.
    Granting Prometheus endpointslices permissions.

We're taking a conservative approach by keeping the existing endpoints permissions alongside the new endpointslices ones. This provides a safety net in case any ServiceMonitors, whether deployed from this repo or from another source, still rely on the same Role and were missed during the migration.

That said, since both resources provide essentially the same data, keeping both isn't meaningfully more permissive from a security standpoint.

These changes target OpenShift 4.22+ and should not be backported to earlier releases.

Due to the scope of changes across multiple repositories, these modifications were generated with Claude assistance.
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/operator/controller/ingress/monitoring_test.go (1)

55-62: Strengthen this test to assert the expected default value, not only “changed”.

Right now, this can still pass even if desiredServiceMonitor stops setting serviceDiscoveryRole to EndpointSlice. Add an explicit baseline assertion before mutating.

Suggested test hardening
  // Verify that changing serviceDiscoveryRole is detected as a change.
  sm4 := desiredServiceMonitor(ic, svc, deploymentRef)
+ val, found, err := unstructured.NestedString(sm1.Object, "spec", "serviceDiscoveryRole")
+ if err != nil {
+   t.Fatalf("failed to read servicemonitor serviceDiscoveryRole: %v", err)
+ }
+ if !found || val != "EndpointSlice" {
+   t.Fatalf("expected serviceDiscoveryRole to be EndpointSlice, got found=%t val=%q", found, val)
+ }
  if err := unstructured.SetNestedField(sm4.Object, "Endpoints", "spec", "serviceDiscoveryRole"); err != nil {
    t.Fatalf("failed to mutate servicemonitor: %v", err)
  }

As per coding guidelines, "-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/controller/ingress/monitoring_test.go` around lines 55 - 62, Add
a baseline assertion that the generated ServiceMonitor has the expected default
serviceDiscoveryRole before you mutate it: use desiredServiceMonitor(...) to
produce sm4, read the nested field (e.g., via
unstructured.NestedString(sm4.Object, "spec", "serviceDiscoveryRole")) and
assert it equals "EndpointSlice" (or the expected default) so the test fails if
defaulting stops setting that value; then proceed with the existing mutation and
serviceMonitorChanged check. Reference: desiredServiceMonitor, sm4,
serviceMonitorChanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/operator/controller/ingress/monitoring_test.go`:
- Around line 55-62: Add a baseline assertion that the generated ServiceMonitor
has the expected default serviceDiscoveryRole before you mutate it: use
desiredServiceMonitor(...) to produce sm4, read the nested field (e.g., via
unstructured.NestedString(sm4.Object, "spec", "serviceDiscoveryRole")) and
assert it equals "EndpointSlice" (or the expected default) so the test fails if
defaulting stops setting that value; then proceed with the existing mutation and
serviceMonitorChanged check. Reference: desiredServiceMonitor, sm4,
serviceMonitorChanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c2d61ccf-afdf-4dda-93a2-0b4bcc032207

📥 Commits

Reviewing files that changed from the base of the PR and between 7790136 and a5951a5.

📒 Files selected for processing (6)
  • manifests/0000_90_ingress-operator_00_prometheusrole.yaml
  • manifests/0000_90_ingress-operator_02_servicemonitor.yaml
  • pkg/manifests/assets/router/metrics/role.yaml
  • pkg/operator/controller/ingress/metrics.go
  • pkg/operator/controller/ingress/monitoring.go
  • pkg/operator/controller/ingress/monitoring_test.go

@machine424
Copy link
Author

/retest-required

@machine424
Copy link
Author

/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 18, 2026

@machine424: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-hypershift-conformance 028874d link true /test e2e-aws-ovn-hypershift-conformance
ci/prow/e2e-azure-operator 028874d link true /test e2e-azure-operator
ci/prow/e2e-aws-ovn 028874d link true /test e2e-aws-ovn
ci/prow/e2e-hypershift 028874d link true /test e2e-hypershift
ci/prow/e2e-aws-operator 028874d link true /test e2e-aws-operator
ci/prow/e2e-gcp-operator 028874d link true /test e2e-gcp-operator

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants