kubelet check: enabling use_stats_summary_as_source double-reports kubernetes.cpu.usage.total / kubernetes.memory.* (sum is ~2x) #50544

@a7i

Agent version

7.78.2

Bug Report

Description

After setting use_stats_summary_as_source: true in the kubelet check, sum aggregations of container resource utilization metrics roughly double:

  • kubernetes.cpu.usage.total
  • kubernetes.memory.usage
  • kubernetes.memory.working_set

The change is visible only on sum: aggregations, which is consistent with the same container series being emitted by both the cAdvisor source (/metrics/cadvisor) and the Summary API source (/stats/summary) at the same time, instead of Summary replacing cAdvisor. Per-series values look unchanged; the total roughly doubles.
Image
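The arithmetic behind that observation can be sketched as follows. This is only an illustration with made-up container names and values, not data from the affected cluster: if both sources emit one point per container, a sum over all points doubles while the average per point stays the same.

```python
# Hypothetical per-container samples keyed by (pod, container).
# The values are invented for illustration only.
cadvisor = {("pod-a", "app"): 100.0, ("pod-b", "app"): 250.0}
summary = {("pod-a", "app"): 100.0, ("pod-b", "app"): 250.0}


def emit(*sources):
    """Flatten the given sources into a list of (tags, value) points."""
    return [(tags, value) for source in sources for tags, value in source.items()]


def sum_agg(points):
    """Sum aggregation over every emitted point."""
    return sum(value for _, value in points)


def avg_agg(points):
    """Average aggregation over every emitted point."""
    return sum_agg(points) / len(points)


single = emit(summary)            # Summary replaces cAdvisor
double = emit(cadvisor, summary)  # both sources emit concurrently

print("sum:", sum_agg(single), "->", sum_agg(double))
print("avg:", avg_agg(single), "->", avg_agg(double))
```

With these sample values the sum goes from 350.0 to 700.0 while the average stays at 175.0, which matches the "only sum: is affected" symptom.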

This is not a real load increase. container.cpu.usage (collected directly from cgroups, independent of the kubelet endpoints) is flat across the same change window, while sum:kubernetes.cpu.usage.total steps up ~2x at deploy time. See attached screenshots.
Image

Environment

  • Datadog Agent: 7.78.2
  • Kubernetes: EKS
  • Container runtime: containerd, runc workloads (no gVisor on the affected nodes)
  • Setting that triggers the change: use_stats_summary_as_source: true in the kubelet check

Expected behavior

When use_stats_summary_as_source: true, the kubelet check should emit kubernetes.cpu.* and kubernetes.memory.* for each container from a single source (Summary), so existing dashboards / monitors using sum: remain correct.
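As a rough sketch of the expected behavior (not the Agent's actual implementation; `select_sources` and `collect` are hypothetical names, and the assumption that cAdvisor is the only source when the flag is false is inferred from this report), the flag would pick exactly one endpoint per run:

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, float]  # (container id, value); hypothetical shape


def select_sources(use_stats_summary_as_source: bool) -> List[str]:
    """Return the single endpoint that should feed kubernetes.cpu.* / kubernetes.memory.*."""
    return ["/stats/summary"] if use_stats_summary_as_source else ["/metrics/cadvisor"]


def collect(samples_by_source: Dict[str, List[Sample]],
            use_stats_summary_as_source: bool) -> List[Sample]:
    """Gather points only from the selected source, so each container appears once."""
    points: List[Sample] = []
    for source in select_sources(use_stats_summary_as_source):
        points.extend(samples_by_source.get(source, []))
    return points


# Invented samples: both endpoints report the same two containers.
samples_by_source = {
    "/metrics/cadvisor": [("c1", 100.0), ("c2", 250.0)],
    "/stats/summary": [("c1", 100.0), ("c2", 250.0)],
}

points = collect(samples_by_source, use_stats_summary_as_source=True)
total = sum(value for _, value in points)
print(len(points), total)
```

Under this model the sum is the same regardless of the flag value, which is what existing sum: dashboards and monitors rely on.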

Actual behavior

sum:kubernetes.cpu.usage.total and sum:kubernetes.memory.working_set step up by roughly 2x at the moment the flag is enabled. Cross-checks:

  • sum:container.cpu.usage (cgroup-direct) is unchanged, so host load did not change.
  • system.cpu.* shows no corresponding jump.

This pattern is consistent with cAdvisor-derived and Summary-derived series being emitted concurrently for the same containers.
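The cross-check above can be expressed as a small script. This is a minimal sketch over invented sample series (not exported from our dashboards), and `step_ratio` / `looks_like_double_count` are hypothetical helper names: it compares the before/after ratio of the kubelet-derived sum against a reference sum such as sum:container.cpu.usage.

```python
from statistics import mean


def step_ratio(series, change_index):
    """Ratio of the mean after the change to the mean before it."""
    before = series[:change_index]
    after = series[change_index:]
    return mean(after) / mean(before)


def looks_like_double_count(kubelet_series, reference_series, change_index, tolerance=0.15):
    """True if the kubelet-derived sum steps ~2x while the reference stays ~1x."""
    kubelet_ratio = step_ratio(kubelet_series, change_index)
    reference_ratio = step_ratio(reference_series, change_index)
    return abs(kubelet_ratio - 2.0) <= tolerance and abs(reference_ratio - 1.0) <= tolerance


# Invented values; index 3 is the point where the flag was enabled.
kube_cpu_sum = [10.0, 10.2, 9.8, 20.1, 19.9, 20.0]
container_cpu_sum = [10.0, 10.1, 9.9, 10.0, 10.1, 9.9]

print(looks_like_double_count(kube_cpu_sum, container_cpu_sum, change_index=3))
```

The tolerance is an arbitrary choice for the sketch; a real check would need to account for normal workload variance.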

Why other metric families are unaffected

| Metric family | Collector | Source | Affected by use_stats_summary_as_source? |
| --- | --- | --- | --- |
| system.cpu.* | system check | host kernel stats | No |
| container.cpu.*, container.memory.* | container check | cgroups + runtime | No |
| kubernetes.cpu.*, kubernetes.memory.* | kubelet check | /metrics/cadvisor and/or /stats/summary | Yes, and currently both, hence the 2x |

Two independent ground-truth sources (system.* and container.*) stay flat while the kubelet-derived family doubles, which points at double counting inside the kubelet check rather than a workload change.

Screenshots

(attached below) Step-up in sum:kubernetes.cpu.usage.total and sum:kubernetes.memory.working_set at the time use_stats_summary_as_source: true was rolled out, alongside sum:container.cpu.usage flat across the same window.

Workaround

There is no workaround for runsc workloads.

For standard runc workloads, reverting use_stats_summary_as_source to its default (false) avoids the duplication.

However, we originally enabled use_stats_summary_as_source: true because we need the kubelet Summary (CRI) source for sandboxed workloads such as gVisor (runsc), where cAdvisor does not report per-container resource usage at all (see #44084). For those workloads we have no workaround: turning the flag off loses the metrics entirely, while turning it on double-counts every other container on the node.

Agent configuration

init_config:
  loader: core
instances:
  - kubelet_metrics_endpoint: https://localhost:10250/metrics
    use_stats_summary_as_source: true
    min_collection_interval: 20

Operating System

Linux (EKS nodes)

Labels: oss/0, pending, team/container-integrations
