[BUG] Agent 7.78.x: init-volume cp -r /etc/datadog-agent fails for non-root pods (regression vs 7.77.3) #50235

@jefflub-ashby

Description

Agent version

7.78.2

Bug Report

The chart's init-volume init container runs:

cp -r /etc/datadog-agent /opt

against the agent container's own filesystem. On 7.77.3 and earlier images, every file under /etc/datadog-agent was readable by a non-root UID with gid 0 in supplementalGroups. On 7.78.x (verified on 7.78.2; presumably also 7.78.0 / 7.78.1), at least one file under /etc/datadog-agent (or a subdirectory such as cont-init.d/) is no longer readable by that user, so cp -r exits non-zero and every node-agent DaemonSet pod is wedged in Init:CrashLoopBackOff on init-volume.

The agent never starts, so agent status and agent logs are not available; the actual cp stderr is only reachable via kubectl logs <pod> -c init-volume --previous. We don't currently have a broken pod to exec into and have not pinpointed the exact offending path.
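
For anyone triaging the same failure, a sketch of pulling what little survives. The app=datadog label selector and the datadog namespace are assumptions about a default release name; adjust to your install:

NS=datadog
POD=$(kubectl -n "$NS" get pods -l app=datadog \
        -o jsonpath='{.items[0].metadata.name}')
kubectl -n "$NS" logs "$POD" -c init-volume --previous   # the cp stderr
kubectl -n "$NS" describe pod "$POD" | grep -A3 init-volume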

Ask: restore the previous filesystem-permission contract on /etc/datadog-agent (and /etc/cont-init.d) inside the agent image, i.e. every file readable by other or by the root group. If the stricter mode is intentional, please flag it as a breaking change in the release notes and update datadog/datadog so init-volume / init-config work for users running the agent as a non-root UID with gid 0 in supplementalGroups (the documented hardened deployment pattern). A regression test that boots the image with runAsUser != 0 and runs cp -r /etc/datadog-agent /opt would catch this class of issue going forward; a sketch follows.
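
A minimal sketch of such a test, assuming docker is how CI would boot the image. The UID/GIDs mirror our pod securityContext, and /tmp/out is an arbitrary writable target standing in for the /opt volume mount:

# Hypothetical regression test: the exact copy init-volume performs must
# succeed for a non-root UID with gid 0 in its supplementary groups.
for tag in 7.77.3 7.78.2; do
  docker run --rm --user 65534:65534 --group-add 0 \
    --entrypoint /bin/sh "datadog/agent:${tag}" \
    -c 'cp -r /etc/datadog-agent /tmp/out' \
    && echo "${tag}: OK" \
    || echo "${tag}: FAIL"    # today: 7.77.3 OK, 7.78.2 FAIL
done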

Reproduction Steps

  1. Install datadog/datadog with the values shown under Agent configuration, but with agents.image.tag: "7.77.3". DaemonSet rolls out cleanly.
  2. Bump only agents.image.tag to "7.78.2". No other value changes; same chart, same RBAC, same manifests. (Helm invocations are sketched after these steps.)
  3. Wait for the rollout.
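
The helm invocations, sketched. Release name "datadog", namespace "datadog", and values.yaml are assumptions; we drive the same values through Terraform's helm_release:

# Step 1: clean install on the known-good tag
helm upgrade --install datadog datadog/datadog \
  -n datadog --version 3.197.0 -f values.yaml \
  --set agents.image.tag=7.77.3

# Step 2: bump only the node-agent image tag
helm upgrade datadog datadog/datadog \
  -n datadog --version 3.197.0 -f values.yaml \
  --set agents.image.tag=7.78.2

# Step 3: watch the rollout wedge
kubectl -n datadog rollout status ds/datadog --timeout=5m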

Expected: DaemonSet rolls out cleanly, same as on 7.77.3.

Observed: every node-agent pod stuck in Init:CrashLoopBackOff. Representative kubelet events:

BackOff: Back-off restarting failed container init-volume in pod datadog-<hash>_datadog
Pulled:  Container image "<registry>/datadog/agent:7.78.2" already present on machine

Reverting only agents.image.tag back to "7.77.3" (no other change) makes pods come up immediately.

A sanity check isolates the failure to the non-root pod securityContext: on the same upgrade, the datadog-cluster-agent and datadog-clusterchecks Deployments use the identical 7.78.2 image and come up cleanly. They differ from the DaemonSet only in not overriding runAsUser, so they run as the chart-default agent user (uid 100), which has full access to /etc/datadog-agent. A debug sketch for pinpointing the offending path follows.
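
A debug sketch to pinpoint the path without a cluster, assuming GNU find and /bin/sh are present in the image (we have not yet run this exact command; the UID/GIDs mirror the pod securityContext below):

# Run as the same non-root UID with gid 0 supplemental; print the mode of
# every entry the access(2) readability check rejects.
docker run --rm --user 65534:65534 --group-add 0 \
  --entrypoint /bin/sh datadog/agent:7.78.2 \
  -c 'find /etc/datadog-agent /etc/cont-init.d ! -readable -exec ls -ld {} +'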

A workaround that unsticks the rollout without bumping the chart or downgrading the image: override the chart's per-init-container securityContext so that only init-volume and init-config run as root; the long-lived agent containers stay non-root.

agents:
  containers:
    initContainers:
      securityContext:
        runAsUser: 0
        runAsGroup: 0
        runAsNonRoot: false
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities: { drop: ["ALL"] }
        seccompProfile: { type: RuntimeDefault }
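
Applying it, sketched (the override file name is hypothetical; we set the same values through Terraform):

helm upgrade datadog datadog/datadog -n datadog \
  --version 3.197.0 -f values.yaml -f init-root-override.yaml
kubectl -n datadog rollout status ds/datadog   # completes with the override in place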

Agent configuration

Helm values applied to the broken (7.78.2) and working (7.77.3) deploys are byte-for-byte identical except for image tags. Relevant subset, with secrets externalized:

datadog:
  apiKeyExistingSecret: datadog-api-key
  appKeyExistingSecret: datadog-app-key
  clusterName: <redacted>
  checksCardinality: orchestrator
  apm:       { socketEnabled: false, portEnabled: true, port: 8126 }
  dogstatsd: { originDetection: true, tagCardinality: orchestrator, useHostPort: true }
  logs:      { enabled: true, containerCollectAll: true, autoMultiLineDetection: true }
  operator:  { enabled: false }
  # The trigger: pod-level non-root securityContext applied to all DaemonSet containers
  # (including init containers) — works on 7.77.3, breaks on 7.78.x.
  securityContext:
    runAsUser: 65534
    runAsGroup: 65534
    fsGroup: 65534
    runAsNonRoot: true
    supplementalGroups: [0, 983]   # root, systemd-journal (for /var/log/journal)
    seccompProfile: { type: RuntimeDefault }

agents:
  image: { tag: "7.78.2" }   # bug; "7.77.3" works
  useHostNetwork: true
  containers:
    agent:
      securityContext:
        capabilities: { add: ["NET_ADMIN"] }
    # initContainers.securityContext: NOT overridden — inherits the pod-level above.

clusterAgent:
  image: { tag: "7.78.2" }
  securityContext: { seccompProfile: { type: RuntimeDefault } }   # no runAsUser override → unaffected

clusterChecksRunner:
  enabled: true
  image: { tag: "7.78.2" }
  securityContext: { seccompProfile: { type: RuntimeDefault } }   # no runAsUser override → unaffected
  rbac: { dedicated: true }

Operating System

Linux (containerized agent on Amazon Linux 2023 EKS-optimized hosts)

Other environment details

Kubernetes 1.35 on EKS; datadog/datadog Helm chart 3.197.0 (unchanged across the bad upgrade); node architectures linux/amd64 (m7i) and linux/arm64 (m8g) — bug reproduces on both; Helm release managed by Terraform helm_release.
