Agent version
7.78.2
Bug Report
The chart's init-volume init container runs:
cp -r /etc/datadog-agent /opt
against the agent container's own filesystem. On 7.77.3 and earlier images, every file under /etc/datadog-agent was reachable by a non-root UID with gid 0 in supplementalGroups. On 7.78.x (verified on 7.78.2; presumably also 7.78.0 / 7.78.1) at least one file under /etc/datadog-agent (or beneath, e.g. cont-init.d/) is no longer readable by that user, so cp -r exits non-zero and every node-agent DaemonSet pod is wedged in Init:CrashLoopBackOff on init-volume.
The agent never starts, so agent status and agent logs are not available; the actual cp stderr is only reachable via kubectl logs <pod> -c init-volume --previous. We don't currently have a broken pod to exec into and have not pinpointed the exact offending path.
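To pinpoint the offending path without a live pod, a minimal sketch, assuming docker is available locally and GNU find is present in the image (the UID/GIDs mirror the pod securityContext under Agent configuration):

# Run as the same non-root UID with gid 0 supplemental that the DaemonSet uses,
# and list everything under /etc/datadog-agent that this user cannot read:
docker run --rm --entrypoint find \
  --user 65534:65534 --group-add 0 \
  datadog/agent:7.78.2 \
  /etc/datadog-agent ! -readable -ls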
Ask: restore the previous filesystem-permission contract on /etc/datadog-agent (and /etc/cont-init.d) inside the agent image — every file readable by other, or by root group. If the stricter mode is intentional, please flag it as a breaking change in the release notes and update datadog/datadog so init-volume / init-config work for users running the agent as a non-root UID with gid 0 in supplementalGroups (the documented hardened deployment pattern). A regression test that boots the image with runAsUser != 0 and runs cp -r /etc/datadog-agent /opt would catch this class of issue going forward.
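A sketch of that test, assuming docker in CI; the image tag is just a placeholder for whatever build is under test:

#!/bin/sh -e
# Fail the build if a non-root UID (with gid 0 in supplementalGroups, the
# documented hardened pattern) can no longer copy the config tree the way
# init-volume does. /tmp is used as the destination since it is world-writable.
IMAGE="${1:-datadog/agent:7.78.2}"
docker run --rm --entrypoint sh \
  --user 65534:65534 --group-add 0 \
  "$IMAGE" -c 'cp -r /etc/datadog-agent /tmp/copy && echo PASS'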
Reproduction Steps
- Install datadog/datadog with the values shown under Agent configuration, but with agents.image.tag: "7.77.3". DaemonSet rolls out cleanly.
- Bump only agents.image.tag to "7.78.2" (hand-run helm equivalent sketched below). No other value changes; same chart, same RBAC, same manifests.
- Wait for the rollout.
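The bump in the second step, as a hand-run helm equivalent; we actually manage the release through Terraform helm_release, and the release name and namespace here are placeholders:

helm upgrade datadog datadog/datadog -n datadog \
  --reuse-values \
  --set agents.image.tag=7.78.2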
Expected: DaemonSet rolls out cleanly, same as on 7.77.3.
Observed: every node-agent pod stuck in Init:CrashLoopBackOff. Representative kubelet events:
BackOff: Back-off restarting failed container init-volume in pod datadog-<hash>_datadog
Pulled: Container image "<registry>/datadog/agent:7.78.2" already present on machine
Reverting only agents.image.tag back to "7.77.3" (no other change) makes pods come up immediately.
Sanity check that isolates the failure to the non-root pod securityContext: on the same upgrade, the datadog-cluster-agent Deployment and datadog-clusterchecks Deployment use the identical 7.78.2 image and come up cleanly. They differ from the DaemonSet only in not overriding runAsUser, so they run as the chart-default agent user (uid 100), which has full access to /etc/datadog-agent.
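To see the split directly (resource names assume the chart defaults for a release named datadog; adjust as needed):

# DaemonSet: pod-level runAsUser override present (prints 65534):
kubectl -n datadog get ds datadog \
  -o jsonpath='{.spec.template.spec.securityContext.runAsUser}'
# Deployment: no runAsUser override (prints nothing), so the image-default
# agent user (uid 100) applies:
kubectl -n datadog get deploy datadog-cluster-agent \
  -o jsonpath='{.spec.template.spec.securityContext.runAsUser}'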
Workaround that unsticks the rollout without bumping chart or downgrading image: override the chart's per-init-container securityContext so just init-volume and init-config run as root; the long-lived agent containers stay non-root.
agents:
  containers:
    initContainers:
      securityContext:
        runAsUser: 0
        runAsGroup: 0
        runAsNonRoot: false
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities: { drop: ["ALL"] }
        seccompProfile: { type: RuntimeDefault }
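Applied on top of the existing release; the override file name is arbitrary:

helm upgrade datadog datadog/datadog -n datadog \
  --reuse-values -f init-containers-root.yaml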
Agent configuration
Helm values applied to the broken (7.78.2) and working (7.77.3) deploys are byte-for-byte identical except for image tags. Relevant subset, with secrets externalized:
datadog:
  apiKeyExistingSecret: datadog-api-key
  appKeyExistingSecret: datadog-app-key
  clusterName: <redacted>
  checksCardinality: orchestrator
  apm: { socketEnabled: false, portEnabled: true, port: 8126 }
  dogstatsd: { originDetection: true, tagCardinality: orchestrator, useHostPort: true }
  logs: { enabled: true, containerCollectAll: true, autoMultiLineDetection: true }
  operator: { enabled: false }
  # The trigger: pod-level non-root securityContext applied to all DaemonSet containers
  # (including init containers) — works on 7.77.3, breaks on 7.78.x.
  securityContext:
    runAsUser: 65534
    runAsGroup: 65534
    fsGroup: 65534
    runAsNonRoot: true
    supplementalGroups: [0, 983] # root, systemd-journal (for /var/log/journal)
    seccompProfile: { type: RuntimeDefault }
agents:
  image: { tag: "7.78.2" } # bug; "7.77.3" works
  useHostNetwork: true
  containers:
    agent:
      securityContext:
        capabilities: { add: ["NET_ADMIN"] }
    # initContainers.securityContext: NOT overridden — inherits the pod-level above.
clusterAgent:
  image: { tag: "7.78.2" }
  securityContext: { seccompProfile: { type: RuntimeDefault } } # no runAsUser override → unaffected
clusterChecksRunner:
  enabled: true
  image: { tag: "7.78.2" }
  securityContext: { seccompProfile: { type: RuntimeDefault } } # no runAsUser override → unaffected
  rbac: { dedicated: true }
Operating System
Linux (containerized agent on Amazon Linux 2023 EKS-optimized hosts)
Other environment details
Kubernetes 1.35 on EKS; datadog/datadog Helm chart 3.197.0 (unchanged across the bad upgrade); node architectures linux/amd64 (m7i) and linux/arm64 (m8g) — bug reproduces on both; Helm release managed by Terraform helm_release.