Description
When using the systemd cgroup driver with a CPU limit of 4096m, pod creation fails intermittently: containerd non-deterministically calculates either 409600 or 410000 microseconds for the parent (pod) cgroup's CPU quota, while runc consistently calculates 410000 for the child (container) cgroup. When the values mismatch, the Linux kernel rejects the write to the child cgroup's cpu.cfs_quota_us with EINVAL ("invalid argument").
Root Cause
Investigation reveals non-deterministic behavior in containerd when converting 4096m to microseconds:
- Containerd (when creating the pod sandbox) - INCONSISTENT:
- Sometimes calculates: 4096m → 409600 microseconds (correct: 4096 / 1000 * 100000)
- Sometimes calculates: 4096m → 410000 microseconds (rounded: 4.1 * 100000)
- Sets parent cgroup: cpu.cfs_quota_us to whichever value it calculated
- runc (when creating the application container) - CONSISTENT:
- Always calculates: 4096m → 410000 microseconds (appears to round 4.096 to 4.1)
- Tries to set child cgroup: cpu.cfs_quota_us = 410000
- Result:
- When containerd picks 410000: Parent = 410000, child = 410000 → Success!
- When containerd picks 409600: Parent = 409600, child = 410000 → Kernel rejects! (child > parent)
- In cgroup v1, a child's quota cannot exceed its parent's quota (see the sketch below)
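A minimal sketch of this kernel-side constraint, assuming a cgroup v1 cpu hierarchy mounted at /sys/fs/cgroup/cpu and root privileges; the group names are hypothetical, for illustration only:

```sh
# Create a parent/child pair and set the parent quota to containerd's value.
mkdir -p /sys/fs/cgroup/cpu/demo-parent/demo-child
echo 409600 > /sys/fs/cgroup/cpu/demo-parent/cpu.cfs_quota_us
# Attempt runc's value on the child; the write fails with "Invalid argument"
# because a child's quota may not exceed its parent's quota in cgroup v1.
echo 410000 > /sys/fs/cgroup/cpu/demo-parent/demo-child/cpu.cfs_quota_us
```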
Why It Appears Node-Specific
The issue seems to only affect "previously used nodes" because:
- When containerd picks 409600 and the pod fails, the parent cgroup gets stuck
- The pause container remains alive with the 409600 parent cgroup
- All subsequent attempts to create the pod on that node fail (child 410000 > parent 409600)
- Fresh nodes might get lucky and containerd picks 410000 → works fine
- But those nodes would fail too if containerd had picked 409600 on first attempt
This is not about stale cgroups from old pods - it's about which value containerd randomly picks during pod sandbox creation.
Error Message
runc's write to the child cgroup's cpu.cfs_quota_us fails and the kernel returns EINVAL ("invalid argument"); see the evidence below.
Evidence from Investigation
- Failing node: containerd set the parent cgroup's cpu.cfs_quota_us to 409600
- Working node: containerd set the parent cgroup's cpu.cfs_quota_us to 410000
Both nodes running:
- Same containerd version: 1.7.27
- Same runc version: 1.3.2
- Same Kubernetes version: 1.30.14-eks-113cf36
- Same pod spec with CPU limit: 4096m
Additional Context
- This issue started occurring after changing CPU limits from 8192m → 4096m
- The problem is specific to CPU values resulting in fractional cores (4.096)
- The 400 microsecond difference (410000 - 409600) violates cgroup v1's parent-child quota constraint
- Critical finding: Same containerd version behaves differently - this is non-deterministic
- Calculation theory:
- Correct: 4096 / 1000 * 100000 = 409600
- Rounded: 4.1 * 100000 = 410000 (rounding 4.096 to 4.1)
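Both candidate calculations can be reproduced with plain shell arithmetic; this illustrates the theory above, not containerd's actual code path:

```sh
echo $(( 4096 * 100000 / 1000 ))               # exact integer math   -> 409600
awk 'BEGIN { printf "%.0f\n", 4.1 * 100000 }'  # 4.096 rounded to 4.1 -> 410000
```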
Questions for Maintainers
- Where in containerd's codebase does the millicore → microsecond conversion happen for pod sandbox creation?
- Why would containerd calculate two different values (409600 vs 410000) for the same input (4096m)?
- Is there a race condition or different code path that causes this non-determinism?
- Should containerd and runc be using shared conversion logic to ensure consistency?
Related Issues
This appears similar to but distinct from:
- systemd driver updates CPU quota inconsitently #4622 - Systemd rounding to nearest 10ms (closed/fixed)
- Reducing CPU period fails for subsystems if existing parent has quota>0 with systemd driver #3084 - Parent quota>0 with period changes (closed/fixed)
- Error starting container - failed to write to cpu.cfs_quota_us kubernetes/kubernetes#61192 - Systemd rounding from 2018 (closed/fixed)
However, this is a new issue involving non-deterministic behavior in containerd 1.7.27 when calculating CPU quotas for fractional core values with systemd cgroup driver.
Steps to reproduce the issue
1. Deploy a Kubernetes pod with CPU limit 4096m multiple times on different fresh nodes (see the example commands and pod spec after this list)
- Observe: Some pods succeed, some fail (non-deterministic)
- Successful pods: containerd calculated parent cgroup cpu.cfs_quota_us = 410000
- Failed pods: containerd calculated parent cgroup cpu.cfs_quota_us = 409600
- runc always tries to write 410000 for the child cgroup
2. On nodes where containerd picked 409600:
- runc attempts to create application container
- runc tries to write 410000 to child cgroup's cpu.cfs_quota_us
- Kernel rejects: child quota (410000) > parent quota (409600)
- Container creation fails with "invalid argument" error
- Pod enters CrashLoopBackOff
- Pause container remains alive with parent cgroup stuck at 409600
3. All subsequent restart attempts on that node continue to fail:
- Containerd reuses the existing pod sandbox
- Parent cgroup still has 409600
- runc still tries 410000
- Pattern repeats indefinitely
4. Evicting the pod and forcing it to a different node:
- May work if containerd picks 410000 on the new node
- Will fail if containerd picks 409600 on the new node
- Outcome is non-deterministic
Example pod spec that reproduces the issue:
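A minimal manifest with the limit described in this report; the name and image are placeholders, not the affected workload:

```sh
# Write the example spec used in the loop above.
cat > cpu-quota-repro.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cpu-quota-repro
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: "4096m"
      limits:
        cpu: "4096m"
EOF
```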
How to Verify Which Value Containerd Picked
On a node where the pod was deployed:
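For example (the kubepods path below reflects the typical systemd-driver cgroup v1 layout; the pod and container slice names vary per pod UID):

```sh
# Dump every CFS quota under the kubepods hierarchy, with file paths.
find /sys/fs/cgroup/cpu/kubepods.slice -name cpu.cfs_quota_us -exec grep -H . {} +
# A "stuck" node shows 409600 on the pod-level slice while runc keeps
# retrying 410000 for the container scope beneath it.
```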
Critical Note: The issue is not about "previously used nodes" - it's about which value containerd randomly calculates during initial pod sandbox creation. The appearance of being node-specific is because once a node gets stuck with 409600, it stays stuck.
Describe the results you received and expected
Expected behavior:
Containerd and runc should use consistent, deterministic calculations when converting millicores to microseconds for CPU quotas.
For a CPU limit of 4096m:
- Both containerd and runc should calculate: 4096 / 1000 * 100000 = 409600 microseconds
- OR both should calculate: 4.1 * 100000 = 410000 microseconds
- They must agree - parent and child cgroups must have compatible values
- Container should create successfully every time, regardless of node
- Behavior should be deterministic, not random
Actual behavior:
- Containerd: Non-deterministically calculates either 409600 or 410000 for the same input
- Sometimes: 409600 microseconds (mathematically correct)
- Sometimes: 410000 microseconds (rounded)
- No obvious pattern - same version, same config, different results
- runc: Consistently calculates 410000 microseconds (always rounds 4.096 to 4.1)
- When they mismatch (containerd=409600, runc=410000):
- Child cgroup creation fails with kernel error: "invalid argument"
- Pod enters CrashLoopBackOff with 199+ restart attempts
- Parent cgroup gets stuck with 409600, preventing all future attempts
- Requires manual node cordoning and pod eviction
- When they match (containerd=410000, runc=410000):
- Pod works perfectly fine
Impact:
- Non-deterministic pod scheduling - same pod spec may work or fail randomly
- Cannot reliably deploy pods with CPU limit 4096m (or other fractional core values)
- Once a node "loses the lottery" and gets 409600, it's permanently broken for that pod
- Requires operational workarounds (cordon/drain/evict)
- Production impact on Amazon EKS clusters
Root Issue:
This is fundamentally a consistency bug - containerd and runc must use the same conversion logic, and that logic must be deterministic.
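A minimal sketch of the deterministic, integer-only conversion this implies both components should share, assuming the default 100ms CFS period; multiplying before dividing avoids integer truncation:

```sh
millicores=4096
period_us=100000
echo $(( millicores * period_us / 1000 ))   # always 409600, no float rounding
```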
What version of runc are you using?
runc 1.3.2
Additional environment details:
- containerd version: 1.7.27 (commit: 05044ec0a9a75232cad458027ca83437aae3f4da)
- Kubernetes version: 1.30.14-eks-113cf36 (Amazon EKS)
- Cgroup version: v1
- Cgroup driver: systemd (SystemdCgroup = true in containerd config at /etc/containerd/config.toml)
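A quick way to confirm the driver setting on a node (containerd 1.7 CRI config layout):

```sh
grep -n 'SystemdCgroup' /etc/containerd/config.toml
# expected: SystemdCgroup = true
```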
Host OS information
Platform: Amazon EKS (Elastic Kubernetes Service) managed node
Host kernel information
Kernel version: 5.10.245-241.976.amzn2.x86_64