Description
When using the systemd cgroup driver with a CPU limit of 4096m, pod creation fails intermittently: containerd non-deterministically calculates either 409600 or 410000 microseconds for the parent (pod) cgroup's CPU quota, while runc consistently calculates 410000 for the child (container) cgroup. When the values mismatch, the Linux kernel rejects the write to the child cgroup's cpu.cfs_quota_us with EINVAL ("invalid argument").
Root Cause
Investigation reveals non-deterministic behavior in containerd when converting 4096m to microseconds:
- Containerd (when creating the pod sandbox) - INCONSISTENT:
- Sometimes calculates: 4096m → 409600 microseconds (correct: 4096 / 1000 * 100000)
- Sometimes calculates: 4096m → 410000 microseconds (rounded: 4.1 * 100000)
- Sets parent cgroup: cpu.cfs_quota_us to whichever value it calculated
- runc (when creating the application container) - CONSISTENT:
- Always calculates: 4096m → 410000 microseconds (appears to round 4.096 to 4.1)
- Tries to set child cgroup: cpu.cfs_quota_us = 410000
- Result:
- When containerd picks 410000: Parent = 410000, child = 410000 → Success!
- When containerd picks 409600: Parent = 409600, child = 410000 → Kernel rejects! (child > parent)
- In cgroup v1, a child's quota cannot exceed its parent's quota (see the sketch below)
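A minimal sketch of this kernel-side constraint, assuming a cgroup v1 cpu hierarchy mounted at /sys/fs/cgroup/cpu and root privileges; the group names are hypothetical, for illustration only:

```sh
# Create a parent/child pair and set the parent quota to containerd's value.
mkdir -p /sys/fs/cgroup/cpu/demo-parent/demo-child
echo 409600 > /sys/fs/cgroup/cpu/demo-parent/cpu.cfs_quota_us
# Attempt runc's value on the child; the write fails with "Invalid argument"
# because a child's quota may not exceed its parent's quota in cgroup v1.
echo 410000 > /sys/fs/cgroup/cpu/demo-parent/demo-child/cpu.cfs_quota_us
```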
Why It Appears Node-Specific
The issue seems to only affect "previously used nodes" because:
- When containerd picks 409600 and the pod fails, the parent cgroup gets stuck
- The pause container remains alive with the 409600 parent cgroup
- All subsequent attempts to create the pod on that node fail (child 410000 > parent 409600)
- Fresh nodes might get lucky and containerd picks 410000 → works fine
- But those nodes would fail too if containerd had picked 409600 on first attempt
This is not about stale cgroups from old pods - it's about which value containerd randomly picks during pod sandbox creation.
Error Message
runc's write to the child cgroup's cpu.cfs_quota_us fails and the kernel returns EINVAL ("invalid argument"); see the evidence below.
Evidence from Investigation
- Failing node: containerd set the parent cgroup's cpu.cfs_quota_us to 409600
- Working node: containerd set the parent cgroup's cpu.cfs_quota_us to 410000
Both nodes running:
- Same containerd version: 1.7.27
- Same runc version: 1.3.2
- Same Kubernetes version: 1.30.14-eks-113cf36
- Same pod spec with CPU limit: 4096m
Additional Context
- This issue started occurring after changing CPU limits from 8192m → 4096m
- The problem is specific to CPU values resulting in fractional cores (4.096)
- The 400 microsecond difference (410000 - 409600) violates cgroup v1's parent-child quota constraint
- Critical finding: Same containerd version behaves differently - this is non-deterministic
- Calculation theory:
- Correct: 4096 / 1000 * 100000 = 409600
- Rounded: 4.1 * 100000 = 410000 (rounding 4.096 to 4.1)
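Both candidate calculations can be reproduced with plain shell arithmetic; this illustrates the theory above, not containerd's actual code path:

```sh
echo $(( 4096 * 100000 / 1000 ))               # exact integer math   -> 409600
awk 'BEGIN { printf "%.0f\n", 4.1 * 100000 }'  # 4.096 rounded to 4.1 -> 410000
```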
Questions for Maintainers
- Where in containerd's codebase does the millicore → microsecond conversion happen for pod sandbox creation?
- Why would containerd calculate two different values (409600 vs 410000) for the same input (4096m)?
- Is there a race condition or different code path that causes this non-determinism?
- Should containerd and runc be using shared conversion logic to ensure consistency?
Related Issues
This appears similar to but distinct from:
- systemd driver updates CPU quota inconsitently #4622 - Systemd rounding to nearest 10ms (closed/fixed)
- Reducing CPU period fails for subsystems if existing parent has quota>0 with systemd driver #3084 - Parent quota>0 with period changes (closed/fixed)
- Error starting container - failed to write to cpu.cfs_quota_us kubernetes/kubernetes#61192 - Systemd rounding from 2018 (closed/fixed)
However, this is a new issue involving non-deterministic behavior in containerd 1.7.27 when calculating CPU quotas for fractional core values with systemd cgroup driver.
Steps to reproduce the issue
1. Deploy a Kubernetes pod with CPU limit 4096m multiple times on different fresh nodes (see the example commands and pod spec after this list)
- Observe: Some pods succeed, some fail (non-deterministic)
- Successful pods: containerd calculated parent cgroup cpu.cfs_quota_us = 410000
- Failed pods: containerd calculated parent cgroup cpu.cfs_quota_us = 409600
- runc always tries to write 410000 for the child cgroup
2. On nodes where containerd picked 409600:
- runc attempts to create application container
- runc tries to write 410000 to child cgroup's cpu.cfs_quota_us
- Kernel rejects: child quota (410000) > parent quota (409600)
- Container creation fails with "invalid argument" error
- Pod enters CrashLoopBackOff
- Pause container remains alive with parent cgroup stuck at 409600
3. All subsequent restart attempts on that node continue to fail:
- Containerd reuses the existing pod sandbox
- Parent cgroup still has 409600
- runc still tries 410000
- Pattern repeats indefinitely
4. Evicting the pod and forcing it to a different node:
- May work if containerd picks 410000 on the new node
- Will fail if containerd picks 409600 on the new node
- Outcome is non-deterministic
Example pod spec that reproduces the issue:
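A minimal manifest with the limit described in this report; the name and image are placeholders, not the affected workload:

```sh
# Write the example spec used in the loop above.
cat > cpu-quota-repro.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cpu-quota-repro
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: "4096m"
      limits:
        cpu: "4096m"
EOF
```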
How to Verify Which Value Containerd Picked
On a node where the pod was deployed:
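For example (the kubepods path below reflects the typical systemd-driver cgroup v1 layout; the pod and container slice names vary per pod UID):

```sh
# Dump every CFS quota under the kubepods hierarchy, with file paths.
find /sys/fs/cgroup/cpu/kubepods.slice -name cpu.cfs_quota_us -exec grep -H . {} +
# A "stuck" node shows 409600 on the pod-level slice while runc keeps
# retrying 410000 for the container scope beneath it.
```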
Critical Note: The issue is not about "previously used nodes" - it's about which value containerd randomly calculates during initial pod sandbox creation. The appearance of being node-specific is because once a node gets stuck with 409600, it stays stuck.
Describe the results you received and expected
Expected behavior:
Containerd and runc should use consistent, deterministic calculations when converting millicores to microseconds for CPU quotas.
For a CPU limit of 4096m:
- Both containerd and runc should calculate: 4096 / 1000 * 100000 = 409600 microseconds
- OR both should calculate: 4.1 * 100000 = 410000 microseconds
- They must agree - parent and child cgroups must have compatible values
- Container should create successfully every time, regardless of node
- Behavior should be deterministic, not random
Actual behavior:
- Containerd: Non-deterministically calculates either 409600 or 410000 for the same input
- Sometimes: 409600 microseconds (mathematically correct)
- Sometimes: 410000 microseconds (rounded)
- No obvious pattern - same version, same config, different results
- runc: Consistently calculates 410000 microseconds (always rounds 4.096 to 4.1)
- When they mismatch (containerd=409600, runc=410000):
- Child cgroup creation fails with kernel error: "invalid argument"
- Pod enters CrashLoopBackOff with 199+ restart attempts
- Parent cgroup gets stuck with 409600, preventing all future attempts
- Requires manual node cordoning and pod eviction
- When they match (containerd=410000, runc=410000):
- Pod works perfectly fine
Impact:
- Non-deterministic pod scheduling - same pod spec may work or fail randomly
- Cannot reliably deploy pods with CPU limit 4096m (or other fractional core values)
- Once a node "loses the lottery" and gets 409600, it's permanently broken for that pod
- Requires operational workarounds (cordon/drain/evict)
- Production impact on Amazon EKS clusters
Root Issue:
This is fundamentally a consistency bug - containerd and runc must use the same conversion logic, and that logic must be deterministic.
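A minimal sketch of the deterministic, integer-only conversion this implies both components should share, assuming the default 100ms CFS period; multiplying before dividing avoids integer truncation:

```sh
millicores=4096
period_us=100000
echo $(( millicores * period_us / 1000 ))   # always 409600, no float rounding
```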
What version of runc are you using?
runc 1.3.2
Additional environment details:
- containerd version: 1.7.27 (commit: 05044ec0a9a75232cad458027ca83437aae3f4da)
- Kubernetes version: 1.30.14-eks-113cf36 (Amazon EKS)
- Cgroup version: v1
- Cgroup driver: systemd (SystemdCgroup = true in containerd config at /etc/containerd/config.toml)
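A quick way to confirm the driver setting on a node (containerd 1.7 CRI config layout):

```sh
grep -n 'SystemdCgroup' /etc/containerd/config.toml
# expected: SystemdCgroup = true
```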
Host OS information
Platform: Amazon EKS (Elastic Kubernetes Service) managed node
Host kernel information
Kernel version: 5.10.245-241.976.amzn2.x86_64