Container Resource Management
Your Java app needs 512MB but uses 2GB. Your Node.js service gets CPU throttled mysteriously. Containers die with no logs. This article explains how Docker limits resources, how to size containers correctly, and how to avoid the OOM killer.
📋 At a Glance
| Aspect | Details |
|---|---|
| Topic | Memory limits, CPU allocation, OOM killer, JVM/Node.js tuning |
| Complexity | Advanced |
| Prerequisites | Part 1 (Container Internals - cgroups) |
| Key Insight | Container limits are enforced by kernel cgroups, not the application |
| Time to Master | 3-4 hours |
🎯 What You'll Learn
- Memory management - hard limits, soft limits, swap, OOM behavior
- CPU management - shares, quota, pinning, throttling
- Language-specific tuning - JVM, Node.js, Python in containers
- Right-sizing - how to determine appropriate limits
- Monitoring - detecting resource problems before they cause outages
🔥 Production Story: The OOM Serial Killer
A team deployed microservices to Kubernetes. Pods kept restarting randomly - some after hours, some after minutes. No logs, no errors, just disappeared.
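With the heap capped at 75% of the container limit (for example via -XX:MaxRAMPercentage=75.0), the JVM allocates 75% of 512MB = 384MB of heap, leaving room for metaspace, thread stacks, and other native overhead.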
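As a rough sketch of that arithmetic (the function name `heap_mb` is made up for illustration, and the real JVM may round slightly differently):

```shell
# Approximate heap size (in MB) for a given container limit and heap percentage.
# Integer math only; this is an estimate, not what the JVM reports exactly.
heap_mb() {
  limit_mb=$1
  percent=$2
  echo $(( limit_mb * percent / 100 ))
}

heap=$(heap_mb 512 75)
echo "heap: ${heap}MB, non-heap headroom: $(( 512 - heap ))MB"
# prints: heap: 384MB, non-heap headroom: 128MB
```

The 128MB of headroom is what metaspace, thread stacks, and the code cache have to fit into.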
🧠 Mental Model: Resource Limits Stack
CONTAINER RESOURCE LIMITS

┌────────────────────────────────────────────────────────┐
│ APPLICATION LAYER                                      │
│                                                        │
│ JVM: -Xmx, -XX:MaxRAMPercentage                        │
│ Node: --max-old-space-size                             │
│ Python: Various pool sizes                             │
│                                                        │
│ ⚠️ App may try to use MORE than container allows!      │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ CONTAINER LAYER                                        │
│                                                        │
│ docker run --memory=512m --cpus=1.5                    │
│                                                        │
│ These translate to cgroup limits in kernel             │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ CGROUP LAYER                                           │
│                                                        │
│ memory.max = 536870912 (512MB)                         │
│ cpu.max = 150000 100000 (1.5 CPUs)                     │
│                                                        │
│ Kernel ENFORCES these limits                           │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ KERNEL ACTIONS                                         │
│                                                        │
│ Memory exceeded → OOM Killer → SIGKILL (exit 137)      │
│ CPU exceeded → Throttling (slower, not killed)         │
└────────────────────────────────────────────────────────┘
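The translation from docker run flags to the cgroup values in the diagram can be sketched in a few lines. This assumes cgroup v2 and the default 100ms CFS period; the function names are hypothetical helpers, not Docker commands.

```shell
# --memory=512m  ->  memory.max in bytes
mem_mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# --cpus=1.5  ->  cpu.max as "<quota_us> <period_us>" (period assumed to be 100000us)
cpus_to_cpu_max() {
  # +0.5 rounds to the nearest microsecond before %d truncates
  awk -v cpus="$1" 'BEGIN { printf "%d 100000\n", cpus * 100000 + 0.5 }'
}

mem_mb_to_bytes 512    # prints 536870912
cpus_to_cpu_max 1.5    # prints 150000 100000
```

Reading the numbers this way makes the quota explicit: 1.5 CPUs is 150ms of CPU time per 100ms period, summed across all threads in the container.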
🔬 Deep Dive
Memory Limits
CONTAINER MEMORY

┌────────────────────────────────────────────────────────┐
│ Application Heap                                       │
│ (Java: -Xmx, Node: --max-old-space-size)               │
├────────────────────────────────────────────────────────┤
│ Application Off-Heap                                   │
│ (JVM metaspace, native memory, thread stacks)          │
├────────────────────────────────────────────────────────┤
│ Kernel Buffers & Cache                                 │
│ (File system cache, network buffers)                   │
├────────────────────────────────────────────────────────┤
│ Shared Libraries                                       │
│ (libc, language runtime, etc.)                         │
└────────────────────────────────────────────────────────┘

Total counted against cgroup memory limit

⚠️ Kernel cache is usually reclaimable, but still counts until reclaimed!
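To see how close a container is to its limit, you can compare the cgroup's own counters. The sketch below assumes cgroup v2 file names (`memory.current`, `memory.max`); on cgroup v1 the files are named differently. The function name is hypothetical, and the demo uses a temporary directory with made-up numbers so it runs anywhere.

```shell
# Print memory usage as a percentage of the limit for a cgroup v2 directory.
# Inside a container the directory is usually /sys/fs/cgroup (assumption).
mem_usage_pct() {
  dir=${1:-/sys/fs/cgroup}
  current=$(cat "$dir/memory.current")
  max=$(cat "$dir/memory.max")
  if [ "$max" = "max" ]; then
    # The literal string "max" means no limit is set
    echo "unlimited"
    return 0
  fi
  echo $(( current * 100 / max ))
}

# Demo with fake counters: 384MB in use against a 512MB limit
demo=$(mktemp -d)
echo 402653184 > "$demo/memory.current"
echo 536870912 > "$demo/memory.max"
mem_usage_pct "$demo"    # prints 75
rm -rf "$demo"
```

Remember that the figure includes reclaimable page cache, so a high percentage is a prompt to look closer rather than proof of an imminent OOM kill.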
CPU Limits
Docker provides three main ways to limit CPU: a hard quota (--cpus), a relative weight (--cpu-shares), and core pinning (--cpuset-cpus).
| Scenario | Use | Why |
|---|---|---|
| Predictable workload | --cpus | Guaranteed limit |
| Batch jobs | --cpu-shares | Fair sharing |
| Latency-sensitive | --cpuset-cpus | Cache locality |
| Resource guarantee | --cpus + --cpu-shares | Both limit and priority |
JVM Container Tuning
Modern JVMs (8u191+, 11+) are container-aware: they read the cgroup limits rather than the host's totals. They still need configuration, because the default maximum heap is typically only 25% of the memory limit.
Container: 512MB
├── Heap: 75% = 384MB (-XX:MaxRAMPercentage=75.0)
├── Metaspace: ~64MB (class metadata)
├── Thread stacks: ~1MB per thread
├── Native memory: varies
├── JIT code cache: ~48MB
└── Other: GC, buffers

⚠️ If heap = container limit, you WILL OOM
Always leave 20-30% for non-heap
Node.js Container Tuning
Node.js has its own memory management. V8 enforces its own cap on the old-generation heap, and depending on the Node.js version that default may not match the container limit, so set --max-old-space-size (in MB) explicitly, to roughly 75% of the limit.
Python Container Tuning
Python's memory management is more automatic, but worker and pool sizes still determine the footprint: each worker process carries its own memory, so size worker counts to the container limit rather than to the host.
Monitoring Container Resources
Watch three signals: actual memory and CPU usage relative to the limits, how often the container is throttled, and OOM kill events.
Right-Sizing Containers
1. Profile application under load
   └─ Measure peak memory, average CPU
2. Add safety margin
   └─ Memory: +20-30% for GC, buffers
   └─ CPU: Consider burst capacity
3. Test under various conditions
   └─ Cold start
   └─ Peak load
   └─ Memory leak scenarios
4. Monitor in production
   └─ Actual usage vs limits
   └─ Throttling frequency
   └─ OOM events
5. Iterate
   └─ Adjust based on real data
| App Type | Memory Suggestion | CPU Suggestion |
|---|---|---|
| JVM microservice | 512MB - 1GB | 0.5 - 2 |
| Node.js API | 256MB - 512MB | 0.25 - 1 |
| Python web | 256MB - 512MB | 0.25 - 1 |
| ML inference | 2GB - 8GB | 1 - 4 |
| Static content | 64MB - 128MB | 0.1 - 0.25 |
⚠️ Common Mistakes
Mistake 1: Setting Heap = Container Limit
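Setting -Xmx equal to the container limit leaves no room for metaspace, thread stacks, code cache, and other native memory, so the kernel OOM-kills the container once total usage crosses the limit. Keep the heap at 65-75% of the limit.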
Mistake 2: Ignoring CPU Throttling
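A throttled container is not killed, it just gets slower, so the problem is easy to miss: docker stats can show usage below the limit while requests stall waiting for the next scheduling period. Watch the throttling counters, not only CPU usage.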
Mistake 3: Not Setting Any Limits
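Without limits, a single runaway container can consume all of the host's memory and CPU and starve every other container on the machine. Always set limits in production.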
🐛 Debug This: The Mysterious Slow Container
A developer reports: "My container runs fine with 2 CPUs but is incredibly slow with 0.5 CPU. It's not even using all 0.5 CPU according to docker stats!"
--cpus=0.5 means the container can use 50ms of CPU time per 100ms period. If the app does burst computation:

Period 1 (0-100ms):
├── 0-50ms: App runs at 100% (using its quota)
├── 50-100ms: App THROTTLED (waiting for next period)
└── Apparent usage: 50% of period, but app waited 50ms!

Period 2 (100-200ms):
├── 100-150ms: App runs again
├── 150-200ms: Throttled again
└── And so on...
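The cost of that pattern is easy to estimate. The sketch below is a simplified model, not how the kernel scheduler is implemented: it assumes a single-threaded burst that starts at a period boundary, nothing else in the container using CPU, and a quota no larger than the period. The function name is hypothetical.

```shell
# Wall-clock time (ms) for a burst needing work_ms of CPU time,
# given quota_ms of CPU per period_ms (requires quota_ms <= period_ms).
throttled_wall_ms() {
  work_ms=$1
  quota_ms=$2
  period_ms=$3
  periods=$(( (work_ms + quota_ms - 1) / quota_ms ))   # periods touched (ceiling)
  last_ms=$(( work_ms - (periods - 1) * quota_ms ))    # CPU time used in the final period
  echo $(( (periods - 1) * period_ms + last_ms ))
}

# 200ms of CPU work at --cpus=0.5 (50ms quota per 100ms period)
throttled_wall_ms 200 50 100    # prints 350
```

So a request that needs 200ms of CPU takes about 350ms of wall-clock time, while usage averaged over each period never looks higher than 50%.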
- Increase the CPU limit if the app needs it.
- Use CPU shares instead if you want burst capability.
- Optimize the app: reduce CPU-intensive operations.
- Adjust the period (advanced).
💻 Exercises
Exercise 1: Observe OOM Killer
⭐ Difficulty: Easy | ⏱️ Time: 15 minutes
Exercise 2: Measure CPU Throttling
⭐⭐ Difficulty: Medium | ⏱️ Time: 20 minutes
Exercise 3: JVM Container Sizing
⭐⭐ Difficulty: Medium | ⏱️ Time: 20 minutes
Exercise 4: Find the Right Limit
⭐⭐⭐ Difficulty: Hard | ⏱️ Time: 30 minutes
Exercise 5: Complete Resource Configuration
⭐⭐⭐⭐ Difficulty: Expert | ⏱️ Time: 30 minutes
Create a production-ready docker-compose file with proper resource management.
🎤 Senior-Level Interview Questions
Q1: Explain the difference between --memory and --memory-reservation.
"These are hard vs soft limits:
--memory (hard limit):
- Enforced by kernel cgroups
- If exceeded: OOM killer terminates the container
- Container cannot use more than this
--memory-reservation (soft limit):
- Scheduling hint for orchestrators
- Not enforced when memory is available
- When host is under pressure, kernel tries to reclaim down to reservation
- Container can exceed this if host has free memory
With a 512MB limit and a 256MB reservation (--memory=512m --memory-reservation=256m), the container:
- Guaranteed 256MB (orchestrator won't over-commit below this)
- Can burst to 512MB when available
- Is killed if it tries to exceed 512MB
In Kubernetes, requests maps to the reservation and limits maps to the hard limit. Set requests to typical usage and limits to peak plus a buffer.
I typically set the reservation to 50-70% of the limit for services with variable memory patterns."
Q2: A Java container keeps getting OOM killed despite setting -Xmx512m in a 512MB container. Why?
"JVM uses more than just heap memory. The OOM is because total JVM memory exceeds container limit.
Total JVM Memory = Heap (-Xmx)
                 + Metaspace (class metadata, ~64MB+)
                 + Thread stacks (~1MB per thread)
                 + Code cache (JIT compiled code, ~48MB)
                 + Native memory (NIO buffers, JNI)
                 + GC overhead

For the 512MB container with -Xmx512m:
- Heap: 512MB
- Metaspace: 64MB
- 50 threads: 50MB
- Code cache: 48MB
- Other: varies
- Total: ~700MB in a 512MB container = OOM
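The same estimate as a small script. The figures are the rough ones from the list above, not measurements, and the function name is made up for illustration; native memory and GC overhead are left out because they vary, which is why the sum lands a little under 700MB.

```shell
# Sum the fixed parts of the JVM footprint, all values in MB.
# Arguments: heap, metaspace, thread count, stack size per thread, code cache
jvm_footprint_mb() {
  heap_mb=$1
  metaspace_mb=$2
  threads=$3
  stack_mb=$4
  code_cache_mb=$5
  echo $(( heap_mb + metaspace_mb + threads * stack_mb + code_cache_mb ))
}

limit_mb=512
total=$(jvm_footprint_mb 512 64 50 1 48)
echo "estimated footprint: ${total}MB vs limit ${limit_mb}MB"
# prints: estimated footprint: 674MB vs limit 512MB
if [ "$total" -gt "$limit_mb" ]; then
  echo "over the limit: expect an OOM kill"
fi
```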
- Reduce the heap and leave room for overhead.
- Use container-aware settings such as -XX:MaxRAMPercentage=75.0, so the JVM calculates 75% of the container limit automatically.
- Limit the other areas (metaspace, code cache, thread stack size).
Rule of thumb: the heap should be at most 65-75% of the container limit."
Q3: How do you detect and handle CPU throttling?
"CPU throttling happens when a container exceeds its CPU quota within a scheduling period.
Symptoms:
- High latency spikes
- Lower than expected throughput
- docker stats showing less CPU usage than the limit
Monitoring:
- Prometheus: container_cpu_cfs_throttled_periods_total
- Alert when the throttling rate exceeds a threshold
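Outside Prometheus, the kernel exposes the same counters in the cgroup's `cpu.stat` file. A minimal sketch of turning them into a throttling rate follows; the function name is hypothetical, the sample numbers are invented, and the path in the last comment assumes cgroup v2 inside the container.

```shell
# Read cpu.stat content on stdin and print the percentage of
# scheduling periods in which the cgroup was throttled.
throttle_pct() {
  awk '$1 == "nr_periods"   { periods = $2 }
       $1 == "nr_throttled" { throttled = $2 }
       END {
         if (periods > 0) printf "%d\n", throttled * 100 / periods
         else print 0
       }'
}

# Sample input in the cpu.stat format (made-up numbers)
printf 'usage_usec 9000000\nnr_periods 1000\nnr_throttled 250\nthrottled_usec 4200000\n' | throttle_pct
# prints 25

# Real use, inside a container: throttle_pct < /sys/fs/cgroup/cpu.stat
```

The counters are cumulative since the cgroup was created, so for alerting you would compare two samples taken some time apart rather than read a single value.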
Solutions:
- Increase the CPU limit if the need is legitimate.
- Adjust the period for lower latency.
- Use CPU shares for burst capability.
- Optimize the application:
  - Reduce CPU-intensive operations
  - Add caching
  - Async processing
For latency-sensitive services, I prefer slightly over-provisioning CPU to avoid any throttling. For batch jobs, throttling is acceptable."
Q4: How do you right-size containers for a new application?
"I follow a data-driven process:
1. Profile under load and record:
Memory:
- Peak memory usage
- Memory growth pattern
- GC behavior (for JVM)
CPU:
- Average utilization
- Peak spikes
- Throttling sensitivity
2. Calculate limits:
- Memory limit = peak usage × 1.2 to 1.3 (20-30% buffer)
- Memory reservation = average usage
- CPU limit = peak requirement + 20% buffer
- CPU request = average usage
3. Test these scenarios:
- Cold start
- Sustained load
- Spike load
- Memory leak scenarios
4. Monitor and iterate:
- Track actual usage vs limits
- Adjust based on real production data
- Set up alerts for approaching limits
Key principle: Don't guess, measure. Profile in staging, validate in production, iterate."
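The limit formulas from this answer, sketched as a script. The measurements are hypothetical sample values, the helper name is made up, and CPU is expressed in millicores (1000m = one core), which is an assumption borrowed from Kubernetes notation.

```shell
# Add a percentage buffer to a measured value (integer math).
with_buffer() {
  value=$1
  buffer_pct=$2
  echo $(( value * (100 + buffer_pct) / 100 ))
}

# Hypothetical measurements from a load test
peak_mem_mb=400
avg_mem_mb=280
peak_cpu_millis=800
avg_cpu_millis=300

echo "memory limit:       $(with_buffer "$peak_mem_mb" 25)MB"    # 500MB
echo "memory reservation: ${avg_mem_mb}MB"
echo "cpu limit:          $(with_buffer "$peak_cpu_millis" 20)m" # 960m
echo "cpu request:        ${avg_cpu_millis}m"
```

The buffer percentages are the starting points named above; production data should replace them once it exists.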
Q5: What's the difference between --cpus, --cpu-shares, and --cpuset-cpus?
"These control different aspects of CPU allocation:
--cpus (hard limit):
- Limits total CPU time
- --cpus=1.5 means the container can use 150% of one core
- Enforced regardless of host CPU availability
- Container is throttled if it tries to use more
- Use when: Predictable resource allocation, multi-tenant environments
--cpu-shares (relative weight):
- Default is 1024
- Only matters under contention
- --cpu-shares=512 gets half the CPU of a 1024 container
- Can burst to full CPU if no contention
- Use when: Batch jobs, where burst capability is valuable
--cpuset-cpus (CPU pinning):
- Restricts to specific CPU cores
- --cpuset-cpus=0,1 only runs on cores 0 and 1
- Good for: NUMA locality, cache optimization, isolating workloads
- Use when: Latency-sensitive apps, NUMA-aware deployments
All three can be combined, for example --cpus=1.5 with a --cpu-shares value above the default 1024 and --cpuset-cpus=0,1. This container:
- Limited to 1.5 CPUs worth of time
- Gets priority over default containers when competing
- Only scheduled on cores 0 and 1"
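To make the share arithmetic concrete, here is a sketch of how weights divide CPU when every container is busy at the same time. It assumes all listed containers are CPU-bound and competing for the same cores; the function name is hypothetical.

```shell
# Percentage of contended CPU one container receives.
# Usage: share_pct <own shares> <shares of every competing container, own included>
share_pct() {
  own=$1
  shift
  total=0
  for s in "$@"; do
    total=$(( total + s ))
  done
  echo $(( own * 100 / total ))   # integer division truncates
}

share_pct 512 512 1024     # prints 33
share_pct 1024 512 1024    # prints 66
```

The 512-share container gets roughly a third of the CPU and the 1024-share container roughly two thirds, which is the "half the CPU of a 1024 container" ratio. With no contention, either one can use whatever is idle.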
📝 Summary & Key Takeaways
Resource Limits Quick Reference
| Resource | Flag | Enforcement |
|---|---|---|
| Memory hard limit | --memory | OOM kill |
| Memory soft limit | --memory-reservation | Reclaim under pressure |
| CPU quota | --cpus | Throttling |
| CPU weight | --cpu-shares | Proportional sharing |
| CPU pinning | --cpuset-cpus | Core restriction |
Application Tuning Summary
| Runtime | Memory Flag | Recommendation |
|---|---|---|
| JVM | -XX:MaxRAMPercentage=75.0 | Leave 25% for non-heap |
| Node.js | --max-old-space-size=N | Set to ~75% of limit in MB |
| Python | N/A | Generally automatic |
Golden Rules
- Always set limits in production - Prevent runaway containers
- Leave headroom - Heap ≠ container limit
- Monitor throttling - It causes latency, not just reduced throughput
- Profile first - Measure, don't guess
- Make apps container-aware - JVM, Node need explicit configuration
📋 Quick Reference
Common Commands
- docker run --memory=512m --cpus=1.5: set a hard memory limit and a CPU quota
- docker stats: show live resource usage per container
JVM Flags
- -XX:MaxRAMPercentage=75.0: size the heap as a percentage of the container limit
- -Xmx: set a fixed maximum heap size
Exit Codes
| Code | Meaning |
|---|---|
| 137 | SIGKILL (OOM or docker kill) |
| 143 | SIGTERM (graceful shutdown) |
| 139 | SIGSEGV (segmentation fault) |
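These codes follow the shell convention of 128 plus the signal number, which a short helper can decode (the function name is made up for illustration):

```shell
# Return the signal number encoded in an exit code, or 0 if the
# process exited on its own rather than being killed by a signal.
exit_code_signal() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $(( code - 128 ))
  else
    echo 0
  fi
}

exit_code_signal 137    # prints 9  (SIGKILL)
exit_code_signal 143    # prints 15 (SIGTERM)
exit_code_signal 139    # prints 11 (SIGSEGV)
```

Exit code 137 alone does not say who sent the SIGKILL. To tell an OOM kill from a manual kill, check the container state, for example with `docker inspect --format '{{.State.OOMKilled}}' <container>`.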
📅 Review Schedule
| Day | Task | Time |
|---|---|---|
| Day 1 | Review memory/CPU limit flags | 10 min |
| Day 3 | Do Exercise 1 (observe OOM) | 15 min |
| Day 7 | Configure JVM for container in real project | 20 min |
| Day 14 | Profile and right-size a service | 30 min |
| Day 30 | Audit all services for resource limits | 30 min |