Container Resource Management

Your Java app needs 512MB but uses 2GB. Your Node.js service gets CPU throttled mysteriously. Containers die with no logs. This article explains how Docker limits resources, how to size containers correctly, and how to avoid the OOM killer.

📋 At a Glance

Aspect	Details
Topic	Memory limits, CPU allocation, OOM killer, JVM/Node.js tuning
Complexity	Advanced
Prerequisites	Part 1 (Container Internals - cgroups)
Key Insight	Container limits are enforced by kernel cgroups, not the application
Time to Master	3-4 hours

🎯 What You'll Learn

Memory management - hard limits, soft limits, swap, OOM behavior
CPU management - shares, quota, pinning, throttling
Language-specific tuning - JVM, Node.js, Python in containers
Right-sizing - how to determine appropriate limits
Monitoring - detecting resource problems before they cause outages

🔥 Production Story: The OOM Serial Killer

A team deployed microservices to Kubernetes. Pods kept restarting at random, some after hours, some after minutes. There were no logs and no errors; the containers simply disappeared.

Investigation:

BASH(8 lines)

The setup:

YAML(5 lines)

DOCKERFILE(3 lines)

Root cause: by default the JVM sizes its maximum heap at 25% of the physical memory it detects. Here it saw the host's 32GB, not the container limit, so it was prepared to grow the heap to 8GB inside a 512MB container.

The fix:

DOCKERFILE(7 lines)

Now the JVM caps the heap at 75% of 512MB = 384MB, leaving room for metaspace, threads, and OS overhead.
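
The arithmetic behind the story can be checked with a few lines of Python. This is only a sketch of the percentages quoted above: the function name `max_heap_mb` is hypothetical, and real JVM heap ergonomics have more special cases than a single multiplication.

```python
def max_heap_mb(visible_memory_mb: float, max_ram_percentage: float = 25.0) -> float:
    """Approximate the JVM's heap ceiling as a percentage of the memory it detects."""
    return visible_memory_mb * max_ram_percentage / 100.0


HOST_MB = 32 * 1024      # the 32GB host from the story
CONTAINER_MB = 512       # the container's memory limit

# Before the fix: ceiling derived from host memory at the default 25%.
before = max_heap_mb(HOST_MB)
# After the fix: ceiling derived from the container limit at 75%.
after = max_heap_mb(CONTAINER_MB, 75.0)

print(f"before: {before:.0f}MB heap ceiling in a {CONTAINER_MB}MB container")
print(f"after:  {after:.0f}MB heap ceiling, {CONTAINER_MB - after:.0f}MB left for non-heap")
```

The first line of output makes the mismatch obvious: a heap ceiling sixteen times larger than the container that holds it.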

Lesson: Applications don't automatically respect container limits. You must configure them to be container-aware.

🧠 Mental Model: Resource Limits Stack

┌─────────────────────────────────────────────────────────────────────────┐
│                       CONTAINER RESOURCE LIMITS                         │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────────┐
│  │                        APPLICATION LAYER                             │
│  │                                                                      │
│  │  JVM: -Xmx, -XX:MaxRAMPercentage                                     │
│  │  Node: --max-old-space-size                                          │
│  │  Python: Various pool sizes                                          │
│  │                                                                      │
│  │  ⚠️  App may try to use MORE than container allows!                  │
│  └──────────────────────────────────────────────────────────────────────┘
│                              │                                          │
│                              ▼                                          │
│  ┌──────────────────────────────────────────────────────────────────────┐
│  │                        CONTAINER LAYER                               │
│  │                                                                      │
│  │  docker run --memory=512m --cpus=1.5                                 │
│  │                                                                      │
│  │  These translate to cgroup limits in kernel                          │
│  └──────────────────────────────────────────────────────────────────────┘
│                              │                                          │
│                              ▼                                          │
│  ┌──────────────────────────────────────────────────────────────────────┐
│  │                         CGROUP LAYER                                 │
│  │                                                                      │
│  │  memory.max = 536870912 (512MB)                                      │
│  │  cpu.max = 150000 100000 (1.5 CPUs)                                  │
│  │                                                                      │
│  │  Kernel ENFORCES these limits                                        │
│  └──────────────────────────────────────────────────────────────────────┘
│                              │                                          │
│                              ▼                                          │
│  ┌──────────────────────────────────────────────────────────────────────┐
│  │                         KERNEL ACTIONS                               │
│  │                                                                      │
│  │  Memory exceeded → OOM Killer → SIGKILL (exit 137)                   │
│  │  CPU exceeded → Throttling (slower, not killed)                      │
│  │                                                                      │
│  └──────────────────────────────────────────────────────────────────────┘
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
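
To make the container-to-cgroup translation in the diagram concrete, here is a small Python sketch that reproduces the two cgroup values shown. The helper names are hypothetical, and the sketch assumes Docker's binary size suffixes (`k`, `m`, `g`) and the default 100ms CFS period.

```python
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def memory_flag_to_bytes(value: str) -> int:
    """Convert a Docker-style size such as '512m' to bytes (binary units)."""
    value = value.strip().lower()
    if value[-1] in _UNITS:
        return int(value[:-1]) * _UNITS[value[-1]]
    return int(value)  # a bare number is already in bytes


def cpus_flag_to_cpu_max(cpus: float, period_us: int = 100_000) -> str:
    """Express --cpus as the 'quota period' pair stored in cpu.max."""
    quota_us = int(round(cpus * period_us))
    return f"{quota_us} {period_us}"


print(memory_flag_to_bytes("512m"))   # value written to memory.max
print(cpus_flag_to_cpu_max(1.5))      # value written to cpu.max
```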

🔬 Deep Dive

Memory Limits

Setting memory limits:

BASH(14 lines)

What counts toward memory limit:

┌─────────────────────────────────────────────────────────────────┐
│                    CONTAINER MEMORY                             │
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  Application Heap                                          │ │
│  │  (Java: -Xmx, Node: --max-old-space-size)                  │ │
│  ├────────────────────────────────────────────────────────────┤ │
│  │  Application Off-Heap                                      │ │
│  │  (JVM metaspace, native memory, thread stacks)             │ │
│  ├────────────────────────────────────────────────────────────┤ │
│  │  Kernel Buffers & Cache                                    │ │
│  │  (File system cache, network buffers)                      │ │
│  ├────────────────────────────────────────────────────────────┤ │
│  │  Shared Libraries                                          │ │
│  │  (libc, language runtime, etc.)                            │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                 │
│  Total counted against cgroup memory limit                      │
│                                                                 │
│  ⚠️ Kernel cache is usually reclaimable,                        │
│     but still counts until reclaimed!                           │
└─────────────────────────────────────────────────────────────────┘
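
One way to see this breakdown for a running container is to read the cgroup's `memory.stat` file. The Python sketch below parses text in the cgroup v2 format and separates anonymous memory from page cache. The sample numbers are made up, the function names are hypothetical, and "usage minus inactive file cache" is the working-set convention used by cAdvisor rather than something the kernel reports directly.

```python
def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse 'key value' lines as found in a cgroup v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage: int, stats: dict[str, int]) -> int:
    """Usage minus inactive file cache, which is the easiest part to reclaim."""
    return max(usage - stats.get("inactive_file", 0), 0)


# Made-up sample values for illustration only.
SAMPLE = """\
anon 314572800
file 157286400
kernel_stack 1048576
inactive_file 104857600
active_file 52428800
"""

stats = parse_memory_stat(SAMPLE)
usage = 480 * 1024 * 1024  # hypothetical memory.current reading
print(f"anonymous (heap, stacks): {stats['anon'] // 2**20}MB")
print(f"page cache:               {stats['file'] // 2**20}MB")
print(f"working set:              {working_set_bytes(usage, stats) // 2**20}MB")
```

In this sample the container sits at 480MB of a 512MB limit, yet about 100MB of that is inactive cache the kernel can drop before it needs to kill anything.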

OOM Killer behavior:

BASH(13 lines)

CPU Limits

Docker provides multiple ways to limit CPU:

1. CPU quota (--cpus) - Hard limit:

BASH(6 lines)

2. CPU shares (--cpu-shares) - Relative weight:

BASH(6 lines)

3. CPU pinning (--cpuset-cpus) - Dedicated cores:

BASH(7 lines)

When to use which:

Scenario	Use	Why
Predictable workload	`--cpus`	Hard cap on CPU time
Batch jobs	`--cpu-shares`	Fair sharing
Latency-sensitive	`--cpuset-cpus`	Cache locality
Resource guarantee	`--cpus` + `--cpu-shares`	Both limit and priority
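
Because `--cpu-shares` is a relative weight, its effect depends on who else is competing. The Python sketch below shows the proportional split when every container is busy at the same time. The container names are hypothetical, and the model ignores quotas and pinning, which would further constrain the result.

```python
def contended_cpu_fraction(shares: dict[str, int]) -> dict[str, float]:
    """Fraction of CPU each container gets when ALL of them are busy at once."""
    total = sum(shares.values())
    return {name: weight / total for name, weight in shares.items()}


# Container names are hypothetical.
result = contended_cpu_fraction({"api": 1024, "batch": 512})
for name, fraction in result.items():
    print(f"{name}: {fraction:.0%} of CPU under contention")
```

If `api` goes idle, `batch` is free to use the whole CPU; the weights only matter while both are runnable.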

CPU throttling:

BASH(11 lines)

JVM Container Tuning

Modern JVMs (8u191+, 11+) are container-aware but need configuration:

DOCKERFILE(13 lines)

Memory breakdown for JVM in container:

Container: 512MB
├── Heap: 75% = 384MB (-XX:MaxRAMPercentage=75.0)
├── Metaspace: ~64MB (class metadata)
├── Thread stacks: ~1MB per thread
├── Native memory: varies
├── JIT code cache: ~48MB
└── Other: GC, buffers

⚠️ If heap = container limit, you WILL OOM
   Always leave 20-30% for non-heap

Checking JVM settings:

BASH(7 lines)

Node.js Container Tuning

Node.js has its own memory management:

DOCKERFILE(9 lines)

Calculating Node.js memory:

BASH(5 lines)
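
As one way to do this calculation, the Python sketch below derives a `--max-old-space-size` value from a container limit, using the roughly-75% guideline from this article. The function name and `server.js` are hypothetical. Keep in mind that the flag only bounds V8's old-generation heap, so buffers and native allocations still need the remaining headroom.

```python
def max_old_space_size_mb(limit_bytes: int, fraction: float = 0.75) -> int:
    """Value for --max-old-space-size (in MB) given a container limit in bytes."""
    return int(limit_bytes * fraction) // (1024 * 1024)


limit = 512 * 1024 * 1024  # a 512MB container
print(f"node --max-old-space-size={max_old_space_size_mb(limit)} server.js")
```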

UV_THREADPOOL_SIZE for I/O-bound apps:

DOCKERFILE(3 lines)

Python Container Tuning

Python's memory management is more automatic, but:

DOCKERFILE(12 lines)
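
Pool sizes are where Python most often ignores container limits: `os.cpu_count()` reports the host's CPUs and does not reflect a CFS quota. The sketch below sizes a worker pool from the text of a cgroup v2 `cpu.max` file instead. The function names are hypothetical, and rounding the quota up to a whole worker is an assumption to tune for your workload.

```python
import math


def effective_cpus(cpu_max_text: str, host_cpus: int) -> float:
    """Interpret a cgroup v2 cpu.max line ('quota period' or 'max period')."""
    quota, period = cpu_max_text.split()
    if quota == "max":
        return float(host_cpus)  # no quota configured
    return min(int(quota) / int(period), float(host_cpus))


def worker_count(cpu_max_text: str, host_cpus: int) -> int:
    """At least one worker, rounded up from the CPU quota."""
    return max(1, math.ceil(effective_cpus(cpu_max_text, host_cpus)))


# A container started with --cpus=1.5 on an 8-core host.
print(worker_count("150000 100000", host_cpus=8))
```

Inside a container the input would typically come from reading `/sys/fs/cgroup/cpu.max`, with `os.cpu_count()` supplying `host_cpus`.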

Memory-intensive Python (NumPy, Pandas):

BASH(6 lines)

Monitoring Container Resources

Real-time stats:

BASH(7 lines)

Inspect cgroup limits:

BASH(11 lines)
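
The same limits can be read from inside the container by an application that wants to size itself. The Python sketch below tries the cgroup v2 file names first and falls back to the v1 layout. The function names are hypothetical and the paths assume the conventional `/sys/fs/cgroup` mount. On cgroup v1 an unlimited container reports a very large number rather than `max`, which this sketch does not special-case.

```python
from pathlib import Path
from typing import Optional

# (limit file, usage file) relative to the cgroup mount: v2 first, then v1.
_CANDIDATES = [
    ("memory.max", "memory.current"),
    ("memory/memory.limit_in_bytes", "memory/memory.usage_in_bytes"),
]


def parse_limit(raw: str) -> Optional[int]:
    """cgroup v2 writes 'max' for unlimited; otherwise the value is in bytes."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def read_memory(root: str = "/sys/fs/cgroup") -> Optional[tuple[Optional[int], int]]:
    """Return (limit, usage) in bytes, or None if no memory files are found."""
    for limit_name, usage_name in _CANDIDATES:
        limit_path, usage_path = Path(root, limit_name), Path(root, usage_name)
        if limit_path.exists() and usage_path.exists():
            return parse_limit(limit_path.read_text()), int(usage_path.read_text().strip())
    return None


if __name__ == "__main__":
    print(read_memory())
```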

Prometheus metrics:

YAML(19 lines)

Right-Sizing Containers

Process for determining limits:

1. Profile application under load
   └─ Measure peak memory, average CPU

2. Add safety margin
   └─ Memory: +20-30% for GC, buffers
   └─ CPU: Consider burst capacity

3. Test under various conditions
   └─ Cold start
   └─ Peak load
   └─ Memory leak scenarios

4. Monitor in production
   └─ Actual usage vs limits
   └─ Throttling frequency
   └─ OOM events

5. Iterate
   └─ Adjust based on real data

Common sizing guidelines:

App Type	Memory Suggestion	CPU Suggestion
JVM microservice	512MB - 1GB	0.5 - 2
Node.js API	256MB - 512MB	0.25 - 1
Python web	256MB - 512MB	0.25 - 1
ML inference	2GB - 8GB	1 - 4
Static content	64MB - 128MB	0.1 - 0.25

⚠️ Common Mistakes

Mistake 1: Setting Heap = Container Limit

BASH(8 lines)

Mistake 2: Ignoring CPU Throttling

BASH(7 lines)

Mistake 3: Not Setting Any Limits

BASH(5 lines)

🐛 Debug This: The Mysterious Slow Container

A developer reports: "My container runs fine with 2 CPUs but is incredibly slow with 0.5 CPU. It's not even using all 0.5 CPU according to docker stats!"

BASH(8 lines)

Why is the app slow if it's not hitting CPU limit?

✅ Solution:

The app is being throttled, not fully utilizing the CPU.

Understanding CPU throttling:

--cpus=0.5 means the container can use 50ms of CPU time per 100ms period. If the app does burst computation:

Period 1 (0-100ms):
├── 0-50ms: App runs at 100% (using its quota)
├── 50-100ms: App THROTTLED (waiting for next period)
└── Apparent usage: 50% of period, but app waited 50ms!

Period 2 (100-200ms):
├── 100-150ms: App runs again
├── 150-200ms: Throttled again
└── And so on...

docker stats reports an average over its sampling interval, so a reading of 35% hides the run-then-stall pattern inside each period.
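
The timeline above can be turned into a small model. The Python sketch below estimates the wall-clock time of a single-threaded burst under a CFS quota. It assumes the burst starts at the beginning of a period and that nothing else in the container consumes quota; the function name is hypothetical.

```python
def wall_clock_ms(cpu_ms_needed: int, quota_ms: int = 50, period_ms: int = 100) -> int:
    """Wall-clock time for a single-threaded burst under a CFS quota."""
    if cpu_ms_needed <= 0:
        return 0
    # Periods in which the quota is used up completely before the final stretch.
    exhausted_periods = (cpu_ms_needed - 1) // quota_ms
    remainder = cpu_ms_needed - exhausted_periods * quota_ms
    return exhausted_periods * period_ms + remainder


for work in (40, 50, 200):
    print(f"{work}ms of CPU work takes {wall_clock_ms(work)}ms of wall-clock time")
```

Work that fits inside one quota finishes at full speed, while a 200ms burst is stretched across four periods. That is why throttling shows up as latency long before average CPU usage looks high.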

Diagnosis:

BASH(4 lines)
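
The throttling counters live in the cgroup's `cpu.stat` file. As a sketch of how to interpret them, the Python below computes the share of periods that were throttled from text in the cgroup v2 format. The sample counters are made up and the function name is hypothetical. On cgroup v1 the time field is `throttled_time` in nanoseconds instead of `throttled_usec`.

```python
def throttle_summary(cpu_stat_text: str) -> dict[str, float]:
    """Summarise throttling from the text of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    return {
        "throttled_ratio": throttled / periods if periods else 0.0,
        "throttled_seconds": stats.get("throttled_usec", 0) / 1_000_000,
    }


# Made-up counters for illustration only.
SAMPLE = """\
usage_usec 42000000
nr_periods 1000
nr_throttled 400
throttled_usec 18000000
"""

summary = throttle_summary(SAMPLE)
print(f"{summary['throttled_ratio']:.0%} of periods throttled, "
      f"{summary['throttled_seconds']:.1f}s spent waiting")
```

The counters are cumulative, so in practice you would sample twice and compare the difference rather than read a single snapshot.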

Solutions:

Increase CPU limit if app needs it:
```
BASH
```
Use CPU shares instead if you want burst capability:
```
BASH(2 lines)
```
Optimize the app - reduce CPU-intensive operations

Adjust period (advanced):

BASH(2 lines)

Key insight: CPU throttling causes latency, not just reduced throughput. A bursty application can feel much slower than its CPU percentage suggests.

💻 Exercises

Exercise 1: Observe OOM Killer

⭐ Difficulty: Easy | ⏱️ Time: 15 minutes

BASH(21 lines)

Exercise 2: Measure CPU Throttling

⭐⭐ Difficulty: Medium | ⏱️ Time: 20 minutes

BASH(30 lines)

Exercise 3: JVM Container Sizing

⭐⭐ Difficulty: Medium | ⏱️ Time: 20 minutes

BASH(30 lines)

Exercise 4: Find the Right Limit

⭐⭐⭐ Difficulty: Hard | ⏱️ Time: 30 minutes

BASH(42 lines)

Exercise 5: Complete Resource Configuration

⭐⭐⭐⭐ Difficulty: Expert | ⏱️ Time: 30 minutes

Create a production-ready docker-compose with proper resource management:

YAML(18 lines)

🎤 Senior-Level Interview Questions

Q1: Explain the difference between --memory and --memory-reservation.

Strong Answer:

"These are hard vs soft limits:

--memory (hard limit):

Enforced by kernel cgroups
If exceeded: OOM killer terminates the container
Container cannot use more than this

--memory-reservation (soft limit):

Scheduling hint for orchestrators
Not enforced when memory is available
When host is under pressure, kernel tries to reclaim down to reservation
Container can exceed this if host has free memory

Example:

BASH

The container:

Keeps roughly 256MB when the host reclaims memory under pressure (a best-effort target, not a hard guarantee)
Can burst to 512MB when available
Is killed if it tries to exceed 512MB

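Use case: Kubernetes has a similar split. requests is what the scheduler reserves for the pod and limits becomes the cgroup hard limit. The mapping is loose, though: a memory request drives scheduling and eviction decisions rather than being applied as a Docker-style soft limit. Set requests to typical usage, limits to peak + buffer.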

I typically set reservation to 50-70% of limit for services with variable memory patterns."

Q2: A Java container keeps getting OOM killed despite setting -Xmx512m in a 512MB container. Why?

Strong Answer:

"JVM uses more than just heap memory. The OOM is because total JVM memory exceeds container limit.

JVM memory components:

Total JVM Memory = Heap (-Xmx)
                 + Metaspace (class metadata, ~64MB+)
                 + Thread stacks (~1MB per thread)
                 + Code cache (JIT compiled code, ~48MB)
                 + Native memory (NIO buffers, JNI)
                 + GC overhead

With -Xmx512m:

Heap: 512MB
Metaspace: 64MB
50 threads: 50MB
Code cache: 48MB
Other: varies
Total: ~700MB in a 512MB container = OOM
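
The sum above can be checked with a short Python sketch that uses the same estimates. The function name is hypothetical, and the figures for metaspace, code cache, and thread stacks are this article's rough numbers rather than measurements.

```python
def jvm_footprint_mb(heap_mb: int, threads: int, metaspace_mb: int = 64,
                     code_cache_mb: int = 48, stack_mb_per_thread: int = 1) -> int:
    """Rough lower bound on JVM memory; ignores NIO buffers, JNI and GC overhead."""
    return heap_mb + metaspace_mb + code_cache_mb + threads * stack_mb_per_thread


LIMIT_MB = 512
total = jvm_footprint_mb(heap_mb=512, threads=50)
print(f"-Xmx512m: at least {total}MB needed, limit is {LIMIT_MB}MB")

# Largest heap that still fits once the same overheads are subtracted.
overhead = jvm_footprint_mb(heap_mb=0, threads=50)
print(f"heap budget: {LIMIT_MB - overhead}MB ({(LIMIT_MB - overhead) / LIMIT_MB:.0%} of limit)")
```

With these estimates the heap that fits comes out at a little under 70% of the limit, before native buffers and GC overhead are counted.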

Solutions:

Reduce heap, leave room for overhead:
```
BASH
```
Use container-aware settings:
```
-XX:MaxRAMPercentage=75.0
```
The JVM then derives its heap ceiling as 75% of the container limit automatically.

Limit other areas:

BASH(3 lines)

Rule of thumb: Heap should be at most 65-75% of the container limit."

Q3: How do you detect and handle CPU throttling?

Strong Answer:

"CPU throttling happens when a container exceeds its CPU quota within a scheduling period.

Detection:

BASH(4 lines)

Symptoms:

High latency spikes
Lower than expected throughput
docker stats showing less CPU usage than limit

Monitoring:

Prometheus: container_cpu_cfs_throttled_periods_total
Alert when throttling rate exceeds threshold

Solutions:

Increase CPU limit if legitimate need:
```
BASH
```

Adjust period for lower latency:

BASH(2 lines)

Use CPU shares for burst capability:

BASH(2 lines)

Optimize application:
- Reduce CPU-intensive operations
- Add caching
- Async processing

For latency-sensitive services, I prefer slightly over-provisioning CPU to avoid any throttling. For batch jobs, throttling is acceptable."

Q4: How do you right-size containers for a new application?

Strong Answer:

"I follow a data-driven process:

Phase 1: Profile without limits

BASH(2 lines)

Phase 2: Analyze memory

Peak memory usage
Memory growth pattern
GC behavior (for JVM)

Phase 3: Analyze CPU

Average utilization
Peak spikes
Throttling sensitivity

Phase 4: Calculate limits

Memory limit = Peak usage * 1.2 to 1.3 (20-30% buffer)
Memory reservation = Average usage

CPU limit = Peak requirement + 20% buffer
CPU request = Average usage
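
These formulas are simple enough to automate once you have samples from a load test. The Python sketch below applies them directly. The sample values and the function name are hypothetical, and the 25% memory buffer is just one point inside the 20-30% range given above.

```python
from statistics import mean


def suggest_limits(memory_mb: list[float], cpu_cores: list[float],
                   memory_buffer: float = 0.25, cpu_buffer: float = 0.20) -> dict[str, float]:
    """Derive limits and reservations from samples collected under load."""
    return {
        "memory_limit_mb": round(max(memory_mb) * (1 + memory_buffer)),
        "memory_reservation_mb": round(mean(memory_mb)),
        "cpu_limit": round(max(cpu_cores) * (1 + cpu_buffer), 2),
        "cpu_request": round(mean(cpu_cores), 2),
    }


# Hypothetical samples from a load test.
limits = suggest_limits(memory_mb=[300, 340, 400], cpu_cores=[0.4, 0.6, 1.0])
print(limits)
```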

Phase 5: Test under various conditions

Cold start
Sustained load
Spike load
Memory leak scenarios

Phase 6: Monitor and iterate

Track actual usage vs limits
Adjust based on real production data
Set up alerts for approaching limits

Example output:

YAML(7 lines)

Key principle: Don't guess, measure. Profile in staging, validate in production, iterate."

Q5: What's the difference between --cpus, --cpu-shares, and --cpuset-cpus?

Strong Answer:

"These control different aspects of CPU allocation:

--cpus (Hard quota):

Limits total CPU time
--cpus=1.5 means container can use 150% of one core
Enforced regardless of host CPU availability
Container is throttled if it tries to use more
Use when: Predictable resource allocation, multi-tenant environments

--cpu-shares (Relative weight):

Default is 1024
Only matters under contention
--cpu-shares=512 gets half the CPU of a 1024 container
Can burst to full CPU if no contention
Use when: Batch jobs, where burst capability is valuable

--cpuset-cpus (Core pinning):

Restricts to specific CPU cores
--cpuset-cpus=0,1 only runs on cores 0 and 1
Good for: NUMA locality, cache optimization, isolating workloads
Use when: Latency-sensitive apps, NUMA-aware deployments

Combined example:

BASH(5 lines)

This container:

Limited to 1.5 CPUs worth of time
Gets priority over default containers when competing
Only scheduled on cores 0 and 1"

📝 Summary & Key Takeaways

Resource Limits Quick Reference

Resource	Flag	Enforcement
Memory hard limit	`--memory`	OOM kill
Memory soft limit	`--memory-reservation`	Reclaim under pressure
CPU quota	`--cpus`	Throttling
CPU weight	`--cpu-shares`	Proportional sharing
CPU pinning	`--cpuset-cpus`	Core restriction

Application Tuning Summary

Runtime	Memory Flag	Recommendation
JVM	`-XX:MaxRAMPercentage=75.0`	Leave 25% for non-heap
Node.js	`--max-old-space-size=N`	Set to ~75% of limit in MB
Python	N/A	Generally automatic

Golden Rules

Always set limits in production - Prevent runaway containers
Leave headroom - Heap ≠ container limit
Monitor throttling - It causes latency, not just reduced throughput
Profile first - Measure, don't guess
Make apps container-aware - JVM, Node need explicit configuration

📋 Quick Reference

Common Commands

BASH(15 lines)

JVM Flags

BASH(4 lines)

Exit Codes

Code	Meaning
137	SIGKILL (OOM or docker kill)
143	SIGTERM (graceful shutdown)
139	SIGSEGV (segmentation fault)
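
These codes follow the shell convention of 128 plus the signal number, which is why 137 means signal 9. A minimal Python sketch for decoding them follows; the function name is hypothetical, and the signal names are those of the platform the script runs on (Linux is assumed here).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Name the signal behind a 128+N exit code, if there is one."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            pass  # not a known signal number on this platform
    return f"exit status {code}"


for code in (137, 143, 139, 1):
    print(code, describe_exit_code(code))
```

An exit code of 137 alone does not prove an OOM kill, since any SIGKILL produces it; the container's OOM-killed state is what confirms the cause.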

📅 Review Schedule

Day	Task	Time
Day 1	Review memory/CPU limit flags	10 min
Day 3	Do Exercise 1 (observe OOM)	15 min
Day 7	Configure JVM for container in real project	20 min
Day 14	Profile and right-size a service	30 min
Day 30	Audit all services for resource limits	30 min

Previous	Current	Next
Part 8: Build Configuration	Part 9: Resource Management	Part 10: Volumes & Storage

Docker Compendium Series:

Part 0: How to Use This Series
Part 1: Container Internals
Part 2: Image Anatomy
Part 3: Build Process Deep Dive
Part 4: Networking Internals
Part 5: Dockerfile Optimization Patterns
Part 6: Multi-Stage Builds: Beyond Basics
Part 7: Base Image Selection & Security
Part 8: ARG, ENV & Build-Time Configuration
Part 9: Container Resource Management ← You are here
Part 10: Volume Patterns & Data Persistence
