Container Internals: What's Really Running
You type docker run nginx and something happens. But what exactly? This article pulls back the curtain on containers - they're not VMs, not magic, just Linux processes with clever isolation. Understanding this changes how you debug, optimize, and secure containers.
📋 At a Glance
| Aspect | Details |
|---|---|
| Topic | Linux namespaces, cgroups, container runtime |
| Complexity | Advanced |
| Prerequisites | Basic Linux process model, basic Docker usage |
| Key Insight | Containers are processes with isolation, not lightweight VMs |
| Time to Master | 3-4 hours |
🎯 What You'll Learn
- Linux Namespaces - how Docker isolates processes, networks, and filesystems
- Control Groups (cgroups) - how resource limits actually work
- The container lifecycle - what happens from docker run to process exit
- OCI Runtime - the actual binary that creates containers
- Why this matters - debugging and optimization implications
🔥 Production Story: The Zombie Process Apocalypse
A team deployed a Python application processing video files. Each request spawned FFmpeg subprocesses. After a week in production, the container showed 1,247 processes - most were zombies (defunct processes).
- Container memory slowly grew
- Process table filled up
- Eventually: "fork: Resource temporarily unavailable"
The fix was a single flag:
docker run --init nginx
This injects a minimal init process (tini) as PID 1 that properly reaps zombies. Without understanding PID namespaces and init responsibilities, this bug was "unexplainable."
🧠 Mental Model: Containers = Processes + Isolation
Forget "lightweight VMs." A container is:
Host kernel (shared by all containers)
- Container A: nginx, host PID 1234
  - Namespaces (isolation): PID: 1, NET: own, MNT: own, UTS: own
  - Cgroups (resource limits): mem: 512MB, cpu: 0.5
  - Filesystem: overlay
- Container B: java, host PID 5678
  - Namespaces (isolation): PID: 1, NET: own, MNT: own, UTS: own
  - Cgroups (resource limits): mem: 1GB, cpu: 2.0
  - Filesystem: overlay
| Aspect | Virtual Machine | Container |
|---|---|---|
| Isolation | Hardware level | OS level |
| Kernel | Own kernel | Shared with host |
| Boot time | Seconds to minutes | Milliseconds |
| Memory overhead | GBs | MBs |
| Process visibility from host | None (guest processes are opaque) | Host sees every container process |
🔬 Deep Dive
Linux Namespaces: The Isolation Mechanism
Namespaces give each container its own view of system resources. There are 8 namespace types:
| Namespace | Flag | What It Isolates |
|---|---|---|
| PID | CLONE_NEWPID | Process IDs - container sees PID 1 |
| Network | CLONE_NEWNET | Network interfaces, routing, ports |
| Mount | CLONE_NEWNS | Filesystem mount points |
| UTS | CLONE_NEWUTS | Hostname and domain name |
| IPC | CLONE_NEWIPC | System V IPC, POSIX message queues |
| User | CLONE_NEWUSER | User and group IDs |
| Cgroup | CLONE_NEWCGROUP | Cgroup root directory |
| Time | CLONE_NEWTIME | System clocks (Linux 5.6+) |
PID Namespace Deep Dive
The same process has two PIDs:
- Host PID: 1234 (real)
- Container PID: 1 (virtualized)
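Both numbers can be read from one place. On Linux 4.1 and later, the NSpid line in /proc/<pid>/status lists a process's ID in each nested PID namespace, outermost first. The following is a minimal sketch assuming a Linux procfs; the helper name ns_pids is made up for illustration.

```bash
# ns_pids: print a process's PID in every PID namespace it belongs to,
# outermost namespace first. Hypothetical helper, not part of Docker.
# Usage: ns_pids [status-file]   (defaults to the current process)
ns_pids() {
    awk '$1 == "NSpid:" {
        out = $2
        for (i = 3; i <= NF; i++) out = out " " $i
        print out
    }' "${1:-/proc/self/status}"
}

# Example (on the host, for the container process with host PID 1234):
#   ns_pids /proc/1234/status
```

Run on the host against a containerized process, this prints the host PID followed by the PID the process has inside its container.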
kill 1 inside a container doesn't kill the host's init, because the container lives in a different PID namespace.
PID 1 inside the container has special responsibilities:
- Reap orphaned child processes (zombie cleanup)
- Forward signals to children
- Handle SIGTERM gracefully for container shutdown
Most applications aren't designed to be PID 1. Solutions:
- docker run --init - injects tini as PID 1
- Use dumb-init or tini in your Dockerfile
- Properly handle signals in your app
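To check whether a container is accumulating zombies, it can help to count processes in state Z. The sketch below assumes a procps-style ps that accepts -eo pid=,stat= (busybox ps may not); count_zombies is a hypothetical helper name.

```bash
# count_zombies: count processes whose state starts with Z (zombie/defunct).
# Reads "PID STAT" pairs on stdin.
count_zombies() {
    awk '$2 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Typical use inside a container (assumes procps ps):
#   ps -eo pid=,stat= | count_zombies
```

A count that keeps growing over time suggests that exited children are not being reaped.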
Network Namespace Deep Dive
Each container gets its own:
- Network interfaces (eth0, lo)
- Routing tables
- iptables rules
- Port bindings
Traffic path on the host:
Container eth0 (172.17.0.2) ↔ veth pair ↔ docker0 bridge (172.17.0.1) → host eth0 → Internet
The veth pair is like a virtual network cable connecting the container's namespace to the host's docker0 bridge.
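Because the routing table is per namespace, a container's default gateway can be read from its own /proc/net/route. The sketch below assumes a little-endian host (x86-64, arm64), where the kernel prints addresses in that file with the bytes reversed; hex_to_ipv4 and default_gateway are illustrative names.

```bash
# hex_to_ipv4: convert a little-endian hex address as printed in
# /proc/net/route (e.g. 010011AC) to dotted-quad notation.
hex_to_ipv4() {
    h=$1
    b1=$(printf '%s' "$h" | cut -c7-8)
    b2=$(printf '%s' "$h" | cut -c5-6)
    b3=$(printf '%s' "$h" | cut -c3-4)
    b4=$(printf '%s' "$h" | cut -c1-2)
    printf '%d.%d.%d.%d\n' "0x$b1" "0x$b2" "0x$b3" "0x$b4"
}

# default_gateway: print the gateway of the default route
# (destination 00000000) from a route table file.
default_gateway() {
    gw=$(awk 'NR > 1 && $2 == "00000000" { print $3; exit }' "${1:-/proc/net/route}")
    [ -n "$gw" ] && hex_to_ipv4 "$gw"
}
```

On Docker's default bridge network, the address printed inside a container would be expected to match the docker0 bridge address on the host.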
Mount Namespace Deep Dive
Containers see a different filesystem than the host:
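One way to observe this is to ask which filesystem is mounted at / in the current mount namespace. The sketch below parses the mountinfo format; rootfs_type is a hypothetical helper, and the expectation of seeing overlay inside a container assumes Docker's overlay2 storage driver.

```bash
# rootfs_type: print the filesystem type mounted at / according to a
# mountinfo file. Field 5 is the mount point; the filesystem type is the
# first field after the "-" separator. If / is mounted more than once,
# the last (topmost) entry wins.
rootfs_type() {
    awk '$5 == "/" {
        for (i = 7; i <= NF; i++)
            if ($i == "-") { fstype = $(i + 1); break }
    }
    END { if (fstype != "") print fstype }' "${1:-/proc/self/mountinfo}"
}
```

Running it on the host and again inside a container will typically give different answers, for example ext4 or xfs on the host and overlay in the container.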
Control Groups (cgroups): Resource Limits
Namespaces provide isolation. Cgroups provide resource limits.
| Controller | Limits |
|---|---|
| memory | RAM usage, swap |
| cpu | CPU time allocation |
| cpuset | Which CPUs can be used |
| blkio | Disk I/O bandwidth |
| pids | Maximum number of processes |
| devices | Device access |
cgroup v1 vs v2
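A common way to tell which version a host is using is to look for the cgroup.controllers file, which only the unified v2 hierarchy exposes at its root. The sketch below assumes the conventional /sys/fs/cgroup mount point; cgroup_version is an illustrative name, and hybrid setups will report v1.

```bash
# cgroup_version: report which cgroup hierarchy is mounted at the given
# root (default /sys/fs/cgroup).
cgroup_version() {
    root=${1:-/sys/fs/cgroup}
    if [ -f "$root/cgroup.controllers" ]; then
        echo v2        # unified hierarchy
    elif [ -d "$root/memory" ]; then
        echo v1        # per-controller directories (also hybrid setups)
    else
        echo unknown
    fi
}
```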
Compared with v1, cgroup v2 offers:
- Single unified hierarchy
- Cleaner resource distribution
- Better pressure stall information (PSI)
- Required for some K8s features
Memory Limits in Practice
Example: a container using 450MB against a 512MB limit.
If usage exceeds the limit:
1. The kernel tries to reclaim memory
2. If it can't, the OOM killer activates
3. A process in the container's cgroup is killed with SIGKILL (typically the largest memory consumer, often the main process)
4. If the main process was killed, the container exits with code 137 (128 + 9)
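The limit itself is just a number in a cgroup file. The sketch below reads it and converts it to MiB, checking the cgroup v2 file name (memory.max) first and the v1 name (memory.limit_in_bytes) second; mem_limit_mib is a hypothetical helper, and the default directory assumes you are inside a container on cgroup v2.

```bash
# mem_limit_mib: print a cgroup's memory limit in MiB, or "unlimited".
# Usage: mem_limit_mib [cgroup-dir]
mem_limit_mib() {
    dir=${1:-/sys/fs/cgroup}
    if [ -r "$dir/memory.max" ]; then                 # cgroup v2
        limit=$(cat "$dir/memory.max")
    elif [ -r "$dir/memory.limit_in_bytes" ]; then    # cgroup v1
        limit=$(cat "$dir/memory.limit_in_bytes")
    else
        echo "no memory limit file in $dir" >&2
        return 1
    fi
    if [ "$limit" = "max" ]; then
        echo unlimited
    else
        echo "$((limit / 1024 / 1024)) MiB"
    fi
}
```

On cgroup v1 the file lives under the memory controller directory (for example /sys/fs/cgroup/memory), and "no limit" appears as a very large number rather than the word max.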
CPU Limits in Practice
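Two main mechanisms: a hard cap (--cpus, implemented as a CFS quota per scheduling period) and relative weights (--cpu-shares, which only matter when CPUs are contended). Pinning to specific cores with --cpuset-cpus is a third option.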
| Scenario | Use | Why |
|---|---|---|
| Predictable workload | --cpus | Hard upper bound on CPU time |
| Batch jobs | --cpu-shares | Proportional share under contention |
| Latency-sensitive | --cpuset-cpus | Dedicated cores |
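With cgroup v2, the hard cap set by --cpus shows up in the cpu.max file as a quota and a period in microseconds. The sketch below turns that pair back into a CPU count; cpu_limit is an illustrative name, and the default path assumes a cgroup v2 container.

```bash
# cpu_limit: convert the contents of a cgroup v2 cpu.max file
# ("<quota> <period>" in microseconds, or "max <period>") to a CPU count.
# Usage: cpu_limit [cpu.max-file]
cpu_limit() {
    awk '{
        if ($1 == "max") print "unlimited"
        else printf "%.2f\n", $1 / $2
    }' "${1:-/sys/fs/cgroup/cpu.max}"
}
```

A container started with --cpus=0.5 would be expected to show a quota of 50000 against a period of 100000, which this prints as 0.50.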
What Happens During docker run
Tracing docker run -d -p 8080:80 nginx:
1. Docker CLI → Docker Daemon (dockerd)
   - Parse arguments
   - Validate image exists (or pull)
2. Docker Daemon → containerd
   - Create container metadata
   - Prepare rootfs (overlay mount)
   - Generate OCI runtime spec (config.json)
3. containerd → containerd-shim → runc
   - shim: Keeps container alive if containerd restarts
   - runc: Actually creates the container
4. runc creates container:
   a. Create namespaces (clone syscall with namespace flags)
   b. Set up cgroups
   c. Configure seccomp filters
   d. Set capabilities
   e. Pivot root to container filesystem
   f. Execute entrypoint (nginx)
5. Container running!
   - Process running in isolated namespaces
   - Resource limits enforced by cgroups
   - containerd-shim monitors for exit
| Component | Role |
|---|---|
| Docker CLI | User interface, command parsing |
| dockerd | Image management, networking, volumes |
| containerd | Container lifecycle, image transfer |
| containerd-shim | Keep container running across daemon restarts |
| runc | Actually create container (OCI runtime) |
The OCI Runtime Specification
OCI (Open Container Initiative) defines standards for containers. The runtime spec defines:
- config.json - Container configuration
- rootfs/ - The container's filesystem
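To make the bundle layout concrete, the sketch below writes a pared-down config.json next to an empty rootfs directory. The field names follow the OCI runtime spec, but the values (the ./bundle path, the demo hostname, the nginx arguments, the 512MB limit) are illustrative assumptions, and the file omits the mounts a real container needs.

```bash
# Write a minimal OCI bundle skeleton. BUNDLE is a hypothetical path.
BUNDLE=${BUNDLE:-./bundle}
mkdir -p "$BUNDLE/rootfs"

cat > "$BUNDLE/config.json" <<'EOF'
{
  "ociVersion": "1.0.2",
  "process": {
    "terminal": false,
    "user": { "uid": 0, "gid": 0 },
    "args": ["nginx", "-g", "daemon off;"],
    "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
    "cwd": "/"
  },
  "root": { "path": "rootfs", "readonly": false },
  "hostname": "demo",
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" },
      { "type": "uts" },
      { "type": "ipc" }
    ],
    "resources": {
      "memory": { "limit": 536870912 }
    }
  }
}
EOF
```

Each entry under linux.namespaces corresponds to one of the namespace types described earlier, and linux.resources carries the cgroup limits. For a complete starting point, runc spec generates a full default config.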
This standardization means you can use different runtimes:
- runc - Default, reference implementation
- crun - Faster, written in C
- gVisor (runsc) - Sandboxed, intercepts syscalls
- Kata Containers - VM-level isolation
Verifying Isolation from the Host
You can peek inside namespaces from the host:
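Every process exposes its namespace memberships as symlinks under /proc/<pid>/ns, so two processes share a namespace exactly when the link targets match. The sketch below compares them; ns_id and same_ns are hypothetical helper names, and reading another user's /proc/<pid>/ns entries usually requires root.

```bash
# ns_id: print the namespace identifier of a process, e.g. pid:[4026531836].
# Usage: ns_id <pid> <type> [proc-root]
ns_id() {
    readlink "${3:-/proc}/$1/ns/$2"
}

# same_ns: succeed if two processes share the namespace of the given type.
# Usage: same_ns <pid-a> <pid-b> <type> [proc-root]
same_ns() {
    a=$(ns_id "$1" "$3" "$4") || return 2
    b=$(ns_id "$2" "$3" "$4") || return 2
    [ "$a" = "$b" ]
}

# Example (as root on the host; 1234 is a container's host PID):
#   same_ns 1 1234 pid || echo "PID 1234 is in its own PID namespace"
```

Repeating the comparison for net, mnt, and uts shows which kinds of isolation a given container actually has.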
⚠️ Common Mistakes
Mistake 1: Treating Containers Like VMs
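A container is a set of processes with isolation that share the host kernel. It does not boot its own OS, and the host can see every process inside it.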
Mistake 2: Ignoring PID 1 Responsibilities
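An application running as PID 1 must reap orphaned children, forward signals, and handle SIGTERM. If it was not written to do that, use docker run --init or an init such as tini or dumb-init.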
Mistake 3: Assuming Container Isolation = Security
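Namespaces and cgroups limit what a process can see and consume, but every container still shares the host kernel, so a kernel vulnerability affects all containers on the host. Where stronger isolation is needed, sandboxed runtimes such as gVisor or Kata Containers are an option.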
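How much a containerized process could do to the shared kernel depends partly on its capabilities, which the runtime sets at creation. The sketch below tests a single bit in a capability mask such as the CapEff value in /proc/<pid>/status; has_cap is a hypothetical helper, and the bit number for CAP_SYS_ADMIN (21) comes from linux/capability.h.

```bash
# has_cap: succeed if bit <n> is set in a hexadecimal capability mask.
# Usage: has_cap <hexmask> <bit>
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_SYS_ADMIN=21   # bit number from linux/capability.h

# Example: inspect the current process
#   mask=$(awk '$1 == "CapEff:" { print $2 }' /proc/self/status)
#   has_cap "$mask" "$CAP_SYS_ADMIN" && echo "CAP_SYS_ADMIN is present"
```

A container holding CAP_SYS_ADMIN has far more reach into the host kernel than one without it, regardless of how many namespaces surround it.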
🐛 Debug This: The Invisible Process
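A developer reports: "My process is running inside a container, and top on the host shows it as PID 45678. When I try to kill 45678 from another container, it says 'No such process'. What's going on?"
The other container has its own PID namespace, and PID 45678 does not exist there. That number is only meaningful in the host's PID namespace, so the signal has to be sent from the host, or from inside the target container using the PID the process has there.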
💻 Exercises
Exercise 1: Explore Your Container's Namespaces
⭐ Difficulty: Easy | ⏱️ Time: 15 minutes
Exercise 2: Watch OOM Killer in Action
⭐⭐ Difficulty: Medium | ⏱️ Time: 20 minutes
Exercise 3: Create Container "By Hand" with unshare
⭐⭐⭐ Difficulty: Hard | ⏱️ Time: 30 minutes
Create a minimal container without Docker using Linux primitives:
Does ps work? Fix it by creating the necessary directories.
Exercise 4: Measure Namespace Overhead
⭐⭐⭐ Difficulty: Hard | ⏱️ Time: 25 minutes
Exercise 5: Debug Zombie Processes
⭐⭐⭐⭐ Difficulty: Expert | ⏱️ Time: 30 minutes
🎤 Senior-Level Interview Questions
Q1: What's the difference between a container and a VM?
"The fundamental difference is the isolation level. VMs virtualize hardware - each VM runs its own kernel, booting a full OS. The hypervisor manages hardware sharing.
Containers virtualize the OS - they share the host kernel but use Linux namespaces for isolation and cgroups for resource limits. This means:
- Startup: Containers start in milliseconds (just a process), VMs in seconds/minutes (full boot)
- Overhead: Containers use MBs of memory overhead, VMs use GBs
- Isolation: VMs have stronger isolation (separate kernel). A kernel vulnerability affects all containers but only one VM
- Density: You can run 100s of containers on a host, maybe 10s of VMs
For most workloads, containers are better for efficiency. VMs are better when you need strong multi-tenant isolation or different operating systems."
Q2: A container process shows as PID 1 inside but PID 5000 on the host. Explain.
"This is PID namespace isolation. The runtime passes the CLONE_NEWPID flag to the clone syscall to create a new PID namespace. Inside this namespace, PIDs start from 1 - the first process becomes PID 1. But the kernel still tracks the 'real' PID in the root namespace.
The process has two PIDs simultaneously:
- Namespace PID: 1 (what the container sees)
- Host PID: 5000 (what the kernel uses)
If the application isn't designed to be PID 1, use --init to run tini as PID 1 instead."
Q3: How do you debug a container that exits immediately with code 137?
"Exit code 137 is 128 + 9, meaning the process received SIGKILL. This is almost always the OOM killer.
My debugging steps:
- Confirm OOM: docker inspect shows OOMKilled: true
- Check memory limit: Was it set too low? docker inspect --format '{{.HostConfig.Memory}}'
- Check actual usage: docker stats before it dies, or kernel logs: dmesg | grep -i 'killed process'
- Analyze the app: Is it a memory leak, or does it legitimately need more?
- Fix options:
  - Increase limit if app needs it
  - Fix memory leak if there is one
  - For JVM: ensure -XX:+UseContainerSupport and set -XX:MaxRAMPercentage
  - Add memory monitoring/alerting before OOM
The key insight: OOM kills are silent from the container's perspective. No logs, no graceful shutdown, just SIGKILL."
Q4: Why might a process inside a container not see other processes also inside that container?
"If a container is running multiple processes and one can't see the others, there are a few possibilities:
- Different PID namespaces: If a process was started with additional isolation, it might have its own PID namespace. Docker --init does this correctly, but custom setups might not.
- Proc filesystem not mounted: If /proc isn't mounted (or mounted incorrectly), ps won't show processes. This happens in minimal containers.
- Incorrect /proc mount: If /proc was mounted from outside the namespace, it shows the wrong namespace's processes.
- Process actually died: Race condition - process started but exited before ps ran.
To debug, check that /proc is mounted correctly with mount | grep proc. Use ls /proc to see if PIDs are visible. Check if there are multiple PID namespaces with lsns -t pid."
Q5: Explain what happens between docker run nginx and nginx actually serving requests.
"The full sequence involves multiple components:
- Docker CLI parses the command, communicates with Docker daemon via REST API over Unix socket
- Docker daemon (dockerd):
  - Checks image exists locally, pulls if not
  - Creates container metadata
  - Sets up networking (creates veth pair, connects to bridge)
  - Prepares storage (overlay mount)
- Containerd:
  - Receives request from dockerd
  - Creates OCI runtime bundle (config.json + rootfs)
  - Spawns containerd-shim
- Containerd-shim:
  - Stays running to monitor container
  - Survives containerd restarts
  - Calls runc
- Runc (OCI runtime):
  - Creates namespaces (clone syscall)
  - Sets up cgroups
  - Applies seccomp filters
  - Sets capabilities
  - Performs pivot_root to container's filesystem
  - Execs nginx
- Nginx starts as PID 1 in container (PID N on host)
- Network path:
  - Nginx binds to port 80 inside container
  - Docker sets up iptables DNAT rule
  - External traffic to host:8080 → container:80
Total time: ~100-500ms for a cached image."
📝 Summary & Key Takeaways
Core Concepts
| Concept | Key Point |
|---|---|
| Containers | Processes with isolation, not VMs |
| Namespaces | Provide resource isolation (PID, network, mount, etc.) |
| Cgroups | Provide resource limits (memory, CPU, I/O) |
| PID 1 | Has special responsibilities - use --init |
| OCI Runtime | Standardized container creation (runc) |
The Container Equation
Container = Linux Process + Namespaces (isolation) + Cgroups (resource limits) + Overlay filesystem (image)
What You Can Do Now
- Debug namespace issues: Use nsenter and lsns to inspect containers
- Understand resource limits: Know why OOM happens (exit code 137)
- Avoid zombie processes: Use --init or proper signal handling
- Explain containers technically: Move beyond "lightweight VMs" explanation
📋 Quick Reference
Namespace Types
| Namespace | Creates Isolation For |
|---|---|
| pid | Process IDs |
| net | Network stack |
| mnt | Mount points |
| uts | Hostname |
| ipc | IPC resources |
| user | User/group IDs |
| cgroup | Cgroup root |
| time | System clocks |
Key Commands
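- docker run --init <image> - run with tini as PID 1
- lsns -t pid - list PID namespaces
- nsenter -t <host-pid> -n ip addr - run a command inside a container's network namespace
- mount | grep proc - check how /proc is mounted
- docker inspect --format '{{.HostConfig.Memory}}' <container> - show the memory limit
- docker stats - watch live resource usage
- dmesg | grep -i 'killed process' - look for OOM kills in the kernel log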
Exit Codes
| Code | Signal | Meaning |
|---|---|---|
| 0 | - | Normal exit |
| 1 | - | Application error |
| 137 | SIGKILL (9) | OOM killed or docker kill |
| 143 | SIGTERM (15) | Graceful shutdown |
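The table follows the shell convention that an exit code above 128 means "terminated by signal (code - 128)". A small sketch of that arithmetic is shown below; explain_exit is a hypothetical helper that names only the two signals listed in the table.

```bash
# explain_exit: interpret a container exit code.
# Usage: explain_exit <code>
explain_exit() {
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        case $sig in
            9)  echo "killed by signal 9 (SIGKILL): OOM kill or docker kill" ;;
            15) echo "killed by signal 15 (SIGTERM): graceful shutdown" ;;
            *)  echo "killed by signal $sig" ;;
        esac
    elif [ "$code" -eq 0 ]; then
        echo "normal exit"
    else
        echo "exited with error code $code"
    fi
}
```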
📅 Review Schedule
| Day | Task | Time |
|---|---|---|
| Day 1 | Re-read namespace section, draw diagram from memory | 10 min |
| Day 3 | Do Exercise 1 (explore namespaces) | 15 min |
| Day 7 | Explain container vs VM to a colleague | 5 min |
| Day 14 | Do Exercise 2 (OOM killer) | 20 min |
| Day 30 | Answer interview questions without looking | 15 min |
📚 Series Navigation
- Part 0: How to Use This Series
- Part 1: Container Internals ← You are here
- Part 2: Image Anatomy
- Part 3: Build Process Deep Dive
- Part 4: Networking Internals