Container Internals: What's Really Running

You run docker run nginx and something happens. But what exactly? This article pulls back the curtain on containers - they're not VMs, not magic, just Linux processes with clever isolation. Understanding this changes how you debug, optimize, and secure containers.

📋 At a Glance

Aspect	Details
Topic	Linux namespaces, cgroups, container runtime
Complexity	Advanced
Prerequisites	Basic Linux process model, basic Docker usage
Key Insight	Containers are processes with isolation, not lightweight VMs
Time to Master	3-4 hours

🎯 What You'll Learn

Linux Namespaces - how Docker isolates processes, networks, and filesystems
Control Groups (cgroups) - how resource limits actually work
The container lifecycle - what happens from docker run to process exit
OCI Runtime - the actual binary that creates containers
Why this matters - debugging and optimization implications

🔥 Production Story: The Zombie Process Apocalypse

A team deployed a Python application processing video files. Each request spawned FFmpeg subprocesses. After a week in production, the container showed 1,247 processes - most were zombies (defunct processes).

The symptoms:

Container memory slowly grew
Process table filled up
Eventually: "fork: Resource temporarily unavailable"

Root cause: The Python app wasn't reaping child processes. In a VM or on bare metal, the init system (PID 1) adopts and reaps orphaned children. Inside a container, the app IS PID 1, and Python doesn't reap children unless the code explicitly waits on them.

The fix: One flag, --init, added to the existing docker run command.

This injects a minimal init process (tini) as PID 1 that properly reaps zombies. Without understanding PID namespaces and init responsibilities, this bug was "unexplainable."

The lesson: Containers aren't VMs. The rules are different. PID 1 has special responsibilities, and your app might not be ready for them.
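
One way to spot this symptom before the process table fills up is to count processes in the zombie state. The following is a minimal sketch, assuming a Linux /proc filesystem; the function names count_zombies and zombies_visible_here are illustrative and not part of Docker.

```shell
# Count zombie entries in /proc/<pid>/status text read from stdin.
# A zombie's status file contains a line like "State:  Z (zombie)".
count_zombies() {
    awk '$1 == "State:" && $2 == "Z" { n++ } END { print n + 0 }'
}

# Hypothetical helper: feed the status file of every process visible in the
# current PID namespace to count_zombies. Processes that exit mid-scan are
# skipped because cat's error messages are discarded.
zombies_visible_here() {
    cat /proc/[0-9]*/status 2>/dev/null | count_zombies
}
```

Run inside the container (for example through docker exec), a steadily rising count suggests that PID 1 is not reaping its children.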

🧠 Mental Model: Containers = Processes + Isolation

Forget "lightweight VMs." A container is:

┌────────────────────────────────────────────────────────────────────┐
│                         HOST KERNEL                                  │
│                    (shared by all containers)                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────────────┐    ┌──────────────────────┐               │
│  │     Container A      │    │     Container B      │               │
│  │                      │    │                      │               │
│  │  ┌────────────────┐  │    │  ┌────────────────┐  │               │
│  │  │   NAMESPACES   │  │    │  │   NAMESPACES   │  │   Isolation   │
│  │  │  ┌──────────┐  │  │    │  │  ┌──────────┐  │  │               │
│  │  │  │ PID: 1   │  │  │    │  │  │ PID: 1   │  │  │               │
│  │  │  │ NET: own │  │  │    │  │  │ NET: own │  │  │               │
│  │  │  │ MNT: own │  │  │    │  │  │ MNT: own │  │  │               │
│  │  │  │ UTS: own │  │  │    │  │  │ UTS: own │  │  │               │
│  │  │  └──────────┘  │  │    │  │  └──────────┘  │  │               │
│  │  └────────────────┘  │    │  └────────────────┘  │               │
│  │                      │    │                      │               │
│  │  ┌────────────────┐  │    │  ┌────────────────┐  │               │
│  │  │    CGROUPS     │  │    │  │    CGROUPS     │  │   Resource    │
│  │  │  mem: 512MB    │  │    │  │  mem: 1GB      │  │   Limits      │
│  │  │  cpu: 0.5      │  │    │  │  cpu: 2.0      │  │               │
│  │  └────────────────┘  │    │  └────────────────┘  │               │
│  │                      │    │                      │               │
│  │  ┌────────────────┐  │    │  ┌────────────────┐  │               │
│  │  │  FILESYSTEM    │  │    │  │  FILESYSTEM    │  │   Overlay     │
│  │  │  (overlay)     │  │    │  │  (overlay)     │  │   FS          │
│  │  └────────────────┘  │    │  └────────────────┘  │               │
│  └──────────────────────┘    └──────────────────────┘               │
│                                                                     │
│   Host PID 1234 ─────────────────── Host PID 5678                   │
│   (nginx in Container A)           (java in Container B)            │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Key insight: Container A's "PID 1" is actually PID 1234 on the host. The kernel is shared. Only the view is isolated.

VM vs Container:

Aspect	Virtual Machine	Container
Isolation	Hardware level	OS level
Kernel	Own kernel	Shared with host
Boot time	Seconds to minutes	Milliseconds
Memory overhead	GBs	MBs
Process visibility	None	Host sees all

🔬 Deep Dive

Linux Namespaces: The Isolation Mechanism

Namespaces give each container its own view of system resources. There are 8 namespace types:

Namespace	Flag	What It Isolates
PID	`CLONE_NEWPID`	Process IDs - container sees PID 1
Network	`CLONE_NEWNET`	Network interfaces, routing, ports
Mount	`CLONE_NEWNS`	Filesystem mount points
UTS	`CLONE_NEWUTS`	Hostname and domain name
IPC	`CLONE_NEWIPC`	System V IPC, POSIX message queues
User	`CLONE_NEWUSER`	User and group IDs
Cgroup	`CLONE_NEWCGROUP`	Cgroup root directory
Time	`CLONE_NEWTIME`	Monotonic and boot-time clocks (Linux 5.6+)

PID Namespace Deep Dive

[Bash listing, 10 lines; not preserved in this copy]

The same process has two PIDs:

Host PID: 1234 (real)
Container PID: 1 (virtualized)

This is why kill 1 inside a container doesn't kill the host's init - different PID namespace.
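
The two PIDs can be read from a single place: on Linux 4.1 and later, /proc/<pid>/status carries an NSpid line listing the process's PID in each nested PID namespace, outermost first. A minimal sketch, with ns_pids as an illustrative name:

```shell
# Print the outermost and innermost PID recorded on the NSpid line of
# /proc/<pid>/status text read from stdin.
ns_pids() {
    awk '$1 == "NSpid:" { printf "outer=%s inner=%s\n", $2, $NF }'
}
```

On the host, something like `ns_pids < /proc/1234/status` (using the host PID from the example above) would show both views of the same process. Inside the container the line has a single entry, so both values are identical.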

PID 1 special responsibilities:

Reap orphaned child processes (zombie cleanup)
Forward signals to children
Handle SIGTERM gracefully for container shutdown

Most applications aren't designed to be PID 1. Solutions:

docker run --init - injects tini as PID 1
Use dumb-init or tini in your Dockerfile
Properly handle signals in your app
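
For the third option, here is a rough sketch of what signal forwarding can look like in a shell entrypoint. The function name run_and_forward is hypothetical, and the sketch only relays SIGTERM to one child; it does not reap orphans adopted by PID 1, which is why --init or tini is usually the simpler choice.

```shell
#!/bin/sh
# run_and_forward CMD [ARGS...]
# Start CMD as a child, relay SIGTERM to it, and return its exit status.
run_and_forward() {
    "$@" &
    child=$!
    trap 'kill -TERM "$child" 2>/dev/null' TERM

    while :; do
        # wait returns early (status > 128) if the trap fires while waiting.
        wait "$child" && rc=0 || rc=$?
        # Keep waiting only while the child still exists.
        kill -0 "$child" 2>/dev/null || break
    done

    trap - TERM
    return "$rc"
}
```

An entrypoint script could end with `run_and_forward python app.py` (a placeholder command) followed by `exit $?`, so that docker stop reaches the application instead of being swallowed by the shell.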

Network Namespace Deep Dive

Each container gets its own:

Network interfaces (eth0, lo)
Routing tables
iptables rules
Port bindings

[Bash listing, 11 lines; not preserved in this copy]

How container networking works:

┌─────────────────────────────────────────────────────┐
│                      HOST                           │
│                                                     │
│  ┌────────────┐         ┌────────────┐              │
│  │ Container  │         │  docker0   │              │
│  │            │         │  bridge    │              │
│  │  eth0 ─────┼─ veth ──┼─ 172.17.0.1│── eth0 ──────┼──► Internet
│  │ 172.17.0.2 │  pair   │            │              │
│  │            │         │            │              │
│  └────────────┘         └────────────┘              │
│                                                     │
└─────────────────────────────────────────────────────┘

The veth pair is like a virtual network cable connecting the container's namespace to the host's docker0 bridge.
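
To find which host-side veth belongs to a container, one approach is to follow the interface indexes exposed under /sys/class/net: a veth device's iflink value is the ifindex of its peer. The sketch below assumes that sysfs layout; both function names are illustrative.

```shell
# peer_ifindex [IFACE] [SYSFS_NET_DIR]
# Print the interface index of IFACE's link peer. For a veth device this is
# the index of the other end of the pair, which lives in another namespace.
peer_ifindex() {
    cat "${2:-/sys/class/net}/${1:-eth0}/iflink"
}

# find_iface_by_index INDEX [SYSFS_NET_DIR]
# Print the name of the interface whose ifindex equals INDEX.
find_iface_by_index() {
    for dir in "${2:-/sys/class/net}"/*; do
        [ -r "$dir/ifindex" ] || continue
        if [ "$(cat "$dir/ifindex")" = "$1" ]; then
            basename "$dir"
        fi
    done
}
```

Running `peer_ifindex eth0` inside the container yields a number; passing that number to `find_iface_by_index` on the host should name the veth end attached to docker0.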

Mount Namespace Deep Dive

Containers see a different filesystem than the host:

[Bash listing, 7 lines; not preserved in this copy]

This uses an overlay filesystem (covered in Part 2), mounted in the container's mount namespace.
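
A quick way to confirm what backs the root filesystem is to read the process's own mount table. This sketch parses the /proc/<pid>/mountinfo format; root_fstype is an illustrative name.

```shell
# root_fstype: read /proc/<pid>/mountinfo text from stdin and print the
# filesystem type of the mount whose mount point is "/".
root_fstype() {
    awk '$5 == "/" {
        # Optional fields end at a lone "-"; the filesystem type follows it.
        for (i = 7; i <= NF; i++) {
            if ($i == "-") { print $(i + 1); exit }
        }
    }'
}
```

Comparing `root_fstype < /proc/self/mountinfo` on the host and inside a container shows the two mount namespaces side by side. With Docker's overlay2 storage driver the container is expected to report overlay.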

Volumes are bind mounts into the namespace:

[Bash listing, 2 lines; not preserved in this copy]

Control Groups (cgroups): Resource Limits

Namespaces provide isolation. Cgroups provide resource limits.

What cgroups control:

Controller	Limits
`memory`	RAM usage, swap
`cpu`	CPU time allocation
`cpuset`	Which CPUs can be used
`blkio`	Disk I/O bandwidth (`io` in cgroup v2)
`pids`	Maximum number of processes
`devices`	Device access

cgroup v1 vs v2

[Bash listing, 6 lines; not preserved in this copy]

v2 advantages:

Single unified hierarchy
Cleaner resource distribution
Per-cgroup pressure stall information (PSI)
Required for some Kubernetes features (for example, memory QoS)
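
Which version a machine uses can usually be told from the layout of /sys/fs/cgroup. A minimal sketch, assuming the conventional mount point; cgroup_version is an illustrative name, and hybrid setups are reported as v1 here.

```shell
# cgroup_version [ROOT]
# Print "v2", "v1", or "unknown" for the cgroup mount at ROOT.
cgroup_version() {
    root=${1:-/sys/fs/cgroup}
    if [ -f "$root/cgroup.controllers" ]; then
        echo v2        # the unified hierarchy exposes cgroup.controllers at its root
    elif [ -d "$root/memory" ]; then
        echo v1        # v1 mounts one directory per controller
    else
        echo unknown
    fi
}
```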

Memory Limits in Practice

[Bash listing, 6 lines; not preserved in this copy]
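
The limit a container actually runs under is just a file in its cgroup. The sketch below reads it for either cgroup version; mem_limit_bytes is an illustrative name and the default path assumes the conventional /sys/fs/cgroup mount as seen from inside the container.

```shell
# mem_limit_bytes [ROOT]
# Print the memory limit for the cgroup directory ROOT.
mem_limit_bytes() {
    root=${1:-/sys/fs/cgroup}
    if [ -r "$root/memory.max" ]; then
        cat "$root/memory.max"                     # cgroup v2: bytes, or "max"
    elif [ -r "$root/memory/memory.limit_in_bytes" ]; then
        cat "$root/memory/memory.limit_in_bytes"   # cgroup v1: bytes
    else
        echo unknown
    fi
}
```

Inside a container started with --memory=512m, this would be expected to report 536870912 (512 × 1024 × 1024). On the host's root cgroup under v2 there is no memory.max file, so the sketch prints unknown there.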

OOM behavior:

┌─────────────────────────────────────────────────────┐
│              Container Memory Usage                 │
│                                                     │
│  ┌─────────────────────────────────────────────────┐│
│  │                                                 ││
│  │  ████████████████████░░░░░░░░░░░  450MB         ││
│  │                                                 ││
│  │  ─────────────────────────────── 512MB limit    ││
│  └─────────────────────────────────────────────────┘│
│                                                     │
│  If usage exceeds limit:                            │
│  1. Kernel tries to reclaim memory                  │
│  2. If it can't: OOM killer activates               │
│  3. A process in the cgroup is killed (SIGKILL)     │
│  4. If it was PID 1, exit code: 137 (128 + 9)       │
└─────────────────────────────────────────────────────┘

CPU Limits in Practice

Two main mechanisms:

[Bash listing, 8 lines; not preserved in this copy]

When to use which:

Scenario	Use	Why
Predictable workload	`--cpus`	Hard cap on CPU time
Batch jobs	`--cpu-shares`	Relative weight, enforced only under contention
Latency-sensitive	`--cpuset-cpus`	Dedicated cores
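
Under cgroup v2, a --cpus value ends up as a quota and a period in the cpu.max file, for example "50000 100000" for half a CPU. The following sketch converts that pair back into a CPU count; cpu_limit is an illustrative name.

```shell
# cpu_limit: read cgroup v2 cpu.max text ("QUOTA PERIOD") from stdin and print
# the equivalent number of CPUs, or "unlimited" when QUOTA is "max".
cpu_limit() {
    awk '{ if ($1 == "max") print "unlimited"; else printf "%.2f\n", $1 / $2 }'
}
```

Inside a container this could be used as `cpu_limit < /sys/fs/cgroup/cpu.max`. The quota is CPU time per period, so a limit of 0.50 means 50 ms of CPU in every 100 ms window, not half of one particular core.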

What Happens During `docker run`

Let's trace docker run -d -p 8080:80 nginx:

┌─────────────────────────────────────────────────────────────────┐
│                 docker run -d -p 8080:80 nginx                  │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 1. Docker CLI → Docker Daemon (dockerd)                         │
│    - Parse arguments                                            │
│    - Validate image exists (or pull)                            │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 2. Docker Daemon → containerd                                   │
│    - Create container metadata                                  │
│    - Prepare rootfs (overlay mount)                             │
│    - Generate OCI runtime spec (config.json)                    │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 3. containerd → containerd-shim → runc                          │
│    - shim: Keeps container alive if containerd restarts         │
│    - runc: Actually creates the container                       │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 4. runc creates container:                                      │
│    a. Create namespaces (clone syscall with namespace flags)    │
│    b. Set up cgroups                                            │
│    c. Configure seccomp filters                                 │
│    d. Set capabilities                                          │
│    e. Pivot root to container filesystem                        │
│    f. Execute entrypoint (nginx)                                │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 5. Container running!                                           │
│    - Process running in isolated namespaces                     │
│    - Resource limits enforced by cgroups                        │
│    - containerd-shim monitors for exit                          │
└─────────────────────────────────────────────────────────────────┘

Component responsibilities:

Component	Role
Docker CLI	User interface, command parsing
dockerd	Image management, networking, volumes
containerd	Container lifecycle, image transfer
containerd-shim	Keep container running across daemon restarts
runc	Actually create container (OCI runtime)

The OCI Runtime Specification

OCI (Open Container Initiative) defines standards for containers. The runtime spec defines:

config.json - Container configuration

[JSON listing, 25 lines; not preserved in this copy]

rootfs/ - The container's filesystem
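
Together these two pieces form a bundle directory, which is what an OCI runtime is pointed at. As a small sketch, a directory can be checked for that shape before being handed to a runtime; is_oci_bundle is an illustrative name, and the check does not validate the contents of config.json.

```shell
# is_oci_bundle DIR
# Succeed if DIR contains the two pieces described above: a config.json file
# and a rootfs/ directory. (runc can write a starter config.json into the
# current directory with "runc spec".)
is_oci_bundle() {
    [ -f "$1/config.json" ] && [ -d "$1/rootfs" ]
}
```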

This standardization means you can use different runtimes:

runc - Default, reference implementation
crun - Faster, written in C
gVisor (runsc) - Sandboxed, intercepts syscalls
Kata Containers - VM-level isolation

Verifying Isolation from the Host

You can peek inside namespaces from the host:

[Bash listing, 17 lines; not preserved in this copy]
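
Each namespace a process belongs to appears as a symlink under /proc/<pid>/ns, and two processes share a namespace exactly when the link targets match. A minimal sketch of that comparison follows; the function names are illustrative, and reading another user's /proc/<pid>/ns entries on the host typically requires root.

```shell
# ns_id TYPE PID [PROC_ROOT]
# Print the namespace identifier, for example "pid:[4026531836]".
ns_id() {
    readlink "${3:-/proc}/$2/ns/$1"
}

# same_namespace TYPE PID_A PID_B [PROC_ROOT]
# Succeed if both PIDs are in the same namespace of the given type.
# Returns 2 if either link cannot be read.
same_namespace() {
    a=$(ns_id "$1" "$2" "${4:-}") || return 2
    b=$(ns_id "$1" "$3" "${4:-}") || return 2
    [ "$a" = "$b" ]
}
```

On the host, `same_namespace pid 1 <container-host-pid>` is expected to fail for a normally started container, while `same_namespace user 1 <container-host-pid>` succeeds unless user namespace remapping is enabled.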

⚠️ Common Mistakes

Mistake 1: Treating Containers Like VMs

[Bash listing, 10 lines; not preserved in this copy]

Why it matters: Containers are process-centric. Multiple processes need proper init, signal handling, and logging becomes complicated.

Mistake 2: Ignoring PID 1 Responsibilities

[Bash listing, 13 lines; not preserved in this copy]

Mistake 3: Assuming Container Isolation = Security

[Bash listing, 14 lines; not preserved in this copy]
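
One concrete way to see how much privilege a container really holds is to inspect its effective capability mask, since namespaces alone say nothing about it. The sketch below decodes the CapEff line from /proc/<pid>/status. The function names are illustrative, and the masks mentioned in the note are sample values (one commonly seen for default Docker containers, one for a privileged container) that may differ by Docker and kernel version.

```shell
# effective_caps: print the CapEff hex mask from /proc/<pid>/status text on stdin.
effective_caps() {
    awk '$1 == "CapEff:" { print $2 }'
}

# has_cap CAPEFF_HEX BIT
# Succeed if capability number BIT is set in the mask.
# Capability numbers come from linux/capability.h, e.g. CAP_NET_RAW is 13
# and CAP_SYS_ADMIN is 21.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}
```

For example, `has_cap "$(effective_caps < /proc/self/status)" 21` run inside a container reports whether it holds CAP_SYS_ADMIN. With the sample default mask 00000000a80425fb it does not; with a privileged-style mask such as 0000003fffffffff it does.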

🐛 Debug This: The Invisible Process

A developer reports: "My container shows PID 1 as my app, but top on the host shows it as PID 45678. When I try to kill 45678 from another container, it says 'No such process'. What's going on?"

[Bash listing, 12 lines; not preserved in this copy]

Why can't Container B kill the process?

✅ Solution:

Each container has its own PID namespace. Container B can only see processes in its own namespace - it cannot see or signal processes in Container A's namespace.

The PID 45678 is the host PID. Container B has no visibility into the host's PID namespace (by default). Even if Container B knew the host PID, it couldn't send signals to it.

Ways to actually kill from Container B:

[Bash listing, 10 lines; not preserved in this copy]

Key insight: PID namespaces are hierarchical. The host can see every container's processes, but a container can't see the host's or a sibling container's, which is both a security feature and a source of confusion.

💻 Exercises

Exercise 1: Explore Your Container's Namespaces

⭐ Difficulty: Easy | ⏱️ Time: 15 minutes

[Bash listing, 13 lines; not preserved in this copy]

Expected findings: PID, mount, UTS, IPC, network namespaces should differ. User namespace may be shared (unless using user namespace remapping).

Exercise 2: Watch OOM Killer in Action

⭐⭐ Difficulty: Medium | ⏱️ Time: 20 minutes

[Bash listing, 23 lines; not preserved in this copy]

Exercise 3: Create Container "By Hand" with unshare

⭐⭐⭐ Difficulty: Hard | ⏱️ Time: 30 minutes

Create a minimal container without Docker using Linux primitives:

[Bash listing, 19 lines; not preserved in this copy]

Challenge: Why doesn't ps work? Fix it by creating the necessary directories.

Exercise 4: Measure Namespace Overhead

⭐⭐⭐ Difficulty: Hard | ⏱️ Time: 25 minutes

[Bash listing, 18 lines; not preserved in this copy]

Exercise 5: Debug Zombie Processes

⭐⭐⭐⭐ Difficulty: Expert | ⏱️ Time: 30 minutes

[Bash listing, 33 lines; not preserved in this copy]

🎤 Senior-Level Interview Questions

Q1: What's the difference between a container and a VM?

Strong Answer:

"The fundamental difference is the isolation level. VMs virtualize hardware - each VM runs its own kernel, booting a full OS. The hypervisor manages hardware sharing.

Containers virtualize the OS - they share the host kernel but use Linux namespaces for isolation and cgroups for resource limits. This means:

Startup: Containers start in milliseconds (just a process), VMs in seconds/minutes (full boot)
Overhead: Containers use MBs of memory overhead, VMs use GBs
Isolation: VMs have stronger isolation (separate kernel). A kernel vulnerability affects all containers but only one VM
Density: You can run 100s of containers on a host, maybe 10s of VMs

For most workloads, containers are better for efficiency. VMs are better when you need strong multi-tenant isolation or different operating systems."

Q2: A container process shows as PID 1 inside but PID 5000 on the host. Explain.

Strong Answer:

"This is PID namespace isolation. When Docker creates a container, it uses the CLONE_NEWPID flag with the clone syscall to create a new PID namespace.

Inside this namespace, PIDs start from 1 - the first process becomes PID 1. But the kernel still tracks the 'real' PID in the root namespace.

The process has two PIDs simultaneously:

Namespace PID: 1 (what the container sees)
Host PID: 5000 (what the host sees)

This matters because PID 1 has special responsibilities: signal handling and zombie reaping. Most applications aren't designed for this, which is why we often use --init to run tini as PID 1 instead."

Q3: How do you debug a container that exits immediately with code 137?

Strong Answer:

"Exit code 137 is 128 + 9, meaning the process received SIGKILL. Most often that's the OOM killer, though docker kill or a docker stop that hits its timeout produces the same code.

My debugging steps:

Confirm OOM: docker inspect shows OOMKilled: true
Check memory limit: Was it set too low? docker inspect --format '{{.HostConfig.Memory}}'
Check actual usage: docker stats before it dies, or kernel logs: dmesg | grep -i 'killed process'
Analyze the app: Is it a memory leak, or does it legitimately need more?
Fix options:
- Increase limit if app needs it
- Fix memory leak if there is one
- For JVM: ensure -XX:+UseContainerSupport and set -XX:MaxRAMPercentage
- Add memory monitoring/alerting before OOM

The key insight: OOM kills are silent from the container's perspective. No logs, no graceful shutdown, just SIGKILL."
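
When the OOM killer takes out a child process rather than PID 1, the container keeps running and docker inspect may not be the first place you look. On cgroup v2 the kernel keeps a counter in the cgroup's memory.events file, which this sketch reads; oom_kill_count is an illustrative name.

```shell
# oom_kill_count: read cgroup v2 memory.events text from stdin and print the
# number of processes in the cgroup that the OOM killer has killed.
oom_kill_count() {
    awk '$1 == "oom_kill" { print $2 }'
}
```

Inside a still-running container this could be invoked as `oom_kill_count < /sys/fs/cgroup/memory.events`. A non-zero value means at least one process in the container has already been OOM-killed.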

Q4: Why might a process inside a container not see other processes also inside that container?

Strong Answer:

"If a container is running multiple processes and one can't see the others, there are a few possibilities:

Different PID namespaces: If a process was started with additional isolation (for example, via unshare --pid), it might have its own PID namespace. Docker's --init keeps every process in the container's single PID namespace, but custom setups might not.
Proc filesystem not mounted: If /proc isn't mounted (or mounted incorrectly), ps won't show processes. This happens in minimal containers.
Incorrect /proc mount: If /proc was mounted from outside the namespace, it shows the wrong namespace's processes.
Process actually died: Race condition - process started but exited before ps ran.

Debugging: Check /proc is mounted correctly with mount | grep proc. Use ls /proc to see if PIDs are visible. Check if there are multiple PID namespaces with lsns -t pid."

Q5: Explain what happens between `docker run nginx` and nginx actually serving requests.

Strong Answer:

"The full sequence involves multiple components:

Docker CLI parses the command, communicates with Docker daemon via REST API over Unix socket
Docker daemon (dockerd):
- Checks image exists locally, pulls if not
- Creates container metadata
- Sets up networking (creates veth pair, connects to bridge)
- Prepares storage (overlay mount)
Containerd:
- Receives request from dockerd
- Creates OCI runtime bundle (config.json + rootfs)
- Spawns containerd-shim
Containerd-shim:
- Stays running to monitor container
- Survives containerd restarts
- Calls runc
Runc (OCI runtime):
- Creates namespaces (clone syscall)
- Sets up cgroups
- Applies seccomp filters
- Sets capabilities
- Performs pivot_root to container's filesystem
- Execs nginx
Nginx starts as PID 1 in container (PID N on host)
Network path:
- Nginx binds to port 80 inside container
- Docker sets up iptables DNAT rule
- External traffic to host:8080 → container:80

Total time: ~100-500ms for a cached image."

📝 Summary & Key Takeaways

Core Concepts

Concept	Key Point
Containers	Processes with isolation, not VMs
Namespaces	Provide resource isolation (PID, network, mount, etc.)
Cgroups	Provide resource limits (memory, CPU, I/O)
PID 1	Has special responsibilities - use --init
OCI Runtime	Standardized container creation (runc)

The Container Equation

Container = Linux Process
          + Namespaces (isolation)
          + Cgroups (resource limits)
          + Overlay filesystem (image)

What You Can Do Now

Debug namespace issues: Use nsenter and lsns to inspect containers
Understand resource limits: Know why OOM happens (exit code 137)
Avoid zombie processes: Use --init or proper signal handling
Explain containers technically: Move beyond "lightweight VMs" explanation

📋 Quick Reference

Namespace Types

Namespace	Creates Isolation For
pid	Process IDs
net	Network stack
mnt	Mount points
uts	Hostname
ipc	IPC resources
user	User/group IDs
cgroup	Cgroup root
time	Monotonic and boot-time clocks

Key Commands

[Bash listing, 14 lines; not preserved in this copy]

Exit Codes

Code	Signal	Meaning
0	-	Normal exit
1	-	Application error
137	SIGKILL (9)	OOM killed or `docker kill`
143	SIGTERM (15)	Terminated by SIGTERM, e.g. after `docker stop`
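
The signal-related codes in this table follow the shell convention of 128 plus the signal number. A small sketch of the decoding, with signal_from_exit as an illustrative name:

```shell
# signal_from_exit CODE
# Print the signal number encoded in a shell-style exit code (128 + N),
# or nothing if CODE does not indicate death by signal.
signal_from_exit() {
    if [ "$1" -gt 128 ] && [ "$1" -le 255 ]; then
        echo $(( $1 - 128 ))
    fi
}
```

It can be combined with the exit code of a stopped container, for example `signal_from_exit "$(docker inspect --format '{{.State.ExitCode}}' <container>)"`.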

📅 Review Schedule

Day	Task	Time
Day 1	Re-read namespace section, draw diagram from memory	10 min
Day 3	Do Exercise 1 (explore namespaces)	15 min
Day 7	Explain container vs VM to a colleague	5 min
Day 14	Do Exercise 2 (OOM killer)	20 min
Day 30	Answer interview questions without looking	15 min

Previous	Current	Next
Part 0: How to Use This Series	Part 1: Container Internals	Part 2: Image Anatomy

Docker Compendium Series:

Part 0: How to Use This Series
Part 1: Container Internals ← You are here
Part 2: Image Anatomy
Part 3: Build Process Deep Dive
Part 4: Networking Internals
