
Container Internals: What's Really Running

You run docker run nginx and something happens. But what exactly? This article pulls back the curtain on containers - they're not VMs, not magic, just Linux processes with clever isolation. Understanding this changes how you debug, optimize, and secure containers.

📋 At a Glance

Aspect | Details
Topic | Linux namespaces, cgroups, container runtime
Complexity | Advanced
Prerequisites | Basic Linux process model, basic Docker usage
Key Insight | Containers are processes with isolation, not lightweight VMs
Time to Master | 3-4 hours

🎯 What You'll Learn

  • Linux Namespaces - how Docker isolates processes, networks, and filesystems
  • Control Groups (cgroups) - how resource limits actually work
  • The container lifecycle - what happens from docker run to process exit
  • OCI Runtime - the actual binary that creates containers
  • Why this matters - debugging and optimization implications

🔥 Production Story: The Zombie Process Apocalypse

A team deployed a Python application processing video files. Each request spawned FFmpeg subprocesses. After a week in production, the container showed 1,247 processes - most were zombies (defunct processes).

The symptoms:
  • Container memory slowly grew
  • Process table filled up
  • Eventually: "fork: Resource temporarily unavailable"

Root cause: The Python app wasn't reaping child processes. On a VM or bare metal, the init system (PID 1) adopts and reaps orphaned children. Inside a container, the app IS PID 1 - and Python does not reap child processes automatically.

The fix: One flag: docker run --init <image>

This injects a minimal init process (tini) as PID 1 that properly reaps zombies. Without understanding PID namespaces and init responsibilities, this bug was "unexplainable."

The lesson: Containers aren't VMs. The rules are different. PID 1 has special responsibilities, and your app might not be ready for them.

🧠 Mental Model: Containers = Processes + Isolation

Forget "lightweight VMs." A container is:

┌─────────────────────────────────────────────────────────────────────┐
│                         HOST KERNEL                                 │
│                    (shared by all containers)                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────────────┐    ┌──────────────────────┐               │
│  │     Container A      │    │     Container B      │               │
│  │                      │    │                      │               │
│  │  ┌────────────────┐  │    │  ┌────────────────┐  │               │
│  │  │   NAMESPACES   │  │    │  │   NAMESPACES   │  │   Isolation   │
│  │  │  ┌──────────┐  │  │    │  │  ┌──────────┐  │  │               │
│  │  │  │ PID: 1   │  │  │    │  │  │ PID: 1   │  │  │               │
│  │  │  │ NET: own │  │  │    │  │  │ NET: own │  │  │               │
│  │  │  │ MNT: own │  │  │    │  │  │ MNT: own │  │  │               │
│  │  │  │ UTS: own │  │  │    │  │  │ UTS: own │  │  │               │
│  │  │  └──────────┘  │  │    │  │  └──────────┘  │  │               │
│  │  └────────────────┘  │    │  └────────────────┘  │               │
│  │                      │    │                      │               │
│  │  ┌────────────────┐  │    │  ┌────────────────┐  │               │
│  │  │    CGROUPS     │  │    │  │    CGROUPS     │  │   Resource    │
│  │  │  mem: 512MB    │  │    │  │  mem: 1GB      │  │   Limits      │
│  │  │  cpu: 0.5      │  │    │  │  cpu: 2.0      │  │               │
│  │  └────────────────┘  │    │  └────────────────┘  │               │
│  │                      │    │                      │               │
│  │  ┌────────────────┐  │    │  ┌────────────────┐  │               │
│  │  │  FILESYSTEM    │  │    │  │  FILESYSTEM    │  │   Overlay     │
│  │  │  (overlay)     │  │    │  │  (overlay)     │  │   FS          │
│  │  └────────────────┘  │    │  └────────────────┘  │               │
│  └──────────────────────┘    └──────────────────────┘               │
│                                                                     │
│   Host PID 1234 ─────────────────── Host PID 5678                   │
│   (nginx in Container A)           (java in Container B)            │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Key insight: Container A's "PID 1" is actually PID 1234 on the host. The kernel is shared. Only the view is isolated.

VM vs Container:

Aspect | Virtual Machine | Container
Isolation | Hardware level | OS level
Kernel | Own kernel | Shared with host
Boot time | Seconds to minutes | Milliseconds
Memory overhead | GBs | MBs
Process visibility | Host cannot see guest processes | Host sees all container processes

🔬 Deep Dive

Linux Namespaces: The Isolation Mechanism

Namespaces give each container its own view of system resources. There are 8 namespace types:

Namespace | Flag | What It Isolates
PID | CLONE_NEWPID | Process IDs - container sees PID 1
Network | CLONE_NEWNET | Network interfaces, routing, ports
Mount | CLONE_NEWNS | Filesystem mount points
UTS | CLONE_NEWUTS | Hostname and domain name
IPC | CLONE_NEWIPC | System V IPC, message queues
User | CLONE_NEWUSER | User and group IDs
Cgroup | CLONE_NEWCGROUP | Cgroup root directory
Time | CLONE_NEWTIME | System clocks (Linux 5.6+)
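
Every namespace a process belongs to is exposed as a symlink under /proc/<pid>/ns. The following is a minimal sketch, assuming a Linux host with /proc mounted; same_ns is an illustrative helper name, not a standard tool, and reading the links of another user's process generally needs root.

```shell
#!/bin/sh
# Print the namespace identifiers of the current shell.
# Each link target looks like "pid:[<inode>]"; two processes are in the
# same namespace exactly when these targets match.
for ns in /proc/$$/ns/*; do
  printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns")"
done

# same_ns PID_A PID_B TYPE: succeed if both processes share the namespace of
# the given type (pid, net, mnt, uts, ipc, user, cgroup, time).
# Returns 2 if either link cannot be read.
same_ns() {
  a=$(readlink "/proc/$1/ns/$3") || return 2
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$a" = "$b" ]
}

same_ns $$ $$ pid && echo "same pid namespace"
```

Comparing a containerized process against PID 1 of the host with same_ns is a quick way to see which of the namespace types above actually differ.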

PID Namespace Deep Dive

[Code listing: Bash, 10 lines]

The same process has two PIDs:
  • Host PID: 1234 (as seen from the host's root PID namespace)
  • Container PID: 1 (as seen from inside the container's PID namespace)

This is why kill 1 inside a container doesn't kill the host's init - different PID namespace.
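
On Linux 4.1 and later, the kernel lists every PID a process holds on the NSpid line of /proc/<pid>/status, outermost namespace first. A small sketch of reading it; ns_pids is a hypothetical helper name.

```shell
# ns_pids STATUS_FILE: print the PIDs from the NSpid line of a
# /proc/<pid>/status file, outermost namespace first.
ns_pids() {
  awk '$1 == "NSpid:" { $1 = ""; sub(/^ +/, ""); print; exit }' "$1"
}

# A process that is not inside a nested PID namespace has a single entry.
ns_pids /proc/self/status
```

Run on the host against a containerized process, the line carries one entry per nesting level, so the nginx example above would be reported with both its host PID and its container PID.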
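PID 1 special responsibilities:
  1. Reap orphaned child processes (zombie cleanup)
  2. Forward signals to children
  3. Handle SIGTERM explicitly for container shutdown: the kernel does not apply default signal actions to PID 1, so without a handler SIGTERM is ignored and docker stop falls back to SIGKILL after its timeout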

Most applications aren't designed to be PID 1. Solutions:

  • docker run --init - injects tini as PID 1
  • Use dumb-init or tini in your Dockerfile
  • Properly handle signals in your app
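
To see what "not reaping" looks like, the sketch below creates a zombie on purpose and finds it through /proc. It assumes Linux with /proc mounted and a sleep that accepts fractional seconds (GNU coreutils does); count_zombies is an illustrative helper, and its simple field splitting assumes process names without spaces.

```shell
#!/bin/sh
# A parent that never reaps: the inner shell backgrounds a short-lived child
# and then exec's into `sleep`, which never calls wait() on that child.
sh -c 'sleep 0.2 & exec sleep 3' &
parent=$!
sleep 1   # by now the child has exited, but nobody has collected its status

# count_zombies PPID: count processes in state Z whose parent is PPID,
# read from /proc/<pid>/stat ("pid (comm) state ppid ...").
count_zombies() {
  n=0
  for f in /proc/[0-9]*/stat; do
    # Processes can vanish mid-scan, so tolerate read failures.
    { read -r _pid _comm state ppid _rest < "$f"; } 2>/dev/null || continue
    [ "$state" = "Z" ] && [ "$ppid" = "$1" ] && n=$((n + 1))
  done
  echo "$n"
}

zombies=$(count_zombies "$parent")
echo "zombie children of $parent: $zombies"

# Once the parent dies, the zombie is handed to an ancestor reaper.
kill "$parent" 2>/dev/null
```

When the negligent parent is PID 1 of a container and never exits, there is no later reaper to hand the zombie to, which is the gap an init such as tini fills.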

Network Namespace Deep Dive

Each container gets its own:

  • Network interfaces (eth0, lo)
  • Routing tables
  • iptables rules
  • Port bindings
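
Because /proc/net is resolved per network namespace, reading /proc/net/dev from inside a container lists only that container's interfaces. A minimal sketch; list_ifaces is a hypothetical helper, and the optional file argument exists only to make it easy to try on saved output.

```shell
# list_ifaces [FILE]: print the interface names from a /proc/net/dev-style
# file (defaults to the live one). The first two lines are column headers.
list_ifaces() {
  awk -F: 'NR > 2 { gsub(/[ \t]/, "", $1); print $1 }' "${1:-/proc/net/dev}"
}

list_ifaces
```

On the host this typically includes the physical NIC, docker0 and one veth per running container; inside a default bridge-mode container it is usually just lo and eth0.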
[Code listing: Bash, 11 lines]

How container networking works:
┌─────────────────────────────────────────────────────┐
│                      HOST                           │
│                                                     │
│  ┌────────────┐         ┌────────────┐              │
│  │ Container  │         │  docker0   │              │
│  │            │         │  bridge    │              │
│  │  eth0 ─────┼─ veth ──┼─ 172.17.0.1│── eth0 ──────┼──► Internet
│  │ 172.17.0.2 │  pair   │            │              │
│  │            │         │            │              │
│  └────────────┘         └────────────┘              │
│                                                     │
└─────────────────────────────────────────────────────┘

The veth pair is like a virtual network cable connecting the container's namespace to the host's docker0 bridge.

Mount Namespace Deep Dive

Containers see a different filesystem than the host:

[Code listing: Bash, 7 lines]

This uses the overlay filesystem (covered in Part 2), mounted in the container's mount namespace.
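
One way to check which filesystem backs the root of the current mount namespace is to read /proc/self/mountinfo. The sketch below assumes the documented mountinfo layout (the filesystem type follows a lone "-" separator); root_fstype is a hypothetical helper name.

```shell
# root_fstype [FILE]: print the filesystem type mounted at "/" according to a
# mountinfo-style file (defaults to this process's own mount namespace).
# Layout: id parent maj:min root mountpoint opts [optional...] - fstype source superopts
root_fstype() {
  awk '$5 == "/" {
         for (i = 7; i <= NF; i++) if ($i == "-") { t = $(i + 1); break }
       }
       END { if (t != "") print t }' "${1:-/proc/self/mountinfo}"
}

root_fstype
```

If several mounts are stacked on "/", the last one listed wins. Inside a container using an overlay storage driver this would be expected to print overlay, while on the host it prints the host's root filesystem type.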
Volumes are bind mounts into the namespace:

[Code listing: Bash, 2 lines]

Control Groups (cgroups): Resource Limits

Namespaces provide isolation. Cgroups provide resource limits.

What cgroups control:

Controller | Limits
memory | RAM usage, swap
cpu | CPU time allocation
cpuset | Which CPUs can be used
blkio (io in cgroup v2) | Disk I/O bandwidth
pids | Maximum number of processes
devices | Device access

cgroup v1 vs v2

[Code listing: Bash, 6 lines]

v2 advantages:
  • Single unified hierarchy
  • Cleaner resource distribution
  • Per-cgroup pressure stall information (PSI)
  • Required for some Kubernetes features
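
A quick way to tell which hierarchy a host uses is to look at the layout under /sys/fs/cgroup. This is a sketch under that assumption; cgroup_version is an illustrative helper, and a hybrid setup (v1 directories plus a separate unified mount) is reported here as v1.

```shell
# cgroup_version [ROOT]: report "v2" if ROOT is a unified (cgroup v2) mount,
# "v1" if it holds per-controller directories, "unknown" otherwise.
# The file cgroup.controllers exists in cgroup v2 directories, never in v1.
cgroup_version() {
  root=${1:-/sys/fs/cgroup}
  if [ -f "$root/cgroup.controllers" ]; then
    echo v2
  elif [ -d "$root/memory" ] || [ -d "$root/cpu" ]; then
    echo v1
  else
    echo unknown
  fi
}

cgroup_version
```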

Memory Limits in Practice

[Code listing: Bash, 6 lines]

OOM behavior:
┌─────────────────────────────────────────────────────┐
│              Container Memory Usage                 │
│                                                     │
│  ┌─────────────────────────────────────────────────┐│
│  │                                                 ││
│  │  ████████████████████░░░░░░░░░░░  450MB         ││
│  │                                                 ││
│  │  ─────────────────────────────── 512MB limit    ││
│  └─────────────────────────────────────────────────┘│
│                                                     │
│  If exceeds limit:                                  │
│  1. Kernel tries to reclaim memory                  │
│  2. If can't: OOM killer activates                  │
│  3. A process in the cgroup is killed (SIGKILL)     │
│  4. If it was PID 1: exit code 137 (128 + 9)        │
└─────────────────────────────────────────────────────┘
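
How close a container is to its limit can be read from the cgroup's own files. The sketch below assumes cgroup v2, where memory.current holds the usage in bytes and memory.max holds the limit (or the literal string max when unlimited); mem_usage_pct is a hypothetical helper.

```shell
# mem_usage_pct DIR: print how full a cgroup v2 memory limit is, as a whole
# percentage, from DIR/memory.current and DIR/memory.max.
mem_usage_pct() {
  current=$(cat "$1/memory.current") || return 1
  max=$(cat "$1/memory.max") || return 1
  if [ "$max" = "max" ]; then
    echo "unlimited"
  else
    echo "$((current * 100 / max))%"
  fi
}

# Usage, assuming a cgroup v2 container with its own cgroup namespace:
#   mem_usage_pct /sys/fs/cgroup
```

For the diagram's numbers, 450 MiB used against a 512 MiB limit corresponds to 471859200 and 536870912 bytes in those two files.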

CPU Limits in Practice

Two main mechanisms:

[Code listing: Bash, 8 lines]

When to use which:

Scenario | Use | Why
Predictable workload | --cpus | Hard ceiling on CPU time
Batch jobs | --cpu-shares | Relative weight, enforced only under contention
Latency-sensitive | --cpuset-cpus | Pinned to specific cores
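
A --cpus value becomes a CFS quota over a period: on cgroup v2 the cpu.max file holds "QUOTA PERIOD" in microseconds, so --cpus=0.5 with the default 100000 microsecond period appears as "50000 100000". A small sketch that converts the file back into a CPU count; cpu_limit is an illustrative name.

```shell
# cpu_limit [FILE]: convert a cgroup v2 cpu.max line ("QUOTA PERIOD" in
# microseconds, or "max PERIOD" for no cap) into a number of CPUs.
# Reads standard input when no file is given.
cpu_limit() {
  awk '{ if ($1 == "max") print "unlimited"
         else printf "%.2f\n", $1 / $2 }' "$@"
}

echo "50000 100000" | cpu_limit
```

By contrast, --cpu-shares sets only a relative weight and leaves the quota unlimited, which is why it has no effect until CPUs are contended.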

What Happens During docker run

Let's trace docker run -d -p 8080:80 nginx:
┌─────────────────────────────────────────────────────────────────┐
│                 docker run -d -p 8080:80 nginx                  │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 1. Docker CLI → Docker Daemon (dockerd)                         │
│    - Parse arguments                                            │
│    - Validate image exists (or pull)                            │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 2. Docker Daemon → containerd                                   │
│    - Create container metadata                                  │
│    - Prepare rootfs (overlay mount)                             │
│    - Generate OCI runtime spec (config.json)                    │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 3. containerd → containerd-shim → runc                          │
│    - shim: Keeps container alive if containerd restarts         │
│    - runc: Actually creates the container                       │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 4. runc creates container:                                      │
│    a. Create namespaces (clone syscall with namespace flags)    │
│    b. Set up cgroups                                            │
│    c. Configure seccomp filters                                 │
│    d. Set capabilities                                          │
│    e. Pivot root to container filesystem                        │
│    f. Execute entrypoint (nginx)                                │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 5. Container running!                                           │
│    - Process running in isolated namespaces                     │
│    - Resource limits enforced by cgroups                        │
│    - containerd-shim monitors for exit                          │
└─────────────────────────────────────────────────────────────────┘
Component responsibilities:

Component | Role
Docker CLI | User interface, command parsing
dockerd | Image management, networking, volumes
containerd | Container lifecycle, image transfer
containerd-shim | Keep container running across daemon restarts
runc | Actually create container (OCI runtime)

The OCI Runtime Specification

OCI (Open Container Initiative) defines standards for containers. The runtime spec defines:

  1. config.json - Container configuration
[Code listing: JSON, 25 lines]

  2. rootfs/ - The container's filesystem

This standardization means you can use different runtimes:

  • runc - Default, reference implementation
  • crun - Faster, written in C
  • gVisor (runsc) - Sandboxed, intercepts syscalls
  • Kata Containers - VM-level isolation

Verifying Isolation from the Host

You can peek inside namespaces from the host:

[Code listing: Bash, 17 lines]
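
As a hedged sketch of inspecting a container from the host, the helpers below combine docker inspect with nsenter. The function names and the container name "web" are hypothetical; the commands assume Docker and util-linux are installed and that you run them as root.

```shell
# container_pid NAME: host PID of a container's main process.
container_pid() {
  docker inspect --format '{{.State.Pid}}' "$1"
}

# in_container_netns NAME CMD...: run a host binary inside the container's
# network namespace. The binary comes from the host, so this works even when
# the image ships no debugging tools.
in_container_netns() {
  name=$1
  shift
  nsenter --target "$(container_pid "$name")" --net "$@"
}

# Usage (as root, with a running container named "web"):
#   in_container_netns web ip addr
#   lsns -p "$(container_pid web)"
```

Swapping --net for --pid, --mount or --uts enters the corresponding namespace instead.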

⚠️ Common Mistakes

Mistake 1: Treating Containers Like VMs

[Code listing: Bash, 10 lines]

Why it matters: Containers are process-centric. Running several processes in one container requires a proper init and signal handling, and it complicates logging.

Mistake 2: Ignoring PID 1 Responsibilities

[Code listing: Bash, 13 lines]

Mistake 3: Assuming Container Isolation = Security

[Code listing: Bash, 14 lines]

🐛 Debug This: The Invisible Process

A developer reports: "My container shows PID 1 as my app, but top on the host shows it as PID 45678. When I try to kill 45678 from another container, it says 'No such process'. What's going on?"

[Code listing: Bash, 12 lines]

Why can't Container B kill the process?

✅ Solution:
Each container has its own PID namespace. Container B can only see processes in its own namespace - it cannot see or signal processes in Container A's namespace.

The PID 45678 is the host PID. Container B has no visibility into the host's PID namespace (by default). Even if Container B knew the host PID, it couldn't signal it: kill resolves PIDs within the caller's own namespace, where that host PID does not refer to the target process.

Ways to actually kill from Container B:

[Code listing: Bash, 10 lines]

Key insight: PID namespace isolation works in both directions between sibling containers: neither can see the other's processes. That is both a security feature and a source of confusion.

💻 Exercises

Exercise 1: Explore Your Container's Namespaces

⭐ Difficulty: Easy | ⏱️ Time: 15 minutes

[Code listing: Bash, 13 lines]

Expected findings: PID, mount, UTS, IPC, network namespaces should differ. User namespace may be shared (unless using user namespace remapping).

Exercise 2: Watch OOM Killer in Action

⭐⭐ Difficulty: Medium | ⏱️ Time: 20 minutes

[Code listing: Bash, 23 lines]

Exercise 3: Create Container "By Hand" with unshare

⭐⭐⭐ Difficulty: Hard | ⏱️ Time: 30 minutes

Create a minimal container without Docker using Linux primitives:

[Code listing: Bash, 19 lines]

Challenge: Why doesn't ps work? Fix it by creating the necessary directories.

Exercise 4: Measure Namespace Overhead

⭐⭐⭐ Difficulty: Hard | ⏱️ Time: 25 minutes

[Code listing: Bash, 18 lines]

Exercise 5: Debug Zombie Processes

⭐⭐⭐⭐ Difficulty: Expert | ⏱️ Time: 30 minutes

[Code listing: Bash, 33 lines]

🎤 Senior-Level Interview Questions

Q1: What's the difference between a container and a VM?

Strong Answer:

"The fundamental difference is the isolation level. VMs virtualize hardware - each VM runs its own kernel, booting a full OS. The hypervisor manages hardware sharing.

Containers virtualize the OS - they share the host kernel but use Linux namespaces for isolation and cgroups for resource limits. This means:

  1. Startup: Containers start in milliseconds (just a process), VMs in seconds/minutes (full boot)
  2. Overhead: Containers use MBs of memory overhead, VMs use GBs
  3. Isolation: VMs have stronger isolation (separate kernel). A host kernel vulnerability can expose every container on that host, while a guest kernel vulnerability is confined to one VM
  4. Density: You can run 100s of containers on a host, maybe 10s of VMs

For most workloads, containers are better for efficiency. VMs are better when you need strong multi-tenant isolation or different operating systems."

Q2: A container process shows as PID 1 inside but PID 5000 on the host. Explain.

Strong Answer:
"This is PID namespace isolation. When Docker creates a container, it uses the CLONE_NEWPID flag with the clone syscall to create a new PID namespace.

Inside this namespace, PIDs start from 1 - the first process becomes PID 1. But the kernel still tracks the 'real' PID in the root namespace.

The process has two PIDs simultaneously:

  • Namespace PID: 1 (what the container sees)
  • Host PID: 5000 (what the host sees)

This matters because PID 1 has special responsibilities: signal handling and zombie reaping. Most applications aren't designed for this, which is why we often use --init to run tini as PID 1 instead."

Q3: How do you debug a container that exits immediately with code 137?

Strong Answer:

"Exit code 137 is 128 + 9, meaning the process received SIGKILL. The usual cause is the OOM killer, although docker kill or a docker stop that timed out produces the same code.

My debugging steps:

  1. Confirm OOM: docker inspect --format '{{.State.OOMKilled}}' shows true
  2. Check memory limit: Was it set too low? docker inspect --format '{{.HostConfig.Memory}}'
  3. Check actual usage: docker stats before it dies, or kernel logs: dmesg | grep -i 'killed process'
  4. Analyze the app: Is it a memory leak, or does it legitimately need more?
  5. Fix options:
    • Increase limit if app needs it
    • Fix memory leak if there is one
    • For JVM: ensure -XX:+UseContainerSupport and set -XX:MaxRAMPercentage
    • Add memory monitoring/alerting before OOM

The key insight: OOM kills are silent from the container's perspective. No logs, no graceful shutdown, just SIGKILL."

Q4: Why might a process inside a container not see other processes also inside that container?

Strong Answer:

"If a container is running multiple processes and one can't see the others, there are a few possibilities:

  1. Different PID namespaces: If a process was started with additional isolation (for example via unshare), it might have its own PID namespace. Docker's --init keeps every process in the container's single PID namespace, but custom setups might not.
  2. Proc filesystem not mounted: If /proc isn't mounted (or mounted incorrectly), ps won't show processes. This happens in minimal containers.
  3. Incorrect /proc mount: If /proc was mounted from outside the namespace, it shows the wrong namespace's processes.
  4. Process actually died: Race condition - process started but exited before ps ran.

Debugging: Check /proc is mounted correctly with mount | grep proc. Use ls /proc to see if PIDs are visible. Check if there are multiple PID namespaces with lsns -t pid."

Q5: Explain what happens between docker run nginx and nginx actually serving requests.

Strong Answer:

"The full sequence involves multiple components:

  1. Docker CLI parses the command, communicates with Docker daemon via REST API over Unix socket
  2. Docker daemon (dockerd):
    • Checks image exists locally, pulls if not
    • Creates container metadata
    • Sets up networking (creates veth pair, connects to bridge)
    • Prepares storage (overlay mount)
  3. Containerd:
    • Receives request from dockerd
    • Creates OCI runtime bundle (config.json + rootfs)
    • Spawns containerd-shim
  4. Containerd-shim:
    • Stays running to monitor container
    • Survives containerd restarts
    • Calls runc
  5. Runc (OCI runtime):
    • Creates namespaces (clone syscall)
    • Sets up cgroups
    • Applies seccomp filters
    • Sets capabilities
    • Performs pivot_root to container's filesystem
    • Execs nginx
  6. Nginx starts as PID 1 in container (PID N on host)
  7. Network path:
    • Nginx binds to port 80 inside container
    • Docker sets up iptables DNAT rule
    • External traffic to host:8080 → container:80

Total time: ~100-500ms for a cached image."


📝 Summary & Key Takeaways

Core Concepts

Concept | Key Point
Containers | Processes with isolation, not VMs
Namespaces | Provide resource isolation (PID, network, mount, etc.)
Cgroups | Provide resource limits (memory, CPU, I/O)
PID 1 | Has special responsibilities - use --init
OCI Runtime | Standardized container creation (runc)

The Container Equation

Container = Linux Process
          + Namespaces (isolation)
          + Cgroups (resource limits)
          + Overlay filesystem (image)

What You Can Do Now

  1. Debug namespace issues: Use nsenter and lsns to inspect containers
  2. Understand resource limits: Know why OOM happens (exit code 137)
  3. Avoid zombie processes: Use --init or proper signal handling
  4. Explain containers technically: Move beyond "lightweight VMs" explanation

📋 Quick Reference

Namespace Types

Namespace | Creates Isolation For
pid | Process IDs
net | Network stack
mnt | Mount points
uts | Hostname
ipc | IPC resources
user | User/group IDs
cgroup | Cgroup root
time | System clocks

Key Commands

[Code listing: Bash, 14 lines]

Exit Codes

Code | Signal | Meaning
0 | - | Normal exit
1 | - | Application error
137 | SIGKILL (9) | OOM killed, docker kill, or docker stop timeout
143 | SIGTERM (15) | Terminated by SIGTERM (e.g. docker stop)
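
The 128 + N convention in the table can be decoded mechanically. A minimal sketch, assuming a POSIX shell whose kill -l accepts a signal number; describe_exit is an illustrative helper name.

```shell
# describe_exit CODE: explain a container exit status.
# Statuses above 128 conventionally mean "terminated by signal CODE - 128".
describe_exit() {
  if [ "$1" -gt 128 ]; then
    sig=$(( $1 - 128 ))
    echo "terminated by signal $sig (SIG$(kill -l "$sig"))"
  else
    echo "exited with status $1"
  fi
}

describe_exit 137
describe_exit 143
describe_exit 1
```

The status to feed it can come from docker inspect --format '{{.State.ExitCode}}' on a stopped container.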

📅 Review Schedule

Day | Task | Time
Day 1 | Re-read namespace section, draw diagram from memory | 10 min
Day 3 | Do Exercise 1 (explore namespaces) | 15 min
Day 7 | Explain container vs VM to a colleague | 5 min
Day 14 | Do Exercise 2 (OOM killer) | 20 min
Day 30 | Answer interview questions without looking | 15 min

📚 Series Navigation

Docker Compendium Series

Previous | Current | Next
Part 0: How to Use This Series | Part 1: Container Internals | Part 2: Image Anatomy