Debugging Containers

📋 At a Glance

Aspect	Details
Difficulty	🟡 Intermediate
Prerequisites	Part 1 (Container Internals), Part 11 (Logging)
Key Tools	exec, logs, inspect, nsenter, strace, tcpdump
Time Investment	30 minutes read + 60 minutes practice
Payoff	Debug any container issue in minutes, not hours

🎯 What You'll Learn

After this article, you'll be able to:

Debug running containers using exec, logs, and inspect effectively
Investigate crashed containers and extract information from dead containers
Use advanced tools like nsenter and strace for deep debugging
Debug network issues with tcpdump and netstat inside containers
Perform post-mortem analysis on containers that won't start

🔥 Production Story: The Heisenberg Container

The Setup: A Java microservice crashes every few hours in production. When developers SSH into the server and exec into the container, it works fine. It only crashes when nobody's watching.

The Investigation:

BASH(21 lines)

The Root Cause: The JVM wasn't using container-aware settings. It detected the host's 64GB of RAM and sized metaspace, code cache, and thread stacks accordingly, so total memory use exceeded the 2GB container limit.

The Fix:

DOCKERFILE(3 lines)

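Lesson Learned: Exit code 137 means the process received SIGKILL, and the OOM killer is the most common sender, so check the OOMKilled flag before assuming anything else. The container worked when developers exec'd in because the JVM was already initialized - their debugging session didn't trigger more memory allocation.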
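
One way to confirm whether an exit code 137 really came from the OOM killer is to read the container's recorded state. The sketch below assumes the docker CLI is on PATH; the function name check_oom and the container name in the usage line are hypothetical.

```shell
# Report whether a container's last exit was an OOM kill.
# Works on stopped containers too, as long as they have not been removed.
check_oom() {
  local name="$1"
  local exit_code oom_killed
  exit_code=$(docker inspect --format '{{.State.ExitCode}}' "$name") || return 1
  oom_killed=$(docker inspect --format '{{.State.OOMKilled}}' "$name") || return 1

  if [ "$oom_killed" = "true" ]; then
    echo "$name: exit $exit_code, killed by the OOM killer"
  elif [ "$exit_code" = "137" ]; then
    echo "$name: exit 137 (SIGKILL) but OOMKilled=false; look for another killer"
  else
    echo "$name: exit $exit_code, no OOM kill recorded"
  fi
}

# Usage (hypothetical container name):
#   check_oom my-java-app
```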

🧠 Mental Model: The Debugging Hierarchy

┌─────────────────────────────────────────────────────────────────┐
│                    DEBUGGING LEVELS                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Level 1: OBSERVATION (90% of issues)                          │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  docker logs    → Application output                    │   │
│   │  docker ps -a   → Container state & exit codes          │   │
│   │  docker inspect → Full configuration                    │   │
│   │  docker stats   → Real-time resource usage              │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↓                                     │
│   Level 2: INTERACTION (8% of issues)                           │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  docker exec    → Run commands inside container         │   │
│   │  docker cp      → Copy files in/out                     │   │
│   │  docker diff    → See filesystem changes                │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↓                                     │
│   Level 3: DEEP INSPECTION (2% of issues)                       │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  nsenter        → Enter container namespaces from host  │   │
│   │  strace         → Trace system calls                    │   │
│   │  tcpdump        → Capture network traffic               │   │
│   │  /proc          → Direct process inspection             │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Insight: Always start at Level 1. Most issues are obvious from logs and exit codes. Only escalate when simpler tools don't reveal the problem.
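
As a rough illustration of a Level 1 pass, the four observation commands can be wrapped in one helper. This is a minimal sketch, not a definitive script: level1_triage is a hypothetical name, and the 50-line log tail is an arbitrary choice.

```shell
# Level 1 triage: observe only, change nothing.
level1_triage() {
  local name="$1"
  echo "== state =="
  docker ps -a --filter "name=$name" --format '{{.Names}}\t{{.Status}}'
  echo "== exit code / OOM flag =="
  docker inspect --format 'exit={{.State.ExitCode}} oom={{.State.OOMKilled}} error={{.State.Error}}' "$name"
  echo "== last 50 log lines =="
  docker logs --tail 50 --timestamps "$name" 2>&1
  echo "== resource snapshot (only meaningful while running) =="
  docker stats --no-stream "$name"
}

# Usage (hypothetical container name):
#   level1_triage web
```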

🔬 Deep Dive

1. Basic Debugging: logs, ps, inspect

Understanding Exit Codes:

BASH(17 lines)

Advanced Log Analysis:

BASH(14 lines)

Inspect Deep Dive:

BASH(23 lines)

2. Interactive Debugging with exec

Basic exec Patterns:

BASH(13 lines)

Debugging Without Shell (distroless images):

BASH(11 lines)

Process Inspection:

BASH(14 lines)

3. Network Debugging

Container Network Analysis:

BASH(18 lines)

Using netshoot for Deep Network Debug:

BASH(22 lines)

Debugging DNS Issues:

BASH(17 lines)

4. Advanced Debugging: nsenter

nsenter lets you enter container namespaces directly from the host - essential when the container has no debugging tools.
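
A minimal sketch of that idea, assuming root on the host and the util-linux nsenter binary: look up the container's init PID with docker inspect, then run a host tool inside only its network namespace. The helper name netns_exec is hypothetical.

```shell
# Run a host-side command inside a container's network namespace.
# Must be run as root; the tools (ss, tcpdump, ...) come from the host, not the image.
netns_exec() {
  local name="$1"; shift
  local pid
  pid=$(docker inspect --format '{{.State.Pid}}' "$name") || return 1
  # State.Pid is 0 when the container is not running.
  if [ -z "$pid" ] || [ "$pid" = "0" ]; then
    echo "container $name is not running" >&2
    return 1
  fi
  # --net enters the network namespace only; the filesystem stays the host's.
  nsenter --target "$pid" --net "$@"
}

# Usage (hypothetical container name):
#   netns_exec web ss -tlnp
```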

BASH(16 lines)

Namespace-specific Debugging:

BASH(15 lines)

5. System Call Tracing with strace

Basic strace Usage:

BASH(18 lines)

Common strace Patterns:

BASH(14 lines)

Debugging Startup Issues:

BASH(6 lines)

6. Debugging Crashed Containers

Extract Information from Dead Containers:

BASH(20 lines)

Debugging Containers That Won't Start:

BASH(18 lines)

7. Resource Debugging

Memory Issues:

BASH(16 lines)

CPU Issues:

BASH(15 lines)

Disk I/O Issues:

BASH(11 lines)

8. Complete Debug Script

BASH(97 lines)

⚠️ Common Mistakes

Mistake 1: Ignoring Exit Codes

BASH(9 lines)

Mistake 2: Debugging in Production Without Caution

BASH(9 lines)

Mistake 3: Not Preserving Dead Container State

BASH(9 lines)

Mistake 4: Exec-ing as Wrong User

BASH(8 lines)

Mistake 5: Missing Container Restart Patterns

BASH(11 lines)

🐛 Debug This

You're on-call and get paged. Container keeps restarting:

BASH(19 lines)

Exit code 137 but OOMKilled is false. Memory at 60% of limit. What's happening?


Analysis:

Exit 137 = SIGKILL, but not from OOM killer
Memory usage (1.2GB) well under limit (2GB)
Logs show normal startup, no error

Possible causes of non-OOM SIGKILL:

Health check failing → orchestrator or watchdog kills the container (standalone Docker only marks it unhealthy)
docker kill or docker stop timeout
Kubernetes liveness probe failing
External process killing the container
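
To test the health-check theory first, the recorded probe results can be read straight from the container state. The sketch below is one possible way to do it; health_report is a hypothetical name, and the template guards against containers that have no health check configured.

```shell
# Print health status and the recorded probe results for a container.
health_report() {
  local name="$1"
  # .State.Health is absent when no HEALTHCHECK is configured, hence the guard.
  docker inspect --format \
    '{{if .State.Health}}status={{.State.Health.Status}} failing_streak={{.State.Health.FailingStreak}}{{else}}no healthcheck configured{{end}}' \
    "$name"
  # Exit code and captured output of the most recent probe runs.
  docker inspect --format \
    '{{if .State.Health}}{{range .State.Health.Log}}exit={{.ExitCode}} output={{.Output}}{{println}}{{end}}{{end}}' \
    "$name"
}

# Usage (hypothetical container name):
#   health_report api
```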

Investigation:

BASH(16 lines)

Root Cause: The application logs "Server ready" before the socket is actually bound. The health check starts immediately, can't connect, and marks the container unhealthy, after which the container is killed and restarted.

Fix: Either fix the application to not log "ready" until socket is bound, or add startup grace period:

YAML(6 lines)

💻 Exercises

Exercise 1: Basic Debug Investigation

Deploy a container with an intentional issue and debug it:

BASH(4 lines)

Exercise 2: Network Debugging

Create two containers that should communicate but can't:

BASH(5 lines)

Exercise 3: Memory Investigation

BASH(9 lines)

Exercise 4: Deep Debug with nsenter

BASH(7 lines)

Exercise 5: Post-mortem Analysis

BASH(10 lines)

🎤 Interview Questions

Q1: A container exits with code 137. What does this mean and how do you investigate?

Answer: Exit code 137 indicates the process received SIGKILL (128 + 9). This typically means:

OOM Killer - Most common cause:

BASH(2 lines)

Health check failure - an orchestrator such as Swarm (or a watchdog) kills unhealthy containers; standalone Docker only marks them unhealthy:

BASH(2 lines)

External kill - docker kill, orchestrator, or system process:

BASH(2 lines)

Timeout during stop - Container didn't respond to SIGTERM:

BASH

Investigation steps:

Check OOMKilled flag first
Check health status if configured
Check docker events for who killed it
Check dmesg and journalctl for system-level kills
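
The last two steps can be sketched as a pair of helpers, assuming the docker CLI and util-linux dmesg are available. The names kill_events and host_oom_log are hypothetical, the one-hour default window is arbitrary, and dmesg may need root on some hosts.

```shell
# List kill/die/oom events recorded for one container in a recent window.
# --until makes docker events return instead of streaming forever.
kill_events() {
  local name="$1" window="${2:-1h}"
  docker events \
    --since "$window" --until "$(date +%s)" \
    --filter "container=$name" \
    --filter event=kill --filter event=die --filter event=oom \
    --format '{{json .}}'
}

# Search the kernel log for OOM killer activity on the host.
host_oom_log() {
  dmesg -T | grep -iE 'out of memory|oom-kill|killed process'
}

# Usage (hypothetical container name):
#   kill_events api 30m
#   host_oom_log
```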

Q2: How do you debug a container that uses a distroless base image?

Answer: Distroless images have no shell or debugging tools. Strategies:

Sidecar debug container (production-safe):

BASH(3 lines)

nsenter from host (requires host access):

BASH(3 lines)

Debug variant images (for development):

DOCKERFILE(2 lines)

Build debug layer (for your images):

DOCKERFILE(6 lines)

Export and analyze filesystem:

BASH(2 lines)

Q3: What's the difference between docker logs and checking /var/log inside the container?

Answer: They serve different purposes:

docker logs:

Captures stdout/stderr from PID 1 (and children if properly configured)
Managed by Docker logging driver
Persists based on log rotation settings
Works even after container stops (until the container is removed)
Example: docker logs --since 5m container

/var/log inside container:

Application-specific log files
Not captured by Docker logging
Lost when container is removed (unless volume mounted)
Requires exec or cp to access
Example: docker exec container cat /var/log/app/error.log

Best practices:

12-factor apps should log to stdout/stderr for Docker to capture
Legacy apps writing to files need log forwarders or volume mounts
Production should use centralized logging (ELK, Loki, CloudWatch)
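
To make the contrast concrete, a small helper can print both sources side by side. This is a sketch under assumptions: recent_logs is a hypothetical name, the file path is whatever the application happens to write, and the exec half needs a running container whose image ships cat.

```shell
# Show recent stdout/stderr logs next to the tail of an in-container log file.
recent_logs() {
  local name="$1" file="$2" since="${3:-5m}"
  echo "== docker logs (stdout/stderr, last $since) =="
  docker logs --since "$since" "$name" 2>&1
  echo "== $file (inside the container, last 50 lines) =="
  # cat runs in the container; tail runs on the host.
  docker exec "$name" cat "$file" | tail -n 50
}

# Usage (hypothetical container name, path from the example above):
#   recent_logs container /var/log/app/error.log 5m
```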

BASH(4 lines)

Q4: How would you trace why a containerized application can't connect to an external service?

Answer: Systematic network debugging approach:

BASH(25 lines)

Common causes:

DNS resolution failure (check /etc/resolv.conf)
Network policy/firewall blocking egress
Proxy configuration missing
TLS/certificate issues
Container network mode wrong (host vs bridge)
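
Several of these causes can be checked in one pass by attaching a throwaway netshoot container to the target's network namespace. The sketch below assumes the nicolaka/netshoot image can be pulled and that the target container is running; netshoot_in and net_check are hypothetical helper names.

```shell
# Run one netshoot command inside the network namespace of container $1.
netshoot_in() {
  local name="$1"; shift
  docker run --rm --network "container:$name" nicolaka/netshoot "$@"
}

# Check DNS config, name resolution, TCP reachability, and proxy settings.
net_check() {
  local name="$1" host="$2" port="$3"
  echo "== DNS config =="
  netshoot_in "$name" cat /etc/resolv.conf
  echo "== DNS lookup =="
  netshoot_in "$name" nslookup "$host" || echo "DNS lookup failed"
  echo "== TCP connect =="
  netshoot_in "$name" nc -zv -w 5 "$host" "$port" || echo "TCP connect to $host:$port failed"
  echo "== proxy variables configured on the target =="
  # Read from the container config, so no binaries are needed inside the image.
  docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' "$name" \
    | grep -i '_proxy=' || echo "none set"
}

# Usage (hypothetical container and endpoint):
#   net_check api example.com 443
```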

Q5: Explain how you would perform post-mortem analysis on a container that crashed and won't restart.

Answer: Systematic post-mortem process:

BASH(31 lines)

Key things to look for:

Exit code meaning (137=SIGKILL, 139=SIGSEGV, etc.)
Last log entries before crash
Modified files (docker diff)
Core dumps if segfault
Resource exhaustion evidence
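
Most of these artifacts can be captured before anyone removes the dead container. The following is a minimal sketch, assuming the container still exists in docker ps -a; collect_postmortem, the output directory layout, and the 500-line log tail are all hypothetical choices.

```shell
# Save state, logs, filesystem changes, and a full filesystem export
# from a stopped container into one directory.
collect_postmortem() {
  local name="$1" out="${2:-./postmortem-$1}"
  mkdir -p "$out" || return 1
  docker inspect "$name" > "$out/inspect.json"
  docker inspect --format 'exit={{.State.ExitCode}} oom={{.State.OOMKilled}} finished={{.State.FinishedAt}}' \
    "$name" > "$out/state.txt"
  docker logs --timestamps --tail 500 "$name" > "$out/logs.txt" 2>&1
  # A=added, C=changed, D=deleted relative to the image.
  docker diff "$name" > "$out/diff.txt"
  # Full root filesystem as a tar archive, for offline inspection.
  docker export --output "$out/rootfs.tar" "$name"
  echo "post-mortem data written to $out"
}

# Usage (hypothetical container name):
#   collect_postmortem crashed-app ./pm-crashed-app
```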

📝 Summary & Key Takeaways

Debugging Hierarchy

Level 1: logs, ps, inspect, stats - solves 90% of issues
Level 2: exec, cp, diff - interactive investigation
Level 3: nsenter, strace, tcpdump - deep system analysis

Exit Code Quick Reference

Code	Signal	Meaning
0	-	Normal exit
1	-	Application error
137	SIGKILL	OOM or forced kill
139	SIGSEGV	Segmentation fault
143	SIGTERM	Graceful shutdown
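
The signal rows in the table follow the 128 + signal number convention, so they can be derived rather than memorized. Below is a small sketch of that arithmetic; describe_exit_code is a hypothetical helper, and the 125 to 127 cases are the exit statuses that docker run itself reserves.

```shell
# Translate a container exit code into a human-readable description.
describe_exit_code() {
  local code="$1" sig name
  case "$code" in
    0)   echo "0: normal exit" ;;
    125) echo "125: docker run itself failed" ;;
    126) echo "126: command found but not executable" ;;
    127) echo "127: command not found" ;;
    *)
      if [ "$code" -gt 128 ]; then
        # Codes above 128 encode the terminating signal: code = 128 + signal.
        sig=$((code - 128))
        if name=$(kill -l "$sig" 2>/dev/null); then
          name="SIG$name"
        else
          name="unknown signal"
        fi
        echo "$code: terminated by signal $sig ($name)"
      else
        echo "$code: application error"
      fi
      ;;
  esac
}

# Usage:
#   describe_exit_code 137
```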

Essential Commands

BASH(17 lines)

📋 Quick Reference

BASH(22 lines)

📅 Review Schedule

Master debugging through spaced repetition:

Day 1: Practice logs, ps, inspect on running containers
Day 3: Debug intentionally broken containers
Day 7: Network debugging exercises
Day 14: nsenter and strace deep dive
Day 30: Full post-mortem analysis on complex failure

Previous: Part 12 - Container Security Hardening
Next: Part 14 - Compose File Deep Dive
Index: Docker Compendium Series