Debugging Containers
📋 At a Glance
| Aspect | Details |
|---|---|
| Difficulty | 🟡 Intermediate |
| Prerequisites | Part 1 (Container Internals), Part 11 (Logging) |
| Key Tools | exec, logs, inspect, nsenter, strace, tcpdump |
| Time Investment | 30 minutes read + 60 minutes practice |
| Payoff | Debug any container issue in minutes, not hours |
🎯 What You'll Learn
After this article, you'll be able to:
- Debug running containers using exec, logs, and inspect effectively
- Investigate crashed containers and extract information from dead containers
- Use advanced tools like nsenter and strace for deep debugging
- Debug network issues with tcpdump and netstat inside containers
- Perform post-mortem analysis on containers that won't start
🔥 Production Story: The Heisenberg Container
The Setup: A Java microservice crashes every few hours in production. When developers SSH into the server and exec into the container, it works fine. It only crashes when nobody's watching.
The Investigation:
The Root Cause: The JVM wasn't using container-aware settings. It detected the host's 64GB of RAM and sized metaspace, code cache, and thread stacks accordingly, so total memory exceeded the 2GB container limit.
The Fix:
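The original Dockerfile is not preserved here, so the following is only a sketch of the reasoning behind a fix like this: add up the JVM's main memory regions and compare the total with the container limit. The numbers and the image name my-java-app are hypothetical. The flags in the comment are standard HotSpot options, but the right values depend on the workload.

```bash
# Rough JVM footprint estimate in MiB: heap + metaspace + code cache + thread stacks.
# All inputs are illustrative numbers, not values from the incident above.
jvm_footprint_mb() {
  local heap_mb="$1" metaspace_mb="$2" codecache_mb="$3" threads="$4" stack_kb="$5"
  echo $(( heap_mb + metaspace_mb + codecache_mb + threads * stack_kb / 1024 ))
}

# Report whether an estimated footprint fits under a container memory limit (MiB).
fits_in_limit() {
  local footprint_mb="$1" limit_mb="$2"
  if [ "$footprint_mb" -le "$limit_mb" ]; then
    echo "ok: ${footprint_mb}MiB <= ${limit_mb}MiB"
  else
    echo "over: ${footprint_mb}MiB > ${limit_mb}MiB"
  fi
}

# One way to cap each region (hypothetical image name "my-java-app").
# JAVA_TOOL_OPTIONS is read by the JVM at startup:
#   docker run -m 2g \
#     -e JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0 -XX:MaxMetaspaceSize=256m -XX:ReservedCodeCacheSize=128m -Xss512k" \
#     my-java-app
```

For example, `fits_in_limit "$(jvm_footprint_mb 1536 256 128 400 512)" 2048` reports the estimate as over the limit, because 400 thread stacks of 512KiB add another 200MiB on top of heap, metaspace, and code cache.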
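Lesson Learned: Exit code 137 means the process received SIGKILL, and the OOM killer is the most common sender; confirm with the OOMKilled flag before assuming. The container worked when developers exec'd in because the JVM was already initialized, so their debugging session didn't trigger further memory allocation.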
🧠 Mental Model: The Debugging Hierarchy
┌─────────────────────────────────────────────────────────────────┐
│                        DEBUGGING LEVELS                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Level 1: OBSERVATION (90% of issues)                           │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ docker logs     → Application output                    │    │
│  │ docker ps -a    → Container state & exit codes          │    │
│  │ docker inspect  → Full configuration                    │    │
│  │ docker stats    → Real-time resource usage              │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                ↓                                │
│  Level 2: INTERACTION (8% of issues)                            │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ docker exec     → Run commands inside container         │    │
│  │ docker cp       → Copy files in/out                     │    │
│  │ docker diff     → See filesystem changes                │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                ↓                                │
│  Level 3: DEEP INSPECTION (2% of issues)                        │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ nsenter         → Enter container namespaces from host  │    │
│  │ strace          → Trace system calls                    │    │
│  │ tcpdump         → Capture network traffic               │    │
│  │ /proc           → Direct process inspection             │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Key Insight: Always start at Level 1. Most issues are obvious from logs and exit codes. Only escalate when simpler tools don't reveal the problem.
🔬 Deep Dive
1. Basic Debugging: logs, ps, inspect
Understanding Exit Codes:
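As a sketch of how exit codes decode (the helper name explain_exit_code is made up for this example): codes above 128 are 128 plus the signal number, and 125 to 127 are reserved by docker run and the shell.

```bash
# Translate a container exit code into a short explanation.
explain_exit_code() {
  local code="$1" sig name
  case "$code" in
    0)   echo "normal exit" ;;
    125) echo "docker run itself failed" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    *)
      if [ "$code" -gt 128 ] && [ "$code" -lt 256 ]; then
        # Exit codes above 128 encode the fatal signal: code = 128 + signal number.
        sig=$(( code - 128 ))
        name=$(kill -l "$sig" 2>/dev/null) || name="UNKNOWN"
        echo "killed by signal ${sig} (SIG${name})"
      else
        echo "application error (exit ${code})"
      fi
      ;;
  esac
}

# Usage with a real container (name "web" is a placeholder):
#   explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' web)"
```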
Advanced Log Analysis:
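A minimal sketch of time-windowed log filtering, assuming the application writes plain-text lines to stdout/stderr. The function names and the keyword list are illustrative; the docker logs flags used (--since, --tail, --timestamps) are standard.

```bash
# Show recent log lines that look like errors, with timestamps.
recent_errors() {
  local container="$1" since="${2:-10m}"
  # docker logs writes the container's stderr to our stderr, so merge both streams.
  docker logs --timestamps --since "$since" "$container" 2>&1 \
    | grep -iE 'error|exception|fatal|panic'
}

# Show the last N lines, which also works for a stopped container.
last_lines() {
  local container="$1" n="${2:-50}"
  docker logs --tail "$n" "$container" 2>&1
}
```

For example, `recent_errors web 30m` narrows a noisy log down to the lines worth reading first.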
Inspect Deep Dive:
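Rather than reading the full inspect JSON, a Go template can pull out just the fields that matter during triage. This is a sketch with made-up function names; the fields themselves are part of the standard docker inspect output.

```bash
# One-line summary of why and when a container stopped.
container_state() {
  docker inspect --format \
    'status={{.State.Status}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}} restarts={{.RestartCount}} finished={{.State.FinishedAt}}' \
    "$1"
}

# Configured resource limits (0 means unlimited).
container_limits() {
  docker inspect --format \
    'memory={{.HostConfig.Memory}} nano_cpus={{.HostConfig.NanoCpus}}' \
    "$1"
}
```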
2. Interactive Debugging with exec
Basic exec Patterns:
Debugging Without Shell (distroless images):
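One common approach for shell-less images is a throwaway debug container that joins the target's PID and network namespaces. The sketch below assumes the nicolaka/netshoot image is acceptable in your environment; the function name is illustrative.

```bash
# Start a tool-rich container that shares the target's PID and network namespaces.
# With no extra arguments the image's default shell starts; otherwise the
# arguments are run as a one-off command.
debug_sidecar() {
  local target="$1"; shift
  docker run --rm -it \
    --pid "container:${target}" \
    --network "container:${target}" \
    nicolaka/netshoot "$@"
}
```

Inside the sidecar, `ps aux` lists the target's processes. The target's filesystem is usually reachable under /proc/1/root, provided the sidecar has enough privileges to read it.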
Process Inspection:
3. Network Debugging
Container Network Analysis:
Using netshoot for Deep Network Debug:
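As a sketch of a packet capture taken from a netshoot container attached to the target's network namespace: the function name, default filter, and packet count are assumptions for illustration, and the added capabilities may be unnecessary on some hosts.

```bash
# Capture a bounded number of packets as seen by the target container.
capture_traffic() {
  local target="$1" filter="${2:-tcp port 80}" count="${3:-100}"
  docker run --rm \
    --network "container:${target}" \
    --cap-add NET_ADMIN --cap-add NET_RAW \
    nicolaka/netshoot \
    tcpdump -i any -nn -c "$count" "$filter"
}
```

The -c limit matters in production: an unbounded capture on a busy interface can itself become the incident.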
Debugging DNS Issues:
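A minimal DNS check, assuming the image ships cat and getent (many do, distroless ones do not). The function name is illustrative. On user-defined networks you would normally expect to see Docker's embedded resolver, 127.0.0.11, in resolv.conf.

```bash
# Print the resolver configuration and try to resolve one name from inside the container.
dns_report() {
  local container="$1" name="$2"
  echo "--- /etc/resolv.conf"
  docker exec "$container" cat /etc/resolv.conf
  echo "--- lookup ${name}"
  docker exec "$container" getent hosts "$name" \
    || echo "lookup failed for ${name}"
}
```

If a service name fails to resolve, check whether both containers sit on the same user-defined network; the default bridge network does not resolve container names.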
4. Advanced Debugging: nsenter
nsenter lets you enter a container's namespaces directly from the host, which is essential when the container has no debugging tools.
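The basic pattern can be sketched as follows: look up the container's init PID on the host, then run a host binary inside that process's network namespace. Function names are illustrative, and nsenter needs root on the host.

```bash
# Host PID of the container's init process (0 if the container is not running).
container_pid() {
  docker inspect --format '{{.State.Pid}}' "$1"
}

# Run a host command inside the container's network namespace.
in_container_netns() {
  local container="$1" pid
  shift
  pid=$(container_pid "$container") || return 1
  if [ -z "$pid" ] || [ "$pid" = "0" ]; then
    echo "container ${container} is not running" >&2
    return 1
  fi
  # Only the network namespace is entered, so tools come from the host filesystem.
  nsenter --target "$pid" --net "$@"
}
```

For example, `in_container_netns web ss -tlnp` lists the container's listening sockets using the host's ss, even if the image has no ss of its own. Adding --mount or --pid to the nsenter call enters those namespaces as well.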
Namespace-specific Debugging:
5. System Call Tracing with strace
Basic strace Usage:
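A sketch of attaching strace from the host to a containerized process, assuming strace is installed on the host and you have root (or ptrace permission). The function name and default syscall class are illustrative.

```bash
# Attach strace to the container's main process and its children.
# class is an strace syscall class such as network, file, or process.
trace_container() {
  local container="$1" class="${2:-network}" pid
  pid=$(docker inspect --format '{{.State.Pid}}' "$container") || return 1
  # -f follows forks/threads, -tt adds microsecond timestamps.
  strace -f -tt -p "$pid" -e "trace=${class}"
}
```

Tracing slows the target noticeably, so keep sessions short on production hosts.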
Common strace Patterns:
Debugging Startup Issues:
6. Debugging Crashed Containers
Extract Information from Dead Containers:
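All of logs, inspect, diff, and export work on a stopped container, as long as it has not been removed. A minimal evidence-collection sketch (function name and output layout are assumptions):

```bash
# Save everything recoverable from a stopped container into a directory.
preserve_dead_container() {
  local container="$1" outdir="${2:-./postmortem-$1}"
  mkdir -p "$outdir" || return 1
  docker logs "$container" > "${outdir}/logs.txt" 2>&1
  docker inspect "$container" > "${outdir}/inspect.json"
  docker diff "$container" > "${outdir}/diff.txt"
  # Full filesystem snapshot as a tar archive.
  docker export -o "${outdir}/rootfs.tar" "$container"
}
```

Run it before any cleanup job or docker rm gets to the container; once the container is removed, none of this can be recovered.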
Debugging Containers That Won't Start:
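When the entrypoint itself is what fails, two common approaches are to bypass it or to snapshot the failed container and look around. The sketch below assumes the image contains sh and that the container name is lowercase (image names must be). Function names and the debug- tag prefix are made up.

```bash
# Start the image with a shell instead of its normal entrypoint.
shell_into_image() {
  local image="$1"; shift
  docker run --rm -it --entrypoint sh "$image" "$@"
}

# Snapshot a failed container's filesystem and open a shell in the snapshot.
shell_into_failed() {
  local container="$1" tag="debug-$1:snapshot"
  docker commit "$container" "$tag" > /dev/null || return 1
  docker run --rm -it --entrypoint sh "$tag"
}
```

The snapshot route keeps files the container wrote before it died, which the pristine image will not have.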
7. Resource Debugging
Memory Issues:
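To see how close a container is to its limit, you can read the cgroup files from inside it. This sketch tries the cgroup v2 paths first and falls back to v1. It assumes the image has cat, and the function names are illustrative.

```bash
# Percentage of the memory limit in use; "max" or 0 means no limit is set.
mem_percent() {
  local used="$1" limit="$2"
  if [ "$limit" = "max" ] || [ "$limit" -le 0 ] 2>/dev/null; then
    echo "unlimited"
    return 0
  fi
  echo $(( used * 100 / limit ))
}

# Read usage and limit from the container's cgroup files (v2 first, then v1).
container_mem_percent() {
  local c="$1" used limit
  if used=$(docker exec "$c" cat /sys/fs/cgroup/memory.current 2>/dev/null); then
    limit=$(docker exec "$c" cat /sys/fs/cgroup/memory.max) || return 1
  else
    used=$(docker exec "$c" cat /sys/fs/cgroup/memory/memory.usage_in_bytes) || return 1
    limit=$(docker exec "$c" cat /sys/fs/cgroup/memory/memory.limit_in_bytes) || return 1
  fi
  mem_percent "$used" "$limit"
}
```

For a quick look without exec, `docker stats --no-stream` shows similar numbers, though its usage figure may exclude page cache.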
CPU Issues:
Disk I/O Issues:
8. Complete Debug Script
⚠️ Common Mistakes
Mistake 1: Ignoring Exit Codes
Mistake 2: Debugging in Production Without Caution
Mistake 3: Not Preserving Dead Container State
Mistake 4: Exec-ing as Wrong User
Mistake 5: Missing Container Restart Patterns
🐛 Debug This
You're on call and get paged: a container keeps restarting.
The exit code is 137, but OOMKilled is false and memory is at 60% of the limit. What's happening?
Analysis:
- Exit 137 = SIGKILL, but not from OOM killer
- Memory usage (1.2GB) well under limit (2GB)
- Logs show normal startup, no error
Possible causes of non-OOM SIGKILL:
- Health check failing → an orchestrator or watchdog kills the container (standalone Docker only marks it unhealthy)
- docker kill or docker stop timeout
- Kubernetes liveness probe failing
- External process killing the container
Investigation:
Root Cause: The application logs "Server ready" before the socket is actually bound. The health check starts immediately, can't connect, and marks the container unhealthy; whatever enforces health status (Swarm, or a watchdog on a standalone host) then kills it.
Fix: Either fix the application so it does not log "ready" until the socket is bound, or give the health check a startup grace period.
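The original YAML is not preserved here. As a sketch, a startup grace period can also be set with docker run flags. The health endpoint, port, and timings below are hypothetical, and the check assumes curl exists in the image.

```bash
# Run a container whose health check ignores failures during the first 60 seconds.
run_with_grace_period() {
  local image="$1" name="$2"
  docker run -d --name "$name" \
    --health-cmd 'curl -fsS http://localhost:8080/health || exit 1' \
    --health-interval 10s \
    --health-timeout 3s \
    --health-retries 3 \
    --health-start-period 60s \
    "$image"
}

# Show the exit code and output of recent health probes.
health_history() {
  docker inspect --format \
    '{{range .State.Health.Log}}{{.ExitCode}} {{.Output}}{{end}}' "$1"
}
```

Failures during the start period do not count toward the retry limit, which gives slow-starting applications time to bind their socket.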
💻 Exercises
Exercise 1: Basic Debug Investigation
Deploy a container with an intentional issue and debug it:
Exercise 2: Network Debugging
Create two containers that should communicate but can't:
Exercise 3: Memory Investigation
Exercise 4: Deep Debug with nsenter
Exercise 5: Post-mortem Analysis
🎤 Interview Questions
Q1: A container exits with code 137. What does this mean and how do you investigate?
Answer: Exit code 137 indicates the process received SIGKILL (128 + 9). This typically means:
- OOM Killer - Most common cause:
- Health check failure - An orchestrator (Swarm, Kubernetes) or watchdog kills unhealthy containers; standalone Docker only marks them unhealthy:
- External kill - docker kill, orchestrator, or system process:
- Timeout during stop - Container didn't respond to SIGTERM:
Investigation steps:
- Check OOMKilled flag first
- Check health status if configured
- Check docker events for who killed it
- Check dmesg and journalctl for system-level kills
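The investigation steps above can be sketched as a small triage helper. Function names and message wording are illustrative; the inspect fields and docker events filters are standard.

```bash
# Walk the exit-137 checklist and print the most likely cause.
triage_137() {
  local c="$1" oom health
  oom=$(docker inspect --format '{{.State.OOMKilled}}' "$c") || return 1
  if [ "$oom" = "true" ]; then
    echo "oom-kill: raise the memory limit or reduce usage"
    return 0
  fi
  # .State.Health is absent when no health check is configured.
  health=$(docker inspect --format \
    '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' "$c")
  if [ "$health" = "unhealthy" ]; then
    echo "unhealthy: check whether an orchestrator or watchdog killed it"
    return 0
  fi
  echo "external kill or stop timeout: check docker events, dmesg, journalctl"
}

# List kill/die/oom events for a container over a recent window (default 1h).
kill_events() {
  docker events --since "${2:-1h}" --until "$(date +%s)" \
    --filter "container=$1" \
    --filter event=kill --filter event=die --filter event=oom
}
```

Passing --until makes docker events print the window and exit instead of streaming indefinitely.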
Q2: How do you debug a container that uses a distroless base image?
Answer: Distroless images have no shell or debugging tools. Strategies:
- Sidecar debug container (production-safe):
- nsenter from host (requires host access):
- Debug variant images (for development):
- Build debug layer (for your images):
- Export and analyze filesystem:
Q3: What's the difference between docker logs and checking /var/log inside the container?
Answer: They serve different purposes:
docker logs:
- Captures the stdout/stderr of PID 1 and of any child processes that inherit those streams
- Managed by Docker logging driver
- Persists based on log rotation settings
- Works even after container stops
- Example: docker logs --since 5m container
/var/log inside container:
- Application-specific log files
- Not captured by Docker logging
- Lost when container is removed (unless volume mounted)
- Requires exec or cp to access
- Example: docker exec container cat /var/log/app/error.log
Best practices:
- 12-factor apps should log to stdout/stderr for Docker to capture
- Legacy apps writing to files need log forwarders or volume mounts
- Production should use centralized logging (ELK, Loki, CloudWatch)
Q4: How would you trace why a containerized application can't connect to an external service?
Answer: Systematic network debugging approach:
Common causes:
- DNS resolution failure (check /etc/resolv.conf)
- Network policy/firewall blocking egress
- Proxy configuration missing
- TLS/certificate issues
- Container network mode wrong (host vs bridge)
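A layered check, sketched with netshoot tools run inside the target's network namespace, can show which of these causes applies by reporting the first layer that fails. Function names are illustrative and the host argument is assumed to be a hostname rather than an IP. The TLS step uses netshoot's CA bundle and environment, not the application's, so proxy settings and custom certificates still need a separate check.

```bash
# Run a netshoot tool inside the target container's network namespace.
_netshoot() {
  local target="$1"; shift
  docker run --rm --network "container:${target}" nicolaka/netshoot "$@"
}

# Check DNS, then TCP, then TLS, and stop at the first failing layer.
check_egress() {
  local c="$1" host="$2" port="${3:-443}"
  # dig exits 0 even with no answer, so test for non-empty output instead.
  _netshoot "$c" dig +short "$host" | grep -q . \
    || { echo "dns: cannot resolve ${host}"; return 1; }
  _netshoot "$c" nc -z -w 3 "$host" "$port" \
    || { echo "tcp: cannot connect to ${host}:${port}"; return 2; }
  _netshoot "$c" curl -sS -o /dev/null --max-time 5 "https://${host}:${port}/" \
    || { echo "tls: connection or certificate check failed for ${host}:${port}"; return 3; }
  echo "ok: ${host}:${port} reachable from ${c}"
}
```

The return code tells a calling script which layer failed: 1 for DNS, 2 for TCP, 3 for TLS.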
Q5: Explain how you would perform post-mortem analysis on a container that crashed and won't restart.
Answer: Systematic post-mortem process:
Key things to look for:
- Exit code meaning (137=SIGKILL, 139=SIGSEGV, etc.)
- Last log entries before crash
- Modified files (docker diff)
- Core dumps if segfault
- Resource exhaustion evidence
📝 Summary & Key Takeaways
Debugging Hierarchy
- Level 1: logs, ps, inspect, stats - solves 90% of issues
- Level 2: exec, cp, diff - interactive investigation
- Level 3: nsenter, strace, tcpdump - deep system analysis
Exit Code Quick Reference
| Code | Signal | Meaning |
|---|---|---|
| 0 | - | Normal exit |
| 1 | - | Application error |
| 137 | SIGKILL | OOM or forced kill |
| 139 | SIGSEGV | Segmentation fault |
| 143 | SIGTERM | Graceful shutdown |
Essential Commands
📋 Quick Reference
📅 Review Schedule
Master debugging through spaced repetition:
- Day 1: Practice logs, ps, inspect on running containers
- Day 3: Debug intentionally broken containers
- Day 7: Network debugging exercises
- Day 14: nsenter and strace deep dive
- Day 30: Full post-mortem analysis on complex failure
📚 Series Navigation
- Previous: Part 12 - Container Security Hardening
- Next: Part 14 - Compose File Deep Dive
- Index: Docker Compendium Series