Debugging Containers

📋 At a Glance

Aspect           Details
Difficulty       🟡 Intermediate
Prerequisites    Part 1 (Container Internals), Part 11 (Logging)
Key Tools        exec, logs, inspect, nsenter, strace, tcpdump
Time Investment  30 minutes read + 60 minutes practice
Payoff           Debug any container issue in minutes, not hours

🎯 What You'll Learn

After this article, you'll be able to:

  1. Debug running containers using exec, logs, and inspect effectively
  2. Investigate crashed containers and extract information from dead containers
  3. Use advanced tools like nsenter and strace for deep debugging
  4. Debug network issues with tcpdump and netstat inside containers
  5. Perform post-mortem analysis on containers that won't start

🔥 Production Story: The Heisenberg Container

The Setup: A Java microservice crashes every few hours in production. When developers SSH into the server and exec into the container, it works fine. It only crashes when nobody's watching.
The Investigation:
[Bash code block omitted, 21 lines]
The Root Cause: The JVM wasn't using container-aware settings. It detected the host's 64GB of RAM and sized metaspace, code cache, and thread stacks accordingly, so total memory use exceeded the container's 2GB limit.
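The arithmetic behind this root cause can be sketched in a few lines of shell. The helper names (`jvm_footprint_mb`, `fits_in_limit`) and every number below are hypothetical, not measurements from the incident; the point is only that heap is one of several JVM memory regions, so the sum can exceed a 2GB limit even when the heap alone fits.

```bash
#!/usr/bin/env bash
# Rough JVM footprint estimate (all sizes in MB unless noted).
# Figures are illustrative assumptions, not data from the incident.
jvm_footprint_mb() {
  local heap=$1 metaspace=$2 code_cache=$3 threads=$4 stack_kb=$5
  # Thread stacks: one stack per thread, converted from KB to MB.
  local stacks=$(( threads * stack_kb / 1024 ))
  echo $(( heap + metaspace + code_cache + stacks ))
}

fits_in_limit() {
  local total=$1 limit=$2
  if (( total <= limit )); then echo "fits"; else echo "exceeds limit"; fi
}

# Hypothetical sizing: 1536 MB heap, 256 MB metaspace, 240 MB code cache,
# 300 threads with 1024 KB stacks, against a 2048 MB container limit.
total=$(jvm_footprint_mb 1536 256 240 300 1024)
echo "estimated footprint: ${total} MB ($(fits_in_limit "$total" 2048))"
# prints: estimated footprint: 2332 MB (exceeds limit)
```

On JDK 10 and later (and 8u191 and later), container support is enabled by default and `-XX:MaxRAMPercentage` sizes the heap relative to the cgroup limit, which is usually the simplest way to keep the total inside the container's budget.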
The Fix:
[Dockerfile code block omitted, 3 lines]
Lesson Learned: Exit code 137 means the process received SIGKILL (128 + 9), and the OOM killer is the most common sender. The container worked when developers exec'd because the JVM was already initialized - their debugging session didn't trigger more memory allocation.

🧠 Mental Model: The Debugging Hierarchy

┌─────────────────────────────────────────────────────────────────┐
│                    DEBUGGING LEVELS                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Level 1: OBSERVATION (90% of issues)                          │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  docker logs    → Application output                    │   │
│   │  docker ps -a   → Container state & exit codes          │   │
│   │  docker inspect → Full configuration                    │   │
│   │  docker stats   → Real-time resource usage              │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↓                                     │
│   Level 2: INTERACTION (8% of issues)                           │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  docker exec    → Run commands inside container         │   │
│   │  docker cp      → Copy files in/out                     │   │
│   │  docker diff    → See filesystem changes                │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↓                                     │
│   Level 3: DEEP INSPECTION (2% of issues)                       │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  nsenter        → Enter container namespaces from host  │   │
│   │  strace         → Trace system calls                    │   │
│   │  tcpdump        → Capture network traffic               │   │
│   │  /proc          → Direct process inspection             │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Key Insight: Always start at Level 1. Most issues are obvious from logs and exit codes. Only escalate when simpler tools don't reveal the problem.
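A Level 1 pass can be wrapped in one read-only function. This is a minimal sketch: the function name `triage` is an assumption, and it uses only standard `docker ps`, `docker inspect`, `docker logs`, and `docker stats` options.

```bash
#!/usr/bin/env bash
# Level 1 triage: observe only, never modify the container.
triage() {
  local c=$1
  echo "== state =="
  docker ps -a --filter "name=$c" --format '{{.Names}}\t{{.Status}}'
  echo "== exit code / OOM / restarts =="
  docker inspect --format \
    'exit={{.State.ExitCode}} oom={{.State.OOMKilled}} restarts={{.RestartCount}}' "$c"
  echo "== last 50 log lines =="
  # docker logs replays the container's stderr on stderr; merge both streams.
  docker logs --tail 50 "$c" 2>&1
  echo "== resource snapshot =="
  # --no-stream prints a single sample instead of a live view.
  docker stats --no-stream "$c"
}

# Usage: triage my-service
```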

🔬 Deep Dive

1. Basic Debugging: logs, ps, inspect

Understanding Exit Codes:
[Bash code block omitted, 17 lines]
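As a hedged illustration of how exit codes decode, the sketch below maps a code to a short hint. The function name `explain_exit_code` is an assumption; the 125/126/127 meanings follow `docker run` conventions, and codes above 128 follow the shell convention of 128 plus the signal number.

```bash
#!/usr/bin/env bash
# Translate a container exit code into a short human-readable hint.
explain_exit_code() {
  local code=$1
  case "$code" in
    0)   echo "normal exit" ;;
    1)   echo "application error" ;;
    125) echo "docker run itself failed" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    *)
      if (( code > 128 && code < 256 )); then
        # Codes above 128 encode a fatal signal: code = 128 + signal number.
        local sig=$(( code - 128 )) name
        name=$(kill -l "$sig" 2>/dev/null) || name="?"
        echo "killed by signal $sig (SIG$name)"
      else
        echo "application-defined exit code"
      fi ;;
  esac
}

explain_exit_code 137   # killed by signal 9 (SIGKILL)
```

The exit code of a stopped container can be fed in directly, for example `explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' my-service)"`.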
Advanced Log Analysis:
[Bash code block omitted, 14 lines]
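One common log-analysis pattern is to narrow the time window and filter for error-like lines. The sketch below assumes a helper name (`recent_errors`) and a keyword list that you would adapt to your application's log format.

```bash
#!/usr/bin/env bash
# Show recent error-looking lines from a container's stdout and stderr.
recent_errors() {
  local c=$1 window=${2:-10m}
  # docker logs replays the container's stderr on our stderr, so merge
  # the two streams before filtering.
  docker logs --since "$window" --timestamps "$c" 2>&1 \
    | grep -iE 'error|exception|fatal|killed' \
    || echo "no matching lines in the last $window"
}

# Usage: recent_errors my-service 30m
```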
Inspect Deep Dive:
[Bash code block omitted, 23 lines]

2. Interactive Debugging with exec

Basic exec Patterns:
[Bash code block omitted, 13 lines]
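A typical exec pattern, sketched under the assumption that the image ships at least `sh`: probe for bash, fall back to sh, and optionally pick the user. The function name `exec_shell` is hypothetical.

```bash
#!/usr/bin/env bash
# Open an interactive shell, preferring bash and falling back to sh.
# The optional second argument picks the user (for example 0 for root).
exec_shell() {
  local c=$1 user=${2:-}
  local opts=(-it)
  if [[ -n $user ]]; then
    opts+=(--user "$user")
  fi
  if docker exec "$c" sh -c 'command -v bash' >/dev/null 2>&1; then
    docker exec "${opts[@]}" "$c" bash
  else
    docker exec "${opts[@]}" "$c" sh
  fi
}

# One-off, non-interactive checks do not need -it:
#   docker exec my-service env
#   docker exec --workdir /app my-service ls -la
```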
Debugging Without Shell (distroless images):
[Bash code block omitted, 11 lines]
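When the target image has no shell, one option is a throwaway tool container that joins the target's PID and network namespaces. This is a sketch, not the only approach; `debug_sidecar` is a hypothetical name, and `nicolaka/netshoot` is simply one commonly used tool image.

```bash
#!/usr/bin/env bash
# Attach a throwaway tool container to a target's PID and network namespaces.
# Any image with the tools you need works the same way.
debug_sidecar() {
  local target=$1 image=${2:-nicolaka/netshoot}
  docker run --rm -it \
    --pid "container:$target" \
    --network "container:$target" \
    "$image"
}

# Inside the sidecar, the target's processes are visible via ps, and its
# root filesystem is reachable through /proc/<pid>/root (permissions allowing).
# Usage: debug_sidecar my-distroless-app
```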
Process Inspection:
[Bash code block omitted, 14 lines]

3. Network Debugging

Container Network Analysis:
[Bash code block omitted, 18 lines]
Using netshoot for Deep Network Debug:
[Bash code block omitted, 22 lines]
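A bounded packet capture from a container's network namespace might look like the sketch below. The function name `capture_packets` and the default filter are assumptions; depending on the host's security settings, tcpdump may need additional capabilities.

```bash
#!/usr/bin/env bash
# Capture a bounded number of packets from a container's network namespace.
capture_packets() {
  local target=$1 filter=${2:-"port 80"} count=${3:-100}
  # -i any: all interfaces; -nn: no DNS/port name lookups;
  # -c: stop after N packets so the capture cannot run forever.
  docker run --rm --network "container:$target" nicolaka/netshoot \
    tcpdump -i any -nn -c "$count" "$filter"
}

# Usage: capture_packets my-service "tcp port 8080" 50
```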
Debugging DNS Issues:
[Bash code block omitted, 17 lines]

4. Advanced Debugging: nsenter

nsenter lets you enter a container's namespaces directly from the host, which is essential when the container has no debugging tools.
[Bash code block omitted, 16 lines]
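The usual nsenter workflow is to look up the container's host PID and then enter one of its namespaces with a host binary. A minimal sketch, assuming root on the host and hypothetical helper names (`container_pid`, `ns_net_exec`):

```bash
#!/usr/bin/env bash
# Run a host binary inside a container's network namespace.
# Needs root on the host, so call these functions from a root shell.
container_pid() {
  docker inspect --format '{{.State.Pid}}' "$1"
}

ns_net_exec() {
  local c=$1; shift
  local pid
  pid=$(container_pid "$c") || return 1
  # A stopped container reports Pid 0, which has no namespaces to enter.
  if [[ -z $pid || $pid == 0 ]]; then
    echo "container $c is not running" >&2
    return 1
  fi
  nsenter --target "$pid" --net "$@"
}

# Usage: ns_net_exec my-service ss -tlnp
```

Because only the network namespace is entered, the command (`ss` here) comes from the host filesystem, so the container itself needs no tools installed.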
Namespace-specific Debugging:
[Bash code block omitted, 15 lines]

5. System Call Tracing with strace

Basic strace Usage:
[Bash code block omitted, 18 lines]
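One way to use strace without installing it in the image is to attach from the host to the container's main process. The sketch below assumes root and strace on the host; `trace_container` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Attach strace from the host to a container's main process.
# Attaching slows the traced process, so use it sparingly in production.
trace_container() {
  local c=$1 class=${2:-network}
  local pid
  pid=$(docker inspect --format '{{.State.Pid}}' "$c") || return 1
  # -f follows child processes and threads, -tt adds microsecond timestamps,
  # -e trace=<class> limits output to one syscall class (network, file, process).
  strace -f -tt -e "trace=$class" -p "$pid"
}

# Usage: trace_container my-service file
```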
Common strace Patterns:
[Bash code block omitted, 14 lines]
Debugging Startup Issues:
[Bash code block omitted, 6 lines]

6. Debugging Crashed Containers

Extract Information from Dead Containers:
[Bash code block omitted, 20 lines]
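A stopped container still holds its logs, configuration, and filesystem until it is removed. The sketch below saves all of that in one step; the function name `snapshot_container`, the output directory layout, and the `postmortem-` image prefix are assumptions.

```bash
#!/usr/bin/env bash
# Save what a stopped container can still tell us before it is removed.
snapshot_container() {
  local c=$1 out=${2:-"./postmortem-$1"}
  local tag
  # Image names must be lowercase, container names need not be.
  tag="postmortem-$(printf '%s' "$c" | tr '[:upper:]' '[:lower:]')"
  mkdir -p "$out" || return 1
  docker inspect "$c" > "$out/inspect.json"
  docker logs "$c"    > "$out/logs.txt" 2>&1
  # docker diff lists files Added (A), Changed (C) or Deleted (D) vs the image.
  docker diff "$c"    > "$out/diff.txt"
  # Freeze the filesystem as an image so it can be explored later.
  docker commit "$c" "$tag" > "$out/image-id.txt"
  echo "saved to $out"
}

# Explore the frozen filesystem without running the original entrypoint
# (assumes the image has a shell):
#   docker run --rm -it --entrypoint sh postmortem-my-service
```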
Debugging Containers That Won't Start:
[Bash code block omitted, 18 lines]
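For a container that exits immediately, two techniques usually help: replace the entrypoint with a shell so the failing command can be run by hand, or create the container without starting it and copy files out. A hedged sketch with hypothetical function names, assuming the image ships `/bin/sh` for the first helper:

```bash
#!/usr/bin/env bash
# Start an image with its entrypoint replaced by a shell. Extra arguments
# are passed to docker run (for example -e VAR=value or -v mounts).
shell_into_image() {
  local image=$1; shift
  docker run --rm -it --entrypoint /bin/sh "$@" "$image"
}

# Pull a file out of an image that never starts: create a container without
# running it, copy the file, then remove the container.
copy_from_image() {
  local image=$1 src=$2 dest=$3 id rc
  id=$(docker create "$image") || return 1
  docker cp "$id:$src" "$dest"
  rc=$?
  docker rm "$id" >/dev/null
  return "$rc"
}

# Usage:
#   shell_into_image my-image -e DEBUG=1
#   copy_from_image my-image /app/config.yml ./config.yml
```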

7. Resource Debugging

Memory Issues:
[Bash code block omitted, 16 lines]
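Memory pressure is easiest to reason about as a percentage of the limit. The sketch below reads the cgroup files from inside the container; it assumes cgroup v2 paths and that `cat` exists in the image, and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Percentage of a memory limit in use (integer, rounded down).
mem_pct() {
  local used=$1 limit=$2
  echo $(( used * 100 / limit ))
}

# Read usage and limit from inside the container. Paths assume cgroup v2;
# on cgroup v1 the files are memory/memory.usage_in_bytes and
# memory/memory.limit_in_bytes under /sys/fs/cgroup.
container_mem_pct() {
  local c=$1 used limit
  used=$(docker exec "$c" cat /sys/fs/cgroup/memory.current) || return 1
  limit=$(docker exec "$c" cat /sys/fs/cgroup/memory.max) || return 1
  # memory.max holds the literal string "max" when no limit is set.
  if [[ $limit == max ]]; then
    echo "no memory limit set"
  else
    echo "$(mem_pct "$used" "$limit")%"
  fi
}

mem_pct 1200 2000   # 60
```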
CPU Issues:
[Bash code block omitted, 15 lines]
Disk I/O Issues:
[Bash code block omitted, 11 lines]

8. Complete Debug Script

[Bash code block omitted, 97 lines]

⚠️ Common Mistakes

Mistake 1: Ignoring Exit Codes

[Bash code block omitted, 9 lines]

Mistake 2: Debugging in Production Without Caution

[Bash code block omitted, 9 lines]

Mistake 3: Not Preserving Dead Container State

[Bash code block omitted, 9 lines]

Mistake 4: Exec-ing as Wrong User

[Bash code block omitted, 8 lines]

Mistake 5: Missing Container Restart Patterns

[Bash code block omitted, 11 lines]

🐛 Debug This

You're on-call and get paged. A container keeps restarting:

[Bash code block omitted, 19 lines]

Exit code 137 but OOMKilled is false. Memory at 60% of limit. What's happening?

Analysis:
  1. Exit 137 = SIGKILL, but not from OOM killer
  2. Memory usage (1.2GB) well under limit (2GB)
  3. Logs show normal startup, no error
Possible causes of non-OOM SIGKILL:
  • Health check failing → an orchestrator or watchdog kills the container (standalone Docker only marks it unhealthy)
  • docker kill or docker stop timeout
  • Kubernetes liveness probe failing
  • External process killing the container
Investigation:
[Bash code block omitted, 16 lines]
Root Cause: The application logs "Server ready" before the socket is actually bound. The health check starts immediately, can't connect, and marks the container unhealthy; whatever acts on that status (an orchestrator or watchdog) then kills the container.
Fix: Either fix the application so it doesn't log "ready" until the socket is bound, or add a startup grace period:
[YAML code block omitted, 6 lines]
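The same grace period can also be expressed with `docker run` health-check flags. This is a sketch only: the function name, port 8080, and the `/health` endpoint are hypothetical placeholders, and the check assumes `curl` exists in the image.

```bash
#!/usr/bin/env bash
# Start a container with a health check that tolerates a slow startup.
# Failures during the start period do not count toward the retry limit.
run_with_grace_period() {
  local name=$1 image=$2
  docker run -d --name "$name" \
    --health-cmd 'curl -fsS http://localhost:8080/health || exit 1' \
    --health-interval 10s \
    --health-timeout 3s \
    --health-retries 3 \
    --health-start-period 60s \
    "$image"
}

# Usage: run_with_grace_period my-service my-image:latest
```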

💻 Exercises

Exercise 1: Basic Debug Investigation

Deploy a container with an intentional issue and debug it:

[Bash code block omitted, 4 lines]

Exercise 2: Network Debugging

Create two containers that should communicate but can't:

[Bash code block omitted, 5 lines]

Exercise 3: Memory Investigation

[Bash code block omitted, 9 lines]

Exercise 4: Deep Debug with nsenter

[Bash code block omitted, 7 lines]

Exercise 5: Post-mortem Analysis

[Bash code block omitted, 10 lines]

🎤 Interview Questions

Q1: A container exits with code 137. What does this mean and how do you investigate?

Answer: Exit code 137 indicates the process received SIGKILL (128 + 9). This typically means:
  1. OOM Killer - Most common cause:
[Bash code block omitted, 2 lines]
  2. Health check failure - an orchestrator or watchdog acting on health status kills the unhealthy container:
[Bash code block omitted, 2 lines]
  3. External kill - docker kill, orchestrator, or system process:
[Bash code block omitted, 2 lines]
  4. Timeout during stop - the container didn't exit within the grace period after SIGTERM (10 seconds by default), so docker stop sent SIGKILL:
[Bash code block omitted]

Investigation steps:

  1. Check OOMKilled flag first
  2. Check health status if configured
  3. Check docker events for who killed it
  4. Check dmesg and journalctl for system-level kills
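The four investigation steps can be sketched as a single function. The name `why_killed` is an assumption, and reading `dmesg` may require root on the host.

```bash
#!/usr/bin/env bash
# Walk the SIGKILL checklist for one container.
why_killed() {
  local c=$1
  echo "OOMKilled: $(docker inspect --format '{{.State.OOMKilled}}' "$c")"
  # .State.Health is absent when no health check is configured.
  echo "Health: $(docker inspect --format \
    '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' "$c")"
  echo "Recent kill/die/oom events:"
  # --until makes docker events print the window and exit instead of streaming.
  docker events --since 1h --until "$(date +%s)" \
    --filter "container=$c" \
    --filter event=kill --filter event=die --filter event=oom
  echo "Kernel OOM messages:"
  dmesg 2>/dev/null | grep -iE 'out of memory|oom-kill' || echo "none found"
}

# Usage: why_killed my-service
```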

Q2: How do you debug a container that uses a distroless base image?

Answer: Distroless images have no shell or debugging tools. Strategies:
  1. Sidecar debug container (production-safe):
[Bash code block omitted, 3 lines]
  2. nsenter from host (requires host access):
[Bash code block omitted, 3 lines]
  3. Debug variant images (for development):
[Dockerfile code block omitted, 2 lines]
  4. Build debug layer (for your images):
[Dockerfile code block omitted, 6 lines]
  5. Export and analyze filesystem:
[Bash code block omitted, 2 lines]

Q3: What's the difference between docker logs and checking /var/log inside the container?

Answer: They serve different purposes:
docker logs:
  • Captures stdout/stderr from PID 1 (and children if properly configured)
  • Managed by Docker logging driver
  • Persists based on log rotation settings
  • Works even after container stops
  • Example: docker logs --since 5m container
/var/log inside container:
  • Application-specific log files
  • Not captured by Docker logging
  • Lost when container is removed (unless volume mounted)
  • Requires exec or cp to access
  • Example: docker exec container cat /var/log/app/error.log
Best practices:
  1. 12-factor apps should log to stdout/stderr for Docker to capture
  2. Legacy apps writing to files need log forwarders or volume mounts
  3. Production should use centralized logging (ELK, Loki, CloudWatch)
[Bash code block omitted, 4 lines]

Q4: How would you trace why a containerized application can't connect to an external service?

Answer: Systematic network debugging approach:
[Bash code block omitted, 25 lines]
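A layered check makes the failing step obvious: resolve the name, open a TCP connection, then attempt an HTTPS request, stopping at the first failure. The sketch below assumes an HTTPS service and uses `nicolaka/netshoot` as the tool image; `check_egress` is a hypothetical name. Note that the tool container shares the target's network stack but not its CA bundle or proxy environment variables, so a pass here does not rule out those two causes.

```bash
#!/usr/bin/env bash
# Test DNS, TCP, then HTTPS from a container's network namespace, stopping
# at the first layer that fails. Return code identifies the failing layer.
check_egress() {
  local c=$1 host=$2 port=${3:-443}
  local tool=(docker run --rm --network "container:$c" nicolaka/netshoot)

  "${tool[@]}" nslookup "$host" >/dev/null \
    || { echo "FAIL dns: cannot resolve $host"; return 1; }
  "${tool[@]}" nc -z -w 5 "$host" "$port" \
    || { echo "FAIL tcp: cannot connect to $host:$port"; return 2; }
  "${tool[@]}" curl -sS -o /dev/null --max-time 10 "https://$host:$port/" \
    || { echo "FAIL tls/http: handshake or request failed"; return 3; }
  echo "OK: $host:$port reachable from $c"
}

# Usage: check_egress my-service api.example.com 443
```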

Common causes:

  • DNS resolution failure (check /etc/resolv.conf)
  • Network policy/firewall blocking egress
  • Proxy configuration missing
  • TLS/certificate issues
  • Container network mode wrong (host vs bridge)

Q5: Explain how you would perform post-mortem analysis on a container that crashed and won't restart.

Answer: Systematic post-mortem process:
[Bash code block omitted, 31 lines]

Key things to look for:

  • Exit code meaning (137=SIGKILL, 139=SIGSEGV, etc.)
  • Last log entries before crash
  • Modified files (docker diff)
  • Core dumps if segfault
  • Resource exhaustion evidence

📝 Summary & Key Takeaways

Debugging Hierarchy

  1. Level 1: logs, ps, inspect, stats - solves 90% of issues
  2. Level 2: exec, cp, diff - interactive investigation
  3. Level 3: nsenter, strace, tcpdump - deep system analysis

Exit Code Quick Reference

Code  Signal   Meaning
0     -        Normal exit
1     -        Application error
137   SIGKILL  OOM or forced kill
139   SIGSEGV  Segmentation fault
143   SIGTERM  Graceful shutdown

Essential Commands

[Bash code block omitted, 17 lines]

📋 Quick Reference

[Bash code block omitted, 22 lines]

📅 Review Schedule

Master debugging through spaced repetition:

  • Day 1: Practice logs, ps, inspect on running containers
  • Day 3: Debug intentionally broken containers
  • Day 7: Network debugging exercises
  • Day 14: nsenter and strace deep dive
  • Day 30: Full post-mortem analysis on complex failure
