Monitoring & Troubleshooting Production

📋 At a Glance

Aspect	Details
Difficulty	🟠 Advanced
Prerequisites	Part 11 (Logging), Part 13 (Debugging)
Stack	Prometheus, Grafana, Loki, cAdvisor
Time Investment	28 minutes read + 90 minutes practice
Payoff	Find issues before users report them

🎯 What You'll Learn

After this article, you'll be able to:

Set up metrics collection with Prometheus and cAdvisor
Build dashboards in Grafana for container monitoring
Implement centralized logging with Loki
Create alerts for production issues
Systematically troubleshoot production problems

🔥 Production Story: The Silent Memory Leak

The Setup: E-commerce site, holiday season. Everything running on Docker Compose. No monitoring "because we're small."

Day 1: Site runs fine.
Day 3: Occasional slow responses.
Day 5: 503 errors during peak hours.
Day 7: Complete outage at 2 AM on Black Friday.

The Investigation (too late):

BASH(8 lines)

Root Cause: Memory leak in API service. Without monitoring, nobody noticed the gradual increase until it was too late.

What Monitoring Would Have Shown:

Day 1: api memory = 400MB ✓
Day 3: api memory = 1.2GB ⚠️ Alert triggered
Day 5: api memory = 1.6GB (restart, investigate)

The Fix: Prometheus + Grafana + alerts. Now they catch issues before users do.

🧠 Mental Model: Observability Stack

┌─────────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY PYRAMID                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    DASHBOARDS                           │   │
│   │              (Grafana - Visualization)                  │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↑                                     │
│   ┌──────────────┬────────┴────────┬──────────────┐             │
│   │   METRICS    │     LOGS        │    TRACES    │             │
│   │  Prometheus  │     Loki        │   Jaeger     │             │
│   │              │                 │   (optional) │             │
│   └──────────────┴─────────────────┴──────────────┘             │
│                           ↑                                     │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    COLLECTORS                           │   │
│   │  cAdvisor     node_exporter    Promtail    App metrics  │   │
│   │  (containers) (host)           (logs)      (custom)     │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↑                                     │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    DATA SOURCES                         │   │
│   │  Docker Daemon    Host OS    Application    Network     │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Data Flow:
  Containers → cAdvisor → Prometheus → Grafana → Alerts
                                     ↓
  Logs → Promtail → Loki → Grafana

🔬 Deep Dive

1. Metrics Collection with cAdvisor and Prometheus

Docker Compose Monitoring Stack:

YAML(94 lines)

Prometheus Configuration:

YAML(40 lines)

2. Key Metrics to Monitor

Container Metrics (from cAdvisor):

PROMQL(19 lines)

Host Metrics (from node-exporter):

PROMQL(12 lines)

3. Alert Rules

YAML(82 lines)

Alertmanager Configuration:

YAML(32 lines)

4. Centralized Logging with Loki

Add Loki to Monitoring Stack:

YAML(28 lines)

Promtail Configuration:

YAML(28 lines)

Loki Configuration:

YAML(45 lines)

LogQL Queries in Grafana:

LOGQL(20 lines)

5. Grafana Dashboard for Docker

Dashboard JSON (simplified):

JSON(64 lines)

Grafana Provisioning:

YAML(15 lines)

6. Troubleshooting Runbook

Systematic Troubleshooting Process:

┌─────────────────────────────────────────────────────────────────┐
│                  TROUBLESHOOTING FLOWCHART                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   SYMPTOM DETECTED                                              │
│         │                                                       │
│         ▼                                                       │
│   ┌─────────────────┐                                           │
│   │ Check Dashboard │ → Memory, CPU, Network, Errors            │
│   └────────┬────────┘                                           │
│            │                                                    │
│            ▼                                                    │
│   ┌─────────────────┐                                           │
│   │  Check Logs     │ → Recent errors, patterns                 │
│   └────────┬────────┘                                           │
│            │                                                    │
│            ▼                                                    │
│   ┌─────────────────┐                                           │
│   │ Check Resources │ → docker stats, df -h, free               │
│   └────────┬────────┘                                           │
│            │                                                    │
│            ▼                                                    │
│   ┌─────────────────┐                                           │
│   │ Deep Dive       │ → exec, inspect, strace                   │
│   └────────┬────────┘                                           │
│            │                                                    │
│            ▼                                                    │
│   ┌─────────────────┐                                           │
│   │ Take Action     │ → Restart, Scale, Rollback                │
│   └─────────────────┘                                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
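The middle steps of the flowchart (logs, resources, deep dive) could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names `section` and `diagnose` are hypothetical, and the 50-line, 15-minute log window is an arbitrary choice. Every command is read-only, so it should be safe to run against a production host.

```shell
# Hypothetical helper: print a section header.
section() {
  printf '=== %s ===\n' "$1"
}

# Hypothetical helper: run read-only checks for one container.
# Usage: diagnose CONTAINER
diagnose() {
  container="${1:-}"
  if [ -z "$container" ]; then
    echo "usage: diagnose CONTAINER" >&2
    return 2
  fi

  section "Container state"
  docker ps -a --filter "name=${container}" --format '{{.Names}} {{.Status}}'

  section "Recent logs"
  docker logs --tail 50 --since 15m "$container" 2>&1

  section "Resources"
  docker stats --no-stream "$container"
  df -h /
  free -m

  section "Restart and OOM state"
  docker inspect \
    --format 'restarts={{.RestartCount}} oom_killed={{.State.OOMKilled}} exit={{.State.ExitCode}}' \
    "$container"
}
```

Calling `diagnose api` would print each section in flowchart order for a container named `api` (an assumed name). The final "Take Action" step is left out on purpose, since restarting or rolling back should stay a human decision.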

Quick Diagnosis Script:

BASH(37 lines)

7. Common Issues and Solutions

Issue: Container OOM Killed

BASH(13 lines)

Issue: Container Restarting Loop

BASH(18 lines)

Issue: High CPU Usage

BASH(21 lines)

Issue: Network Connectivity

BASH(19 lines)

8. Production Monitoring Checklist

YAML(44 lines)

⚠️ Common Mistakes

Mistake 1: No Alerts Until Outage

YAML(7 lines)

Mistake 2: Alert Fatigue

YAML(9 lines)

Mistake 3: No Log Retention Policy

YAML(14 lines)

Mistake 4: Missing Context in Logs

JAVASCRIPT(12 lines)

Mistake 5: No Baseline

YAML(8 lines)

🐛 Debug This

Production alert: "Container api memory > 90%". Dashboard shows memory climbing steadily for the past 3 days.

Day 1: 40% → Day 2: 60% → Day 3: 80% → Now: 92%

No code changes were deployed. What's happening and how do you fix it?


Diagnosis: This is a classic memory leak pattern - steady growth without code changes means:

Application-level memory leak
Connection leak (database, redis, etc.)
Cache without eviction
Event listeners not cleaned up

Investigation steps:

BASH(19 lines)

Common fixes:

Restart immediately to restore service:
```
BASH
```
Investigate root cause:
- Add memory profiling to application
- Check for unclosed database connections
- Look for growing arrays/objects
- Check event listener cleanup

Add protection:

YAML(7 lines)

Long-term fix: Profile application, find and fix the leak.
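To judge how urgent the restart is, the readings in the alert can be turned into a rough time-to-limit estimate. The sketch below assumes linear growth of about 20 percentage points per day (taken from the Day 1 to Day 3 readings) and uses integer arithmetic. The function name `hours_until_limit` is hypothetical.

```shell
# Hypothetical helper: estimate whole hours until memory reaches 100% of the limit.
# Usage: hours_until_limit CURRENT_PERCENT GROWTH_PERCENT_PER_DAY
# Assumes growth stays linear; the result is rounded down.
hours_until_limit() {
  current="$1"
  per_day="$2"
  if [ "$per_day" -le 0 ]; then
    echo "growth rate must be positive" >&2
    return 1
  fi
  if [ "$current" -ge 100 ]; then
    echo 0
    return 0
  fi
  echo $(( (100 - current) * 24 / per_day ))
}

hours_until_limit 92 20   # prints 9
```

With only about nine hours of headroom under these assumptions, restarting first and profiling afterwards is the safer order.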

💻 Exercises

Exercise 1: Basic Monitoring Stack

Set up Prometheus + Grafana + cAdvisor for a simple web application. Create a dashboard showing:

Container CPU usage
Container memory usage
Network I/O

Exercise 2: Alerting

Add Alertmanager and configure alerts for:

Container memory > 85%
Container restarting > 3 times/hour
Host disk space < 10%

Exercise 3: Centralized Logging

Add Loki and Promtail. Create LogQL queries for:

All errors from a specific service
Request latency > 1s
Count of errors per minute

Exercise 4: Custom Metrics

Add application metrics (request count, latency histogram) and:

Expose them on /metrics endpoint
Scrape with Prometheus
Create dashboard panel

Exercise 5: Runbook

Create a troubleshooting runbook for:

High memory usage
Service unreachable
High error rate

Include detection, diagnosis, and remediation steps.

🎤 Interview Questions

Q1: How do you monitor Docker containers in production?

Answer: A complete monitoring stack includes:

Metrics Collection:
- cAdvisor for container metrics (CPU, memory, network, disk)
- node-exporter for host metrics
- Application metrics exposed on /metrics
Storage and Querying:
- Prometheus for metrics
- Loki for logs
- Retention policies for both
Visualization:
- Grafana dashboards
- Key panels: resource usage, error rates, latency
Alerting:
- Prometheus alerting rules
- Alertmanager for routing
- Slack/PagerDuty integration

Key metrics to watch:

Memory usage % of limit (alert at 85%)
Container restart count
Error rate and latency
Disk space

Q2: How do you troubleshoot a container that's using high CPU?

Answer: Systematic approach:

Identify the container:
```
BASH
```

Check what process is consuming CPU:

BASH(2 lines)

Analyze the application:

BASH(4 lines)

Check for patterns:
- Spike vs sustained high CPU
- Correlation with traffic
- Recent changes
Take action:
- Scale horizontally if legitimate load
- Add CPU limits to prevent hogging
- Fix code if inefficiency found
- Rollback if recent deployment caused it
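The "identify the container" step can be sketched as a small filter that ranks containers by CPU. The function name `top_cpu` is hypothetical. It expects one `name cpu%` pair per line on standard input, which is what `docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}'` should produce.

```shell
# Hypothetical helper: print the N containers with the highest CPU percentage.
# Input: lines of "NAME CPU%" on stdin. Usage: top_cpu [N]   (default N=5)
top_cpu() {
  n="${1:-5}"
  # Numeric reverse sort on the second field; -n reads the number before "%".
  LC_ALL=C sort -k2,2nr | head -n "$n"
}

# Example with made-up sample data instead of a live Docker daemon:
printf 'api 85.20%%\ndb 3.10%%\nweb 12.50%%\n' | top_cpu 2
```

Against a live host the same function would be fed from Docker directly, for example `docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}' | top_cpu 3`.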

Q3: What's the difference between container memory usage and the memory limit?

Answer:

Memory Usage (container_memory_usage_bytes):

Current memory used by the container
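Includes RSS and page cache, so it can overstate real memory pressure (container_memory_working_set_bytes excludes inactive file cache)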

Memory Limit (container_spec_memory_limit_bytes):

Maximum allowed by cgroup
Set via --memory or compose deploy.resources.limits.memory

Why this matters:

PROMQL(2 lines)

When usage approaches limit:

70-85%: Warning zone
85-95%: Critical, action needed
>95%: OOM kill imminent

Best practice: Alert on percentage, not absolute value. A container using 500MB might be fine (if limit is 2GB) or critical (if limit is 512MB).
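The percentage rule and the zones above can be sketched as two small shell helpers. The names `mem_percent` and `mem_zone` are hypothetical, and the byte values in the usage lines are illustrative inputs. In practice the values would come from container_memory_usage_bytes and container_spec_memory_limit_bytes.

```shell
# Hypothetical helper: integer percent of the limit in use (rounded down).
# Usage: mem_percent USAGE_BYTES LIMIT_BYTES
mem_percent() {
  usage="$1"
  limit="$2"
  if [ "$limit" -le 0 ]; then
    # Guard against division by zero when no usable limit is known.
    echo "limit must be greater than 0" >&2
    return 1
  fi
  echo $(( usage * 100 / limit ))
}

# Hypothetical helper: map a percentage to the zones listed above.
# Usage: mem_zone PERCENT
mem_zone() {
  pct="$1"
  if [ "$pct" -gt 95 ]; then
    echo "oom-imminent"
  elif [ "$pct" -ge 85 ]; then
    echo "critical"
  elif [ "$pct" -ge 70 ]; then
    echo "warning"
  else
    echo "ok"
  fi
}

# The same 500MB of usage lands in very different zones:
mem_zone "$(mem_percent 524288000 2147483648)"   # 2GB limit   -> ok
mem_zone "$(mem_percent 524288000 536870912)"    # 512MB limit -> oom-imminent
```

The boundary values are a judgment call in this sketch: exactly 85% counts as critical, and exactly 95% stays critical until usage exceeds it.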

Q4: How do you implement log aggregation for Docker containers?

Answer: Using Loki stack:

Deploy Loki (log database):

YAML(5 lines)

Deploy Promtail (log collector):

YAML(5 lines)

Configure discovery:

YAML(4 lines)

Query in Grafana:

LOGQL(2 lines)

Key considerations:

Set retention limits explicitly (for example, 7 days); Loki keeps logs indefinitely unless retention is configured
Index only necessary labels
Use JSON logging for structured queries
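Outside Grafana, the same kind of query can be sent to Loki's HTTP API from a shell, which is handy during an incident. The sketch below makes several assumptions: Loki is reachable at `http://localhost:3100`, Promtail attaches a `container` label (this depends entirely on your Promtail configuration), and the function names are hypothetical.

```shell
# Assumed Loki address; override with the LOKI_URL environment variable.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Hypothetical helper: LogQL selecting lines that contain "error" for one service.
# Assumes a `container` label exists on the log streams.
error_selector() {
  printf '{container="%s"} |= "error"' "$1"
}

# Hypothetical helper: LogQL metric query counting those lines per minute.
errors_per_minute() {
  printf 'sum(count_over_time(%s [1m]))' "$(error_selector "$1")"
}

# Hypothetical helper: fetch recent matching lines as raw JSON from Loki.
# Usage: recent_errors SERVICE [LIMIT]
recent_errors() {
  curl -sG "${LOKI_URL}/loki/api/v1/query_range" \
    --data-urlencode "query=$(error_selector "$1")" \
    --data-urlencode "limit=${2:-20}"
}

errors_per_minute api
```

Keeping the selector in one function means the label name only has to change in one place if your streams are labeled differently.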

Q5: How do you prevent alert fatigue while still catching real issues?

Answer: Strategic alert configuration:

Meaningful thresholds:

YAML(6 lines)

Appropriate severity levels:
- Critical: Pages on-call immediately
- Warning: Slack notification, review next day
- Info: Dashboard only

Alert on symptoms, not causes:

YAML(7 lines)

Group related alerts:

YAML(3 lines)

Regular review: If an alert never fires, consider removing it. If it fires daily and is ignored, fix the threshold or delete the alert.

📝 Summary & Key Takeaways

Monitoring Stack

Metrics: Prometheus + cAdvisor
Logs: Loki + Promtail
Visualization: Grafana
Alerts: Alertmanager

Key Metrics

Metric	Warning	Critical
Memory %	85%	95%
CPU %	80%	90%
Disk %	80%	90%
Restart count	3/hour	5/hour

Troubleshooting Flow

Check dashboards
Check logs
Check resources
Deep dive (exec, inspect)
Take action (restart, scale, rollback)

📋 Quick Reference

BASH(13 lines)

PROMQL(4 lines)

📅 Review Schedule

Day 1: Set up Prometheus + Grafana + cAdvisor
Day 3: Create basic dashboard
Day 7: Add alerting rules
Day 14: Add centralized logging
Day 30: Build complete runbook

Previous: Part 18 - Production Deployment Patterns
Next: Part 20 - Docker Cheatsheet & Decision Guide
Index: Docker Compendium Series