Monitoring & Troubleshooting Production
📋 At a Glance
| Aspect | Details |
|---|---|
| Difficulty | 🟠 Advanced |
| Prerequisites | Part 11 (Logging), Part 13 (Debugging) |
| Stack | Prometheus, Grafana, Loki, cAdvisor |
| Time Investment | 28 minutes read + 90 minutes practice |
| Payoff | Find issues before users report them |
🎯 What You'll Learn
After this article, you'll be able to:
- Set up metrics collection with Prometheus and cAdvisor
- Build dashboards in Grafana for container monitoring
- Implement centralized logging with Loki
- Create alerts for production issues
- Systematically troubleshoot production problems
🔥 Production Story: The Silent Memory Leak
The Setup: E-commerce site, holiday season. Everything running on Docker Compose. No monitoring "because we're small."
Day 1: Site runs fine.
Day 3: Occasional slow responses.
Day 5: 503 errors during peak hours.
Day 7: Complete outage at 2 AM. Black Friday.
The Investigation (too late):
[Code block omitted: Bash, 8 lines]
Root Cause: Memory leak in API service. Without monitoring, nobody noticed the gradual increase until it was too late.
What Monitoring Would Have Shown:
Day 1: api memory = 400MB ✓
Day 3: api memory = 1.2GB ⚠️ Alert triggered
Day 5: api memory = 1.6GB (restart, investigate)
The Fix: Prometheus + Grafana + alerts. Now they catch issues before users do.
🧠 Mental Model: Observability Stack
OBSERVABILITY PYRAMID

┌─────────────────────────────────────────────────────────┐
│                       DASHBOARDS                        │
│                (Grafana - Visualization)                │
└─────────────────────────────────────────────────────────┘
                             ↑
┌──────────────────┬───────────────────┬──────────────────┐
│     METRICS      │       LOGS        │      TRACES      │
│    Prometheus    │       Loki        │      Jaeger      │
│                  │                   │    (optional)    │
└──────────────────┴───────────────────┴──────────────────┘
                             ↑
┌─────────────────────────────────────────────────────────┐
│                       COLLECTORS                        │
│  cAdvisor      node_exporter    Promtail    App metrics │
│(containers)       (host)         (logs)      (custom)   │
└─────────────────────────────────────────────────────────┘
                             ↑
┌─────────────────────────────────────────────────────────┐
│                      DATA SOURCES                       │
│  Docker Daemon     Host OS     Application     Network  │
└─────────────────────────────────────────────────────────┘

Data Flow:
Containers → cAdvisor → Prometheus → Grafana → Alerts
    ↓
  Logs → Promtail → Loki → Grafana
🔬 Deep Dive
1. Metrics Collection with cAdvisor and Prometheus
Docker Compose Monitoring Stack:
[Code block omitted: YAML, 94 lines]
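As a rough illustration rather than a complete stack, a Compose file for the four core services could look like the sketch below. The file paths, volume names, the admin password, and the pinned cAdvisor tag are assumptions; check image tags against what you actually run.

```yaml
# Minimal monitoring stack sketch. Paths, volume names, and the
# Grafana password are placeholders, not values from this article.
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d   # keep 15 days of metrics
    ports:
      - "9090:9090"

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2   # assumed tag; pin one you have verified
    volumes:
      # Read-only host mounts cAdvisor uses to read container stats
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    # No published port: Prometheus reaches it as cadvisor:8080 on the Compose network

  node-exporter:
    image: prom/node-exporter:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --path.rootfs=/rootfs

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=change-me   # placeholder
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"

volumes:
  prometheus_data:
  grafana_data:
```

Only Prometheus and Grafana publish ports here; the exporters stay internal so their metrics are not exposed on the host.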
Prometheus Configuration:
[Code block omitted: YAML, 40 lines]
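A minimal prometheus.yml sketch, assuming the Compose service names cadvisor, node-exporter, and alertmanager and a rule file mounted at /etc/prometheus/alerts.yml (all of these names are assumptions):

```yaml
global:
  scrape_interval: 15s       # how often targets are scraped
  evaluation_interval: 15s   # how often alert rules are evaluated

rule_files:
  - /etc/prometheus/alerts.yml   # assumed mount path

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: cadvisor        # container metrics
    static_configs:
      - targets: ["cadvisor:8080"]

  - job_name: node-exporter   # host metrics
    static_configs:
      - targets: ["node-exporter:9100"]
```

After starting the stack, the Targets page in the Prometheus UI should list each job as UP; a DOWN target usually means a wrong service name or port.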
2. Key Metrics to Monitor
Container Metrics (from cAdvisor):
[Code block omitted: PromQL, 19 lines]
Host Metrics (from node-exporter):
[Code block omitted: PromQL, 12 lines]
3. Alert Rules
[Code block omitted: YAML, 82 lines]
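A short sketch of what a rule file can contain, covering restarts, CPU, and host disk space. The thresholds, durations, and group name are illustrative assumptions; tune them against your own baseline.

```yaml
groups:
  - name: container-and-host-alerts   # assumed group name
    rules:
      - alert: ContainerRestarting
        # changes() counts how often the start time changed, i.e. restarts in the last hour
        expr: changes(container_start_time_seconds{name!=""}[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} restarted more than 3 times in 1h"

      - alert: ContainerHighCPU
        # rate() of CPU seconds is cores used; * 100 gives percent of one core
        expr: sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} CPU above 80% of one core for 10m"

      - alert: HostDiskSpaceLow
        # Percentage of free space per filesystem, ignoring tmpfs and overlay mounts
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 10
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Less than 10% disk space left on {{ $labels.mountpoint }}"
```

The `for:` clause keeps short spikes from firing; an alert only goes out once the condition has held for the whole window.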
Alertmanager Configuration:
[Code block omitted: YAML, 32 lines]
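A minimal Alertmanager sketch that groups alerts and routes by severity. The receiver names, the Slack webhook URL, the channel, and the PagerDuty key are placeholders, not real values.

```yaml
route:
  receiver: slack-default                 # fallback for everything not matched below
  group_by: ["alertname", "severity"]     # bundle related alerts into one notification
  group_wait: 30s                         # wait for more alerts before the first send
  group_interval: 5m
  repeat_interval: 4h                     # re-notify on still-firing alerts
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall

receivers:
  - name: slack-default
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/ME"   # placeholder
        channel: "#alerts"                                        # placeholder
        send_resolved: true

  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "REPLACE_ME"   # placeholder integration key
```

Routing only critical alerts to the pager and everything else to chat is one simple way to keep on-call noise down.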
4. Centralized Logging with Loki
Add Loki to Monitoring Stack:
[Code block omitted: YAML, 28 lines]
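As a sketch, the two logging services could be added to the same Compose file as below. It assumes the Loki image's bundled default config (which is expected to store data under /loki) and a Promtail config at ./promtail/config.yml; both are assumptions to verify for your image version.

```yaml
services:
  loki:
    image: grafana/loki:latest
    command: -config.file=/etc/loki/local-config.yaml   # config shipped in the image
    volumes:
      - loki_data:/loki        # assumes the default config writes under /loki
    ports:
      - "3100:3100"

  promtail:
    image: grafana/promtail:latest
    command: -config.file=/etc/promtail/config.yml
    volumes:
      - ./promtail/config.yml:/etc/promtail/config.yml:ro   # assumed local path
      - /var/run/docker.sock:/var/run/docker.sock:ro        # for container discovery
    depends_on:
      - loki

volumes:
  loki_data:
```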
Promtail Configuration:
[Code block omitted: YAML, 28 lines]
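A minimal Promtail sketch using Docker service discovery, assuming Loki is reachable as loki:3100 and the Docker socket is mounted into the Promtail container:

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml   # where Promtail remembers how far it has read

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Docker reports names as "/api"; strip the slash and expose it as a label
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: container
```

Keeping the label set this small is deliberate: every extra label creates more streams in Loki, which raises index size and query cost.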
Loki Configuration:
[Code block omitted: YAML, 45 lines]
LogQL Queries in Grafana:
[Code block omitted: LogQL, 20 lines]
5. Grafana Dashboard for Docker
Dashboard JSON (simplified):
[Code block omitted: JSON, 64 lines]
Grafana Provisioning:
[Code block omitted: YAML, 15 lines]
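A minimal data source provisioning sketch, assuming the file is mounted under /etc/grafana/provisioning/datasources/ and that the services are named prometheus and loki:

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend makes the requests
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```

Provisioning from a file means a rebuilt Grafana container comes up with its data sources already wired, with no manual clicks.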
6. Troubleshooting Runbook
Systematic Troubleshooting Process:
TROUBLESHOOTING FLOWCHART

 SYMPTOM DETECTED
         │
         ▼
┌─────────────────┐
│ Check Dashboard │ → Memory, CPU, Network, Errors
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Check Logs    │ → Recent errors, patterns
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Check Resources │ → docker stats, df -h, free
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Deep Dive    │ → exec, inspect, strace
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Take Action   │ → Restart, Scale, Rollback
└─────────────────┘
Quick Diagnosis Script:
[Code block omitted: Bash, 37 lines]
7. Common Issues and Solutions
Issue: Container OOM Killed
[Code block omitted: Bash, 13 lines]
Issue: Container Restarting Loop
[Code block omitted: Bash, 18 lines]
Issue: High CPU Usage
[Code block omitted: Bash, 21 lines]
Issue: Network Connectivity
[Code block omitted: Bash, 19 lines]
8. Production Monitoring Checklist
[Code block omitted: YAML, 44 lines]
⚠️ Common Mistakes
Mistake 1: No Alerts Until Outage
[Code block omitted: YAML, 7 lines]
Mistake 2: Alert Fatigue
[Code block omitted: YAML, 9 lines]
Mistake 3: No Log Retention Policy
[Code block omitted: YAML, 14 lines]
Mistake 4: Missing Context in Logs
[Code block omitted: JavaScript, 12 lines]
Mistake 5: No Baseline
[Code block omitted: YAML, 8 lines]
🐛 Debug This
Production alert: "Container api memory > 90%". Dashboard shows memory climbing steadily for the past 3 days.
Day 1: 40% → Day 2: 60% → Day 3: 80% → Now: 92%
No code changes were deployed. What's happening and how do you fix it?
Diagnosis:
This is a classic memory leak pattern - steady growth without code changes means:
- Application-level memory leak
- Connection leak (database, redis, etc.)
- Cache without eviction
- Event listeners not cleaned up
Investigation steps:
[Code block omitted: Bash, 19 lines]
Common fixes:
- Restart immediately to restore service: [Code block omitted: Bash]
- Investigate root cause:
  - Add memory profiling to application
  - Check for unclosed database connections
  - Look for growing arrays/objects
  - Check event listener cleanup
- Add protection: [Code block omitted: YAML, 7 lines]
- Long-term fix: Profile application, find and fix the leak.
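The "add protection" step can be sketched in Compose as a memory limit plus a restart policy, so a leaking container is killed and restarted at its limit instead of starving the host. The service name, image, and the 2G limit are hypothetical values.

```yaml
services:
  api:                          # hypothetical service name
    image: example/api:1.0      # hypothetical image
    restart: unless-stopped     # bring the container back after an OOM kill
    deploy:
      resources:
        limits:
          memory: 2G            # hard cap enforced through the container's cgroup
```

This contains the damage but does not fix the leak. Older docker-compose v1 setups may need `mem_limit` or the `--compatibility` flag for the limit to apply.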
💻 Exercises
Exercise 1: Basic Monitoring Stack
Set up Prometheus + Grafana + cAdvisor for a simple web application. Create a dashboard showing:
- Container CPU usage
- Container memory usage
- Network I/O
Exercise 2: Alerting
Add Alertmanager and configure alerts for:
- Container memory > 85%
- Container restarting > 3 times/hour
- Host disk space < 10%
Exercise 3: Centralized Logging
Add Loki and Promtail. Create LogQL queries for:
- All errors from a specific service
- Request latency > 1s
- Count of errors per minute
Exercise 4: Custom Metrics
Add application metrics (request count, latency histogram) and:
- Expose them on /metrics endpoint
- Scrape with Prometheus
- Create dashboard panel
Exercise 5: Runbook
Create a troubleshooting runbook for:
- High memory usage
- Service unreachable
- High error rate

Include detection, diagnosis, and remediation steps.
🎤 Interview Questions
Q1: How do you monitor Docker containers in production?
Answer: A complete monitoring stack includes:
- Metrics Collection:
  - cAdvisor for container metrics (CPU, memory, network, disk)
  - node-exporter for host metrics
  - Application metrics exposed on /metrics
- Storage and Querying:
  - Prometheus for metrics
  - Loki for logs
  - Retention policies for both
- Visualization:
  - Grafana dashboards
  - Key panels: resource usage, error rates, latency
- Alerting:
  - Prometheus alerting rules
  - Alertmanager for routing
  - Slack/PagerDuty integration
Key metrics to watch:
- Memory usage % of limit (alert at 85%)
- Container restart count
- Error rate and latency
- Disk space
Q2: How do you troubleshoot a container that's using high CPU?
Answer: Systematic approach:
- Identify the container: [Code block omitted: Bash]
- Check what process is consuming CPU: [Code block omitted: Bash, 2 lines]
- Analyze the application: [Code block omitted: Bash, 4 lines]
- Check for patterns:
  - Spike vs sustained high CPU
  - Correlation with traffic
  - Recent changes
- Take action:
  - Scale horizontally if legitimate load
  - Add CPU limits to prevent hogging
  - Fix code if inefficiency found
  - Rollback if recent deployment caused it
Q3: What's the difference between container memory usage and the memory limit?
Answer:
Memory Usage (container_memory_usage_bytes):
- Current memory used by the container
- Includes RSS and page cache, so reclaimable cache can inflate it
Memory Limit (container_spec_memory_limit_bytes):
- Maximum allowed by the cgroup
- Set via --memory or, in Compose, deploy.resources.limits.memory
Why this matters:
[Code block omitted: PromQL, 2 lines]
When usage approaches limit:
- 70-85%: Warning zone
- 85-95%: Critical, action needed
- >95%: OOM kill imminent
Best practice: Alert on percentage, not absolute value. A container using 500MB might be fine (if limit is 2GB) or critical (if limit is 512MB).
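Percentage-based alerting can be sketched as two Prometheus rules that follow the warning and critical zones listed above. The rule names and `for:` durations are assumptions, and the `> 0` filter relies on cAdvisor typically reporting a limit of 0 for containers that have none.

```yaml
groups:
  - name: container-memory   # assumed group name
    rules:
      - alert: ContainerMemoryWarning
        # Usage as a percentage of the limit. The "> 0" filter drops
        # containers without a limit, which would otherwise divide by zero.
        expr: |
          (container_memory_usage_bytes{name!=""}
            / (container_spec_memory_limit_bytes{name!=""} > 0)) * 100 > 70
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} memory above 70% of its limit"

      - alert: ContainerMemoryCritical
        expr: |
          (container_memory_usage_bytes{name!=""}
            / (container_spec_memory_limit_bytes{name!=""} > 0)) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} memory above 85% of its limit"
```

Because the expression is a ratio, the same rule covers a 512MB container and a 2GB container without per-service thresholds.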
Q4: How do you implement log aggregation for Docker containers?
Answer: Using Loki stack:
- Deploy Loki (log database): [Code block omitted: YAML, 5 lines]
- Deploy Promtail (log collector): [Code block omitted: YAML, 5 lines]
- Configure discovery: [Code block omitted: YAML, 4 lines]
- Query in Grafana: [Code block omitted: LogQL, 2 lines]
Key considerations:
- Set retention explicitly (for example, 7 days); Loki keeps logs indefinitely unless retention is enabled
- Index only necessary labels
- Use JSON logging for structured queries
Q5: How do you prevent alert fatigue while still catching real issues?
Answer: Strategic alert configuration:
- Meaningful thresholds: [Code block omitted: YAML, 6 lines]
- Appropriate severity levels:
  - Critical: Pages on-call immediately
  - Warning: Slack notification, review next day
  - Info: Dashboard only
- Alert on symptoms, not causes: [Code block omitted: YAML, 7 lines]
- Group related alerts: [Code block omitted: YAML, 3 lines]
- Regular review: If an alert never fires, consider removing. If it fires daily and is ignored, fix threshold.
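A symptom-based alert with a sustained-duration threshold might look like the sketch below. It assumes the application exports a counter named http_requests_total with a status label; that metric, the 5% threshold, and the 10-minute window are hypothetical and not taken from this article.

```yaml
groups:
  - name: symptom-alerts   # assumed group name
    rules:
      - alert: HighErrorRate
        # Symptom: the share of requests answered with 5xx, whatever the cause.
        # http_requests_total and its "status" label are assumed app metrics.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m              # must hold for 10 minutes, so brief blips stay quiet
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of requests are failing with 5xx"
```

One alert on the user-visible symptom replaces several cause-level alerts (CPU, connections, queue depth) that would otherwise fire together.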
📝 Summary & Key Takeaways
Monitoring Stack
- Metrics: Prometheus + cAdvisor
- Logs: Loki + Promtail
- Visualization: Grafana
- Alerts: Alertmanager
Key Metrics
| Metric | Warning | Critical |
|---|---|---|
| Memory % | 85% | 95% |
| CPU % | 80% | 90% |
| Disk % | 80% | 90% |
| Restart count | 3/hour | 5/hour |
Troubleshooting Flow
- Check dashboards
- Check logs
- Check resources
- Deep dive (exec, inspect)
- Take action (restart, scale, rollback)
📋 Quick Reference
[Code block omitted: Bash, 13 lines]
[Code block omitted: PromQL, 4 lines]
📅 Review Schedule
- Day 1: Set up Prometheus + Grafana + cAdvisor
- Day 3: Create basic dashboard
- Day 7: Add alerting rules
- Day 14: Add centralized logging
- Day 30: Build complete runbook