Monitoring & Troubleshooting Production

📋 At a Glance

Aspect            Details
Difficulty        🟠 Advanced
Prerequisites     Part 11 (Logging), Part 13 (Debugging)
Stack             Prometheus, Grafana, Loki, cAdvisor
Time Investment   28-minute read + 90 minutes of practice
Payoff            Find issues before users report them

🎯 What You'll Learn

After this article, you'll be able to:

  1. Set up metrics collection with Prometheus and cAdvisor
  2. Build dashboards in Grafana for container monitoring
  3. Implement centralized logging with Loki
  4. Create alerts for production issues
  5. Systematically troubleshoot production problems

🔥 Production Story: The Silent Memory Leak

The Setup: E-commerce site, holiday season. Everything running on Docker Compose. No monitoring "because we're small."
Timeline:
  • Day 1: Site runs fine.
  • Day 3: Occasional slow responses.
  • Day 5: 503 errors during peak hours.
  • Day 7: Complete outage at 2 AM on Black Friday.
The Investigation (too late):
[Bash code block, 8 lines]
Root Cause: Memory leak in API service. Without monitoring, nobody noticed the gradual increase until it was too late.
What Monitoring Would Have Shown:
Day 1: api memory = 400MB ✓
Day 3: api memory = 1.2GB ⚠️ Alert triggered
Day 5: api memory = 1.6GB (restart, investigate)
The Fix: Prometheus + Grafana + alerts. Now they catch issues before users do.

🧠 Mental Model: Observability Stack

┌─────────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY PYRAMID                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    DASHBOARDS                           │   │
│   │              (Grafana - Visualization)                  │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↑                                     │
│   ┌──────────────┬────────┴────────┬──────────────┐             │
│   │   METRICS    │     LOGS        │    TRACES    │             │
│   │  Prometheus  │     Loki        │   Jaeger     │             │
│   │              │                 │   (optional) │             │
│   └──────────────┴─────────────────┴──────────────┘             │
│                           ↑                                     │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    COLLECTORS                           │   │
│   │  cAdvisor     node_exporter    Promtail    App metrics  │   │
│   │  (containers) (host)           (logs)      (custom)     │   │
│   └─────────────────────────────────────────────────────────┘   │
│                           ↑                                     │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    DATA SOURCES                         │   │
│   │  Docker Daemon    Host OS    Application    Network     │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Data Flow:
  Metrics: Containers → cAdvisor → Prometheus → Grafana
  Alerts:  Prometheus (alert rules) → Alertmanager → Slack/PagerDuty
  Logs:    Containers → Promtail → Loki → Grafana

🔬 Deep Dive

1. Metrics Collection with cAdvisor and Prometheus

Docker Compose Monitoring Stack:
[YAML code block, 94 lines]
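A minimal sketch of such a stack, assuming the four services reach each other by service name on the default Compose network. The file name, volume names, `latest` tags, and the Grafana admin password are placeholders, not the article's own values; pin image versions before using anything like this in production.

```yaml
# docker-compose.monitoring.yml (hypothetical file name)
services:
  prometheus:
    image: prom/prometheus:latest          # placeholder tag, pin a version
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d  # illustrative retention window
    ports:
      - "9090:9090"

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:                               # read-only host mounts cAdvisor inspects
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

  node-exporter:
    image: prom/node-exporter:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --path.rootfs=/rootfs

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=change-me   # placeholder secret
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"

volumes:
  prometheus_data:
  grafana_data:
```

Only Prometheus and Grafana publish ports here; cAdvisor and node-exporter are scraped over the internal network, which keeps their endpoints off the public interface.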
Prometheus Configuration:
[YAML code block, 40 lines]
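As a rough illustration, a Prometheus configuration for this kind of stack might look like the following. The target host names assume Compose services called `cadvisor`, `node-exporter`, and `alertmanager`, and the rules directory is an assumed mount point.

```yaml
# prometheus.yml (sketch; service names and paths are assumptions)
global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 15s    # how often alert and recording rules run

rule_files:
  - /etc/prometheus/rules/*.yml   # assumed location of rule files

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: cadvisor            # container metrics
    static_configs:
      - targets: ["cadvisor:8080"]

  - job_name: node-exporter       # host metrics
    static_configs:
      - targets: ["node-exporter:9100"]
```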

2. Key Metrics to Monitor

Container Metrics (from cAdvisor):
[PromQL code block, 19 lines]
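To make the container queries concrete, here is a sketch that wraps three common cAdvisor expressions in a Prometheus recording-rule file. The group and rule names are illustrative; the `name!=""` matcher assumes you only care about named Docker containers.

```yaml
groups:
  - name: container_recording_rules   # illustrative group name
    rules:
      # Memory in use as a percentage of the cgroup limit.
      # The "> 0" filter drops containers that have no limit set.
      - record: container:memory_usage:percent_of_limit
        expr: |
          100 * container_memory_usage_bytes{name!=""}
            / (container_spec_memory_limit_bytes{name!=""} > 0)

      # CPU cores consumed per container, averaged over 5 minutes.
      - record: container:cpu_usage:cores
        expr: sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))

      # Inbound network throughput per container in bytes per second.
      - record: container:network_receive:bytes_per_second
        expr: sum by (name) (rate(container_network_receive_bytes_total{name!=""}[5m]))
```

The same expressions can be pasted directly into the Prometheus or Grafana query editor; recording them just makes dashboards cheaper to render.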
Host Metrics (from node-exporter):
[PromQL code block, 12 lines]
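For the host side, a comparable sketch using standard node-exporter metrics. The rule names and the filesystem-type filter are assumptions chosen for illustration.

```yaml
groups:
  - name: host_recording_rules        # illustrative group name
    rules:
      # Percentage of CPU time not spent idle, per host.
      - record: instance:cpu_usage:percent
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

      # Percentage of memory that is not available to new workloads.
      - record: instance:memory_usage:percent
        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

      # Percentage of each real filesystem that is used.
      - record: instance:disk_usage:percent
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})
```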

3. Alert Rules

[YAML code block, 82 lines]
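A small sketch of what such a rule file could contain, covering memory, restarts, and disk. Thresholds follow the numbers used elsewhere in this article; the alert names, `for` durations, and the use of `container_start_time_seconds` to detect restarts are assumptions to adapt.

```yaml
groups:
  - name: container_alerts            # illustrative group name
    rules:
      - alert: ContainerMemoryHigh
        expr: |
          100 * container_memory_usage_bytes{name!=""}
            / (container_spec_memory_limit_bytes{name!=""} > 0) > 85
        for: 5m                       # must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: 'Container {{ $labels.name }} is above 85% of its memory limit'
          description: 'Current value: {{ printf "%.1f" $value }}%'

      - alert: ContainerRestartingOften
        # Counts how often the start timestamp changed in the last hour.
        expr: changes(container_start_time_seconds{name!=""}[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: 'Container {{ $labels.name }} restarted more than 3 times in 1h'

      - alert: HostDiskSpaceLow
        expr: |
          100 * node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 10
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: 'Less than 10% disk space left on {{ $labels.instance }} ({{ $labels.mountpoint }})'
```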
Alertmanager Configuration:
[YAML code block, 32 lines]
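One possible shape for the Alertmanager side, assuming alerts carry a `severity` label of `warning` or `critical`. The receiver names, Slack webhook URL, channel, and PagerDuty key are placeholders.

```yaml
# alertmanager.yml (sketch; all credentials are placeholders)
route:
  receiver: slack-warnings            # default receiver
  group_by: ['alertname']             # batch alerts with the same name
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: pagerduty-oncall      # critical alerts page the on-call

receivers:
  - name: slack-warnings
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME
        channel: '#alerts'
        send_resolved: true

  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_ME

inhibit_rules:
  # While a critical alert fires, mute the matching warning for the same container.
  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    equal: ['alertname', 'name']
```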

4. Centralized Logging with Loki

Add Loki to Monitoring Stack:
[YAML code block, 28 lines]
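A sketch of the two extra Compose services, assuming the official images and their bundled default config paths. The local Promtail config file name and the volume name are placeholders.

```yaml
services:
  loki:
    image: grafana/loki:latest         # placeholder tag, pin a version
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - loki_data:/loki
    ports:
      - "3100:3100"

  promtail:
    image: grafana/promtail:latest
    command: -config.file=/etc/promtail/config.yml
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml:ro
      # Needed so Promtail can discover containers through the Docker API.
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on:
      - loki

volumes:
  loki_data:
```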
Promtail Configuration:
[YAML code block, 28 lines]
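As an illustration, a Promtail configuration that discovers containers through the Docker socket could look like this. It assumes Loki is reachable as `loki:3100` and that containers are started by Compose (which sets the `com.docker.compose.service` label); the `container` and `service` label names are choices, not requirements.

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml        # where Promtail remembers read offsets

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Docker reports names as "/api"; strip the leading slash.
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
      # Expose the Compose service name as a queryable label.
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: service
```

Keeping the label set this small follows the usual Loki advice: every extra label multiplies the number of streams the index has to track.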
Loki Configuration:
[YAML code block, 45 lines]
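A hedged sketch of a single-node Loki configuration with retention switched on. It assumes Loki 3.x (where `delete_request_store` is required once retention is enabled), local filesystem storage under `/loki`, and a 7-day retention window; older Loki versions use different schema and compactor settings, so check the reference for your version.

```yaml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory                  # single instance, no external KV store
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules

schema_config:
  configs:
    - from: 2024-01-01                 # illustrative start date
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 168h               # 7 days, an assumed value

compactor:
  working_directory: /loki/compactor
  retention_enabled: true              # without this, nothing is deleted
  delete_request_store: filesystem
```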
LogQL Queries in Grafana:
[LogQL code block, 20 lines]

5. Grafana Dashboard for Docker

Dashboard JSON (simplified):
[JSON code block, 64 lines]
Grafana Provisioning:
[YAML code block, 15 lines]
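Grafana reads provisioning files at startup, so data sources and dashboards do not have to be clicked together by hand. A sketch of the two files involved, shown as two YAML documents; the file names, provider name, and dashboard path are assumptions, and both directories would need to be mounted into the Grafana container.

```yaml
# /etc/grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090        # assumes a Compose service named "prometheus"
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100              # assumes a Compose service named "loki"
---
# /etc/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: default                      # illustrative provider name
    type: file
    options:
      path: /var/lib/grafana/dashboards   # directory holding dashboard JSON files
```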

6. Troubleshooting Runbook

Systematic Troubleshooting Process:
┌─────────────────────────────────────────────────────────────────┐
│                  TROUBLESHOOTING FLOWCHART                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   SYMPTOM DETECTED                                              │
│         │                                                       │
│         ▼                                                       │
│   ┌─────────────────┐                                           │
│   │ Check Dashboard │ → Memory, CPU, Network, Errors            │
│   └────────┬────────┘                                           │
│            │                                                    │
│            ▼                                                    │
│   ┌─────────────────┐                                           │
│   │  Check Logs     │ → Recent errors, patterns                 │
│   └────────┬────────┘                                           │
│            │                                                    │
│            ▼                                                    │
│   ┌─────────────────┐                                           │
│   │ Check Resources │ → docker stats, df -h, free               │
│   └────────┬────────┘                                           │
│            │                                                    │
│            ▼                                                    │
│   ┌─────────────────┐                                           │
│   │ Deep Dive       │ → exec, inspect, strace                   │
│   └────────┬────────┘                                           │
│            │                                                    │
│            ▼                                                    │
│   ┌─────────────────┐                                           │
│   │ Take Action     │ → Restart, Scale, Rollback                │
│   └─────────────────┘                                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Quick Diagnosis Script:
[Bash code block, 37 lines]

7. Common Issues and Solutions

Issue: Container OOM Killed
[Bash code block, 13 lines]
Issue: Container Restarting Loop
[Bash code block, 18 lines]
Issue: High CPU Usage
[Bash code block, 21 lines]
Issue: Network Connectivity
[Bash code block, 19 lines]

8. Production Monitoring Checklist

[YAML code block, 44 lines]

⚠️ Common Mistakes

Mistake 1: No Alerts Until Outage

[YAML code block, 7 lines]
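One way to hear about a problem before the outage is to alert on the trend rather than the current value. A sketch using PromQL's `predict_linear`, assuming cAdvisor metrics; the 6-hour look-back, the 24-hour horizon, and the alert name are illustrative choices.

```yaml
groups:
  - name: early_warning               # illustrative group name
    rules:
      - alert: ContainerMemoryWillHitLimit
        # Extrapolate the last 6h of usage 24h ahead and compare with the limit.
        # The "> 0" filter skips containers without a memory limit.
        expr: |
          predict_linear(container_memory_usage_bytes{name!=""}[6h], 24 * 3600)
            > (container_spec_memory_limit_bytes{name!=""} > 0)
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: 'Container {{ $labels.name }} is on track to reach its memory limit within 24h'
```

A slow leak like the one in the production story would trip this kind of rule days before the container is actually killed.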

Mistake 2: Alert Fatigue

[YAML code block, 9 lines]

Mistake 3: No Log Retention Policy

[YAML code block, 14 lines]
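Without a retention policy, container logs on the host grow until the disk fills. A sketch of rotation settings for Docker's `json-file` logging driver in Compose; the service name, image, and sizes are placeholders.

```yaml
services:
  api:
    image: example/api:1.0.0          # hypothetical image
    logging:
      driver: json-file
      options:
        max-size: "10m"               # rotate each log file at 10 MB
        max-file: "3"                 # keep at most 3 rotated files per container
```

With these values each container can hold roughly 30 MB of logs on the host, which makes disk usage predictable.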

Mistake 4: Missing Context in Logs

[JavaScript code block, 12 lines]

Mistake 5: No Baseline

[YAML code block, 8 lines]
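A baseline can be as simple as comparing a metric with its own value a week earlier. The following sketch assumes cAdvisor metrics and at least one week of Prometheus retention; the 1.5x factor and the alert name are arbitrary starting points.

```yaml
groups:
  - name: baseline_alerts             # illustrative group name
    rules:
      - alert: ContainerMemoryAboveWeeklyBaseline
        # Hourly average now versus the same hour one week ago, per container name.
        expr: |
          sum by (name) (avg_over_time(container_memory_usage_bytes{name!=""}[1h]))
            > 1.5 * sum by (name) (avg_over_time(container_memory_usage_bytes{name!=""}[1h] offset 1w))
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: 'Container {{ $labels.name }} uses 50% more memory than a week ago'
```

Aggregating by `name` keeps the comparison working even if the container was recreated during the week and received a new ID.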

🐛 Debug This

Production alert: "Container api memory > 90%". Dashboard shows memory climbing steadily for the past 3 days.

Day 1: 40% → Day 2: 60% → Day 3: 80% → Now: 92%

No code changes were deployed. What's happening and how do you fix it?

Diagnosis: This is a classic memory leak pattern. Steady growth with no code changes usually points to one of:
  1. Application-level memory leak
  2. Connection leak (database, redis, etc.)
  3. Cache without eviction
  4. Event listeners not cleaned up
Investigation steps:
[Bash code block, 19 lines]
Common fixes:
  1. Restart immediately to restore service:
    [Bash code block]
  2. Investigate root cause:
    • Add memory profiling to application
    • Check for unclosed database connections
    • Look for growing arrays/objects
    • Check event listener cleanup
  3. Add protection:
    [YAML code block, 7 lines]
  4. Long-term fix: Profile application, find and fix the leak.
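The protection step above can be sketched in Compose as a hard memory limit plus a restart policy, so a leaking container is killed and restarted at a known ceiling instead of starving the host. The service name, image, and sizes are assumptions.

```yaml
services:
  api:
    image: example/api:1.0.0          # hypothetical image
    restart: unless-stopped           # bring the container back after an OOM kill
    deploy:
      resources:
        limits:
          memory: 2G                  # hard ceiling enforced by the cgroup
        reservations:
          memory: 1G                  # memory reserved for this service
```

This buys time but does not fix anything: the leak still has to be profiled and removed.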

💻 Exercises

Exercise 1: Basic Monitoring Stack

Set up Prometheus + Grafana + cAdvisor for a simple web application. Create a dashboard showing:

  • Container CPU usage
  • Container memory usage
  • Network I/O

Exercise 2: Alerting

Add Alertmanager and configure alerts for:

  • Container memory > 85%
  • Container restarting > 3 times/hour
  • Host disk space < 10%

Exercise 3: Centralized Logging

Add Loki and Promtail. Create LogQL queries for:

  • All errors from a specific service
  • Request latency > 1s
  • Count of errors per minute

Exercise 4: Custom Metrics

Add application metrics (request count, latency histogram) and:

  • Expose them on /metrics endpoint
  • Scrape with Prometheus
  • Create dashboard panel

Exercise 5: Runbook

Create a troubleshooting runbook for:

  • High memory usage
  • Service unreachable
  • High error rate

Include detection, diagnosis, and remediation steps.

🎤 Interview Questions

Q1: How do you monitor Docker containers in production?

Answer: A complete monitoring stack includes:
  1. Metrics Collection:
    • cAdvisor for container metrics (CPU, memory, network, disk)
    • node-exporter for host metrics
    • Application metrics exposed on /metrics
  2. Storage and Querying:
    • Prometheus for metrics
    • Loki for logs
    • Retention policies for both
  3. Visualization:
    • Grafana dashboards
    • Key panels: resource usage, error rates, latency
  4. Alerting:
    • Prometheus alerting rules
    • Alertmanager for routing
    • Slack/PagerDuty integration

Key metrics to watch:

  • Memory usage % of limit (alert at 85%)
  • Container restart count
  • Error rate and latency
  • Disk space

Q2: How do you troubleshoot a container that's using high CPU?

Answer: Systematic approach:
  1. Identify the container:
    [Bash code block]
  2. Check what process is consuming CPU:
    [Bash code block, 2 lines]
  3. Analyze the application:
    [Bash code block, 4 lines]
  4. Check for patterns:
    • Spike vs sustained high CPU
    • Correlation with traffic
    • Recent changes
  5. Take action:
    • Scale horizontally if legitimate load
    • Add CPU limits to prevent hogging
    • Fix code if inefficiency found
    • Rollback if recent deployment caused it

Q3: What's the difference between container memory usage and the memory limit?

Answer:
Memory Usage (container_memory_usage_bytes):
  • Current memory used by the container
  • Includes RSS, cache, swap
Memory Limit (container_spec_memory_limit_bytes):
  • Maximum allowed by cgroup
  • Set via --memory or compose deploy.resources.limits.memory
Why this matters:
[PromQL code block, 2 lines]

When usage approaches limit:

  • 70-85%: Warning zone
  • 85-95%: Critical, action needed
  • >95%: OOM kill imminent

Best practice: Alert on percentage, not absolute value. A container using 500MB might be fine (if limit is 2GB) or critical (if limit is 512MB).
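The zones above translate fairly directly into two alert rules on the percentage of the limit. This is a sketch under the assumption that cAdvisor metrics are available; the alert name and `for` durations are illustrative.

```yaml
groups:
  - name: memory_percent_alerts       # illustrative group name
    rules:
      - alert: ContainerMemoryHigh
        expr: |
          100 * container_memory_usage_bytes{name!=""}
            / (container_spec_memory_limit_bytes{name!=""} > 0) > 85
        for: 5m
        labels:
          severity: warning           # critical zone begins, action needed
        annotations:
          summary: 'Container {{ $labels.name }} above 85% of its memory limit'

      - alert: ContainerMemoryHigh
        expr: |
          100 * container_memory_usage_bytes{name!=""}
            / (container_spec_memory_limit_bytes{name!=""} > 0) > 95
        for: 1m
        labels:
          severity: critical          # OOM kill imminent
        annotations:
          summary: 'Container {{ $labels.name }} above 95% of its memory limit'
```

Because the expression divides by the limit, the same rule works for a 512MB container and a 2GB container without per-service thresholds.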

Q4: How do you implement log aggregation for Docker containers?

Answer: Using the Loki stack:
  1. Deploy Loki (log database):
    [YAML code block, 5 lines]
  2. Deploy Promtail (log collector):
    [YAML code block, 5 lines]
  3. Configure discovery:
    [YAML code block, 4 lines]
  4. Query in Grafana:
    [LogQL code block, 2 lines]
Key considerations:
  • Set retention explicitly (for example, 7 days); unless retention is enabled, Loki keeps logs indefinitely
  • Index only necessary labels
  • Use JSON logging for structured queries

Q5: How do you prevent alert fatigue while still catching real issues?

Answer: Strategic alert configuration:
  1. Meaningful thresholds:
    [YAML code block, 6 lines]
  2. Appropriate severity levels:
    • Critical: Pages on-call immediately
    • Warning: Slack notification, review next day
    • Info: Dashboard only
  3. Alert on symptoms, not causes:
    [YAML code block, 7 lines]
  4. Group related alerts:
    [YAML code block, 3 lines]
  5. Regular review: If an alert never fires, consider removing it. If it fires daily and is ignored, fix the threshold.
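To illustrate the "symptoms, not causes" point together with a `for` duration that suppresses brief spikes, here is a sketch of an error-rate alert. It assumes the application exports a counter named `http_requests_total` with a `status` label, which is a common convention but hypothetical here; the 5% threshold and 10-minute window are also assumptions.

```yaml
groups:
  - name: symptom_alerts              # illustrative group name
    rules:
      - alert: HighErrorRate
        # Share of requests answered with a 5xx status over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m                      # ignore spikes shorter than 10 minutes
        labels:
          severity: critical
        annotations:
          summary: 'More than 5% of requests are failing'
```

An alert like this fires when users are affected, whatever the underlying cause turns out to be (memory, CPU, a bad deploy, a dependency).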

📝 Summary & Key Takeaways

Monitoring Stack

  • Metrics: Prometheus + cAdvisor
  • Logs: Loki + Promtail
  • Visualization: Grafana
  • Alerts: Alertmanager

Key Metrics

Metric          Warning    Critical
Memory %        85%        95%
CPU %           80%        90%
Disk %          80%        90%
Restart count   3/hour     5/hour

Troubleshooting Flow

  1. Check dashboards
  2. Check logs
  3. Check resources
  4. Deep dive (exec, inspect)
  5. Take action (restart, scale, rollback)

📋 Quick Reference

[Bash code block, 13 lines]
[PromQL code block, 4 lines]

📅 Review Schedule

  • Day 1: Set up Prometheus + Grafana + cAdvisor
  • Day 3: Create basic dashboard
  • Day 7: Add alerting rules
  • Day 14: Add centralized logging
  • Day 30: Build complete runbook
