Monitoring & Operations
At a Glance
| Aspect | Details |
|---|---|
| Goal | Production-ready monitoring and operational excellence |
| Metrics Exposure | JMX (native), Prometheus (via exporters), Micrometer (Spring) |
| Key Broker Metrics | UnderReplicatedPartitions, ActiveControllerCount, RequestQueueSize |
| Key Consumer Metrics | ConsumerLag, records-lag-max, commit-rate |
| Key Producer Metrics | record-send-rate, record-error-rate, batch-size-avg |
| Alerting Strategy | Symptoms first, then causes; page on customer impact |
| Prerequisites | Parts 1-13 (cluster, producers, consumers) |
What You'll Learn
- Essential broker, producer, and consumer metrics
- Setting up Prometheus + Grafana monitoring stack
- Consumer lag monitoring with dedicated tools
- Alerting strategies that reduce noise
- Operational runbooks for common issues
- Capacity planning and performance tuning
- Spring Boot Kafka metrics integration
Production Story: The Silent 2-Hour Lag
"Why are customers seeing stale inventory?" The e-commerce team was confused. Their real-time inventory system showed items in stock that had sold out hours ago.
Mental Model: The Observability Triangle
             ┌─────────────────────────────────────────┐
             │           KAFKA OBSERVABILITY           │
             └─────────────────────────────────────────┘

  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
  │     METRICS     │    │      LOGS       │    │     TRACES      │
  │                 │    │                 │    │                 │
  │  • Counters     │    │  • Broker logs  │    │  • End-to-end   │
  │  • Gauges       │    │  • App logs     │    │  • Correlation  │
  │  • Histograms   │    │  • Audit logs   │    │  • Latency      │
  └────────┬────────┘    └────────┬────────┘    └────────┬────────┘
           │                      │                      │
           └──────────────────────┼──────────────────────┘
                                  │
                  ┌───────────────┴───────────────┐
                  │                               │
                  ▼                               ▼
          ┌───────────────┐               ┌───────────────┐
          │  DASHBOARDS   │               │    ALERTS     │
          │               │               │               │
          │  • Grafana    │               │  • PagerDuty  │
          │  • Datadog    │               │  • OpsGenie   │
          │  • Kibana     │               │  • Slack      │
          └───────────────┘               └───────────────┘
The Metrics Pipeline
┌──────────────────────────────────────────────────────────────────────────┐
│                         METRICS COLLECTION FLOW                          │
└──────────────────────────────────────────────────────────────────────────┘

  KAFKA CLUSTER                       EXPORTERS                  STORAGE
┌─────────────────┐              ┌─────────────────┐         ┌─────────────┐
│                 │              │                 │         │             │
│ Broker 1 (JMX)  │─────────────▶│  JMX Exporter   │────────▶│             │
│ :9999           │              │  :7071          │         │ Prometheus  │
│                 │              │                 │         │             │
├─────────────────┤              ├─────────────────┤         │             │
│                 │              │                 │         │ (scrapes    │
│ Broker 2 (JMX)  │─────────────▶│  JMX Exporter   │────────▶│  every      │
│ :9999           │              │  :7071          │         │  15-30s)    │
│                 │              │                 │         │             │
├─────────────────┤              ├─────────────────┤         │             │
│                 │              │                 │         │             │
│ Broker 3 (JMX)  │─────────────▶│  JMX Exporter   │────────▶│             │
│ :9999           │              │  :7071          │         │             │
└─────────────────┘              └─────────────────┘         └──────┬──────┘
                                                                    │
  APPLICATIONS                                                      │
┌─────────────────┐                                                 │
│ Spring Boot     │                                                 │
│ App             │─────────────────────────────────────────────────┤
│ :8080/actuator  │                                                 │
│ /prometheus     │                                                 │
└─────────────────┘                                                 │
                                                                    ▼
  LAG MONITORING                                             ┌─────────────┐
┌─────────────────┐                                          │   Grafana   │
│                 │                                          │             │
│ Burrow /        │──────────────────────────────────────────│ Dashboards  │
│ Kafka Lag       │                                          │  & Alerts   │
│ Exporter        │                                          │             │
└─────────────────┘                                          └─────────────┘
Deep Dive: Broker Metrics
Critical Broker Metrics
| Metric | What It Means | Watch For |
|---|---|---|
| UnderReplicatedPartitions | Followers falling behind leaders | ⚠️ >0 for an extended time = replication issues or broker problems |
| OfflinePartitionsCount | Partitions with no leader | 🚨 >0 = DATA UNAVAILABLE |
| ActiveControllerCount | Number of active controllers | ✅ Should be exactly 1 cluster-wide |
| UncleanLeaderElectionsPerSec | Elections from out-of-sync replicas | 🚨 >0 = POTENTIAL DATA LOSS |
| IsrShrinksPerSec / IsrExpandsPerSec | Rate of ISR changes | ⚠️ High rate = unstable replication |
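The thresholds in this table can be turned into a small classifier. The following is a minimal sketch, not the chapter's own code: the class name `BrokerHealth`, the `Severity` labels, and the choice of which conditions count as critical are assumptions for illustration. It takes metric values that have already been scraped (it does not connect to Kafka) and leaves the "for an extended time" condition to the alerting layer.

```java
/** Classifies cluster health from already-collected broker metric values (hypothetical helper). */
final class BrokerHealth {

    enum Severity { OK, WARN, CRITICAL }

    /**
     * @param offlinePartitions      OfflinePartitionsCount as reported by the active controller
     * @param activeControllerSum    ActiveControllerCount summed over all brokers
     * @param uncleanElectionsPerSec UncleanLeaderElectionsPerSec rate
     * @param underReplicated        UnderReplicatedPartitions summed over all brokers
     */
    static Severity classify(int offlinePartitions, int activeControllerSum,
                             double uncleanElectionsPerSec, int underReplicated) {
        // Partitions without a leader: data is unavailable right now.
        if (offlinePartitions > 0) return Severity.CRITICAL;
        // Exactly one controller should be active cluster-wide.
        if (activeControllerSum != 1) return Severity.CRITICAL;
        // Any unclean election means potential data loss.
        if (uncleanElectionsPerSec > 0.0) return Severity.CRITICAL;
        // Under-replication is a warning here; duration is not modelled in this sketch.
        if (underReplicated > 0) return Severity.WARN;
        return Severity.OK;
    }

    public static void main(String[] args) {
        System.out.println(classify(0, 1, 0.0, 0)); // OK
        System.out.println(classify(0, 1, 0.0, 3)); // WARN
        System.out.println(classify(2, 1, 0.0, 0)); // CRITICAL
    }
}
```

In practice the same conditions usually live in Prometheus alerting rules; the Java form is only meant to make the decision order explicit.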
JMX Exporter Configuration
[Code listing: YAML, 55 lines]
Broker Startup with JMX Exporter
[Code listing: Bash, 5 lines]
Deep Dive: Producer Metrics
Spring Kafka Producer Metrics
[Code listing: Java, 32 lines]
Critical Producer Metrics
| Metric | What It Means | Watch For |
|---|---|---|
| record-send-rate | Records sent per second | 📊 Baseline for throughput |
| record-error-rate | Failed sends per second | 🚨 >0 sustained = problem |
| record-retry-rate | Retries per second | ⚠️ High rate = network/broker issues |
| request-latency-avg | Average request time to broker | 📊 Monitor for degradation |
| batch-size-avg | Average batch size in bytes | 📊 Tune linger.ms/batch.size |
| buffer-available-bytes | Free buffer memory | 🚨 Near 0 = producer blocked |
| waiting-threads | Threads blocked on buffer | 🚨 >0 = buffer exhaustion |
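The two buffer rows are easier to alert on as a single utilization figure. Below is a minimal sketch under stated assumptions: `ProducerBufferCheck` and its method names are hypothetical, the inputs are the producer metrics `buffer-total-bytes`, `buffer-available-bytes`, and `waiting-threads` read from wherever you collect them, and the 0.8 threshold in the demo is only an example value.

```java
/** Derives producer buffer pressure from already-collected metric values (hypothetical helper). */
final class ProducerBufferCheck {

    /** Fraction of the producer buffer currently in use, between 0.0 and 1.0. */
    static double utilization(double bufferTotalBytes, double bufferAvailableBytes) {
        if (bufferTotalBytes <= 0) {
            throw new IllegalArgumentException("bufferTotalBytes must be positive");
        }
        return 1.0 - (bufferAvailableBytes / bufferTotalBytes);
    }

    /** True when threads are already blocked on the buffer or utilization exceeds the threshold. */
    static boolean underBackpressure(double bufferTotalBytes, double bufferAvailableBytes,
                                     double waitingThreads, double utilizationThreshold) {
        return waitingThreads > 0
                || utilization(bufferTotalBytes, bufferAvailableBytes) > utilizationThreshold;
    }

    public static void main(String[] args) {
        double total = 32 * 1024 * 1024;     // 32 MiB buffer (example value)
        double available = 8 * 1024 * 1024;  // 8 MiB still free
        System.out.println(utilization(total, available));               // 0.75
        System.out.println(underBackpressure(total, available, 0, 0.8)); // false
    }
}
```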
Custom Producer Metrics Service
[Code listing: Java, 88 lines]
Deep Dive: Consumer Metrics
The Most Important Metric: Consumer Lag
┌──────────────────────────────────────────────────────────────────────────┐
│                          CONSUMER LAG EXPLAINED                          │
└──────────────────────────────────────────────────────────────────────────┘

Topic: orders, Partition: 0

Offsets:    0   1   2   3   4   5   6   7   8   9  10  11  12
          ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
Messages: │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ ░ │ ░ │ ░ │ ░ │ ░ │   │
          └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
                                        ▲                   ▲
                                        │                   │
                               Consumer Offset = 7 Log End Offset = 12
                                (last committed)    (latest produced)
                                        └─────────┬─────────┘
                                                  │
                                          CONSUMER LAG = 5
                                         (messages behind)

✓ = Processed and committed
░ = Produced but not yet consumed
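The arithmetic in the diagram is a subtraction per partition, summed over the group. Here is a minimal sketch of that calculation; `LagCalculator` is a hypothetical name, and the offsets are passed in as plain maps keyed by partition number rather than fetched from a live cluster.

```java
import java.util.Map;
import java.util.TreeMap;

/** Computes consumer lag from offsets that were already fetched elsewhere (hypothetical helper). */
final class LagCalculator {

    /** Lag for one partition: log end offset minus committed offset, never negative. */
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    /** Sums per-partition lag; both maps are keyed by partition number. */
    static long totalLag(Map<Integer, Long> logEndOffsets, Map<Integer, Long> committedOffsets) {
        long total = 0L;
        for (Map.Entry<Integer, Long> entry : logEndOffsets.entrySet()) {
            Long committed = committedOffsets.get(entry.getKey());
            if (committed == null) {
                throw new IllegalArgumentException(
                        "No committed offset for partition " + entry.getKey());
            }
            total += lag(entry.getValue(), committed);
        }
        return total;
    }

    public static void main(String[] args) {
        // The partition from the diagram: committed offset 7, log end offset 12.
        System.out.println(lag(12, 7)); // 5

        Map<Integer, Long> end = new TreeMap<>();
        Map<Integer, Long> committed = new TreeMap<>();
        end.put(0, 12L);
        committed.put(0, 7L);
        end.put(1, 20L);
        committed.put(1, 20L);
        System.out.println(totalLag(end, committed)); // 5
    }
}
```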
Spring Kafka Consumer Metrics
[Code listing: Java, 42 lines]
Critical Consumer Metrics
| Metric | What It Means | Watch For |
|---|---|---|
| records-lag-max | Maximum lag across all partitions | 🚨 Growing = falling behind |
| records-consumed-rate | Records consumed per second | 📊 Compare with producer rate |
| fetch-latency-avg | Time to fetch from broker | 📊 Network/broker health |
| commit-rate | Commits per second | 📊 Should match consumption pattern |
| rebalance-rate-per-hour | Rebalances per hour | ⚠️ >1-2/hour = instability |
| last-poll-seconds-ago | Time since last poll | 🚨 Longer than max.poll.interval.ms = consumer is removed from the group |
Custom Lag Monitoring Service
[Code listing: Java, 94 lines]
Deep Dive: Dedicated Lag Monitoring Tools
Burrow Setup
Burrow is LinkedIn's open-source consumer lag monitoring tool.
[Code listing: YAML, 52 lines]
Burrow API Usage
[Code listing: Java, 72 lines]
Kafka Lag Exporter (Prometheus-native)
[Code listing: YAML, 11 lines]
[Code listing: HOCON, 30 lines]
Deep Dive: Spring Boot Actuator Integration
Complete Metrics Configuration
[Code listing: YAML, 34 lines]
Health Indicator for Kafka
[Code listing: Java, 93 lines]
Deep Dive: Alerting Strategy
The Alerting Pyramid
┌──────────────────────────────────────────────────────────────────────────┐
│                             ALERTING PYRAMID                             │
└──────────────────────────────────────────────────────────────────────────┘

                          ┌─────────┐
                          │  PAGE   │  ← Customer impact
                          │ (Wake)  │    Immediate action needed
                          └────┬────┘
                               │
                     ┌─────────┴─────────┐
                     │      TICKET       │  ← Degradation
                     │   (Next shift)    │    Needs attention soon
                     └─────────┬─────────┘
                               │
                ┌──────────────┴──────────────┐
                │            SLACK            │  ← Warning signs
                │       (Informational)       │    Worth knowing
                └──────────────┬──────────────┘
                               │
       ┌───────────────────────┴───────────────────────┐
       │                   DASHBOARD                   │  ← Everything
       │               (Always visible)                │    For investigation
       └───────────────────────────────────────────────┘
Prometheus Alerting Rules
[Code listing: YAML, 102 lines]
AlertManager Configuration
[Code listing: YAML, 55 lines]
Deep Dive: Grafana Dashboards
Key Dashboard Panels (JSON)
[Code listing: JSON, 81 lines]
Deep Dive: Operational Runbooks
Runbook: High Consumer Lag
SYMPTOMS:
  - Consumer lag metric > threshold (e.g., 100,000 messages)
  - Processing latency increasing
  - Data freshness complaints

DIAGNOSIS STEPS:

  1. Check if consumer is running
     $ kafka-consumer-groups.sh --bootstrap-server <broker> \
         --describe --group <group-id>

     Look for:
     - CONSUMER-ID column: Should show active consumers
     - LAG column: Per-partition lag
     - LOG-END-OFFSET vs CURRENT-OFFSET: Gap = lag

  2. Check consumer health
     $ curl http://<app>:8080/actuator/health

     Look for:
     - Kafka health indicator status
     - Thread pool exhaustion
     - Memory pressure

  3. Check processing rate
     $ curl http://<app>:8080/actuator/prometheus | grep kafka_consumer

     Look for:
     - records-consumed-rate: Should be > 0
     - poll-rate: Should be consistent
     - commit-rate: Should be > 0

  4. Check for rebalancing
     $ kafka-consumer-groups.sh --bootstrap-server <broker> \
         --describe --group <group-id> --state

     Look for:
     - State: Should be "Stable", not "PreparingRebalance"

  5. Check downstream dependencies
     - Database response times
     - External API latency
     - Queue depths

REMEDIATION:

  If consumer stopped:
  - Check logs for exceptions
  - Restart consumer application
  - Verify connectivity to Kafka

  If processing slow:
  - Increase consumer concurrency (more partitions/consumers)
  - Optimize processing logic
  - Scale downstream services
  - Consider batch processing

  If continuous rebalancing:
  - Increase session.timeout.ms
  - Increase max.poll.interval.ms
  - Reduce max.poll.records
  - Check for consumer crashes

  If producer rate spike:
  - Scale consumer instances
  - Enable consumer batching
  - Consider backpressure

ESCALATION:
  - If lag continues growing after 30 minutes: Page on-call
  - If data loss suspected: Notify data team
Runbook: Under-Replicated Partitions
SYMPTOMS:
  - UnderReplicatedPartitions metric > 0
  - IsrShrinksPerSec increasing
  - Produce latency may increase (with acks=all)

DIAGNOSIS STEPS:

  1. Identify affected partitions
     $ kafka-topics.sh --bootstrap-server <broker> \
         --describe --under-replicated-partitions

  2. Check broker health
     $ kafka-broker-api-versions.sh --bootstrap-server <broker>

     For each broker, check:
     - Is broker responding?
     - Response time normal?

  3. Check broker logs
     $ tail -f /var/log/kafka/server.log | grep -E "(ERROR|WARN)"

     Look for:
     - Disk full errors
     - Network timeouts
     - OutOfMemory errors

  4. Check disk space
     $ df -h /var/kafka-logs

     Should have > 20% free

  5. Check replication lag
     $ kafka-replica-verification.sh --broker-list <brokers> \
         --topics-include ".*"

  6. Check network between brokers
     $ ping <other-broker>
     $ nc -zv <other-broker> 9092

REMEDIATION:

  If disk full:
  - Delete old log segments (if retention allows)
  - Expand disk
  - Move partitions to other brokers

  If broker down:
  - Restart broker
  - Check for hardware issues
  - Replace broker if needed

  If network issues:
  - Check firewall rules
  - Verify DNS resolution
  - Contact network team

  If broker overloaded:
  - Reassign partitions to spread load
  - Add more brokers
  - Increase broker resources

VERIFICATION:
  - UnderReplicatedPartitions should return to 0
  - IsrExpandsPerSec should spike (replicas catching up)
  - All partitions should show full ISR

ESCALATION:
  - If affects > 10% of partitions: Page on-call
  - If not resolved in 15 minutes: Page platform lead
Deep Dive: Capacity Planning
Sizing Formula
DISK CAPACITY:

  Daily data = (messages/second) × (avg message size) × 86400 seconds
  Total disk = Daily data × retention days × replication factor × 1.2 (overhead)

  Example:
    1000 msg/s × 1KB × 86400 = 86.4 GB/day
    86.4 GB × 7 days × 3 replicas × 1.2 = 2.18 TB total

NETWORK BANDWIDTH:

  Inbound  = messages/second × avg message size × replication factor
  Outbound = messages/second × avg message size × consumer count

  Example:
    Inbound:  1000 msg/s × 1KB × 3 = 3 MB/s
    Outbound: 1000 msg/s × 1KB × 5 consumers = 5 MB/s

PARTITION COUNT:

  partitions = max(
    throughput_required / throughput_per_partition,
    consumer_instances
  )

  Rule of thumb: 1 partition can handle ~10 MB/s

  Example:
    50 MB/s throughput needed, 10 consumers
    partitions = max(50/10, 10) = 10 partitions

BROKER COUNT:

  brokers = max(
    total_partitions × replication_factor / partitions_per_broker,
    total_disk_needed / disk_per_broker,
    total_network / network_per_broker
  )

  Rule of thumb: 4000 partitions per broker max
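The disk, network, and partition formulas above translate directly into code. The sketch below is illustrative: `CapacityPlanner` and its method names are hypothetical, and it treats 1 KB as 1000 bytes so that the results match the worked examples (86.4 GB/day, about 2.18 TB, 3 MB/s, 5 MB/s, 10 partitions).

```java
/** Back-of-the-envelope Kafka sizing using the formulas above (hypothetical helper). */
final class CapacityPlanner {

    static final double SECONDS_PER_DAY = 86_400.0;

    /** Bytes written per day before replication. */
    static double dailyBytes(double messagesPerSecond, double avgMessageBytes) {
        return messagesPerSecond * avgMessageBytes * SECONDS_PER_DAY;
    }

    /** Total disk across the cluster, including replication and an overhead factor such as 1.2. */
    static double totalDiskBytes(double dailyBytes, int retentionDays,
                                 int replicationFactor, double overheadFactor) {
        return dailyBytes * retentionDays * replicationFactor * overheadFactor;
    }

    /** Inbound bytes per second, counting produce traffic plus replication. */
    static double inboundBytesPerSecond(double messagesPerSecond, double avgMessageBytes,
                                        int replicationFactor) {
        return messagesPerSecond * avgMessageBytes * replicationFactor;
    }

    /** Outbound bytes per second when every consumer reads the full stream. */
    static double outboundBytesPerSecond(double messagesPerSecond, double avgMessageBytes,
                                         int consumerCount) {
        return messagesPerSecond * avgMessageBytes * consumerCount;
    }

    /** Partition count: the larger of the throughput-driven and parallelism-driven minimums. */
    static int partitions(double requiredMBps, double perPartitionMBps, int consumerInstances) {
        int byThroughput = (int) Math.ceil(requiredMBps / perPartitionMBps);
        return Math.max(byThroughput, consumerInstances);
    }

    public static void main(String[] args) {
        double daily = dailyBytes(1000, 1000);                              // 1000 msg/s at 1 KB
        System.out.println(daily / 1e9 + " GB/day");                        // 86.4 GB/day
        System.out.println(totalDiskBytes(daily, 7, 3, 1.2) / 1e12 + " TB");
        System.out.println(inboundBytesPerSecond(1000, 1000, 3) / 1e6 + " MB/s in");
        System.out.println(outboundBytesPerSecond(1000, 1000, 5) / 1e6 + " MB/s out");
        System.out.println(partitions(50, 10, 10) + " partitions");         // 10 partitions
    }
}
```

Rounding the throughput-driven count up with `Math.ceil` is a choice made for this sketch; the formula in the text does not say how to round.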
Capacity Monitoring
[Code listing: Java, 57 lines]
Common Mistakes
1. Alerting on Causes, Not Symptoms
❌ WRONG: Alert on every metric deviation
  - Alert: "BytesIn increased 50%"
  - Alert: "Request latency up 20%"
  - Result: Alert fatigue, ignored alerts

✅ RIGHT: Alert on customer impact
  - Alert: "Consumer lag > threshold" (freshness impact)
  - Alert: "Offline partitions" (availability impact)
  - Alert: "Producer error rate > 1%" (data loss risk)
2. Missing Consumer Lag Monitoring
❌ WRONG: Only monitor application health
  - Consumer health: UP
  - No errors in logs
  - Result: 2 hours behind, nobody knows

✅ RIGHT: Dedicated lag monitoring
  - Burrow or Kafka Lag Exporter
  - Per-partition lag visibility
  - Alerts on growing lag, not just high lag
3. Too Many Dashboards, No Hierarchy
❌ WRONG: One giant dashboard with everything
  - 50 panels
  - Takes 30 seconds to load
  - Nobody knows where to look

✅ RIGHT: Dashboard hierarchy
  - L1: Executive overview (4-6 panels)
  - L2: Service-level view (topic/group focused)
  - L3: Deep-dive troubleshooting
4. Not Monitoring the Monitoring
❌ WRONG: Assume monitoring always works
  - JMX exporter crashes
  - Prometheus storage full
  - Nobody knows until outage

✅ RIGHT: Meta-monitoring
  - Alert on missing metrics
  - Prometheus storage alerts
  - Exporter health checks
Debug This: The Invisible Consumer
A consumer group shows as "Stable" with active members, but messages are piling up:
$ kafka-consumer-groups.sh --describe --group order-processor

GROUP            TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-processor  orders  0          1000            50000           49000
order-processor  orders  1          1000            50000           49000
order-processor  orders  2          1000            50000           49000

CONSUMER-ID           HOST
order-consumer-1-xxx  /10.0.0.5
order-consumer-2-xxx  /10.0.0.6
order-consumer-3-xxx  /10.0.0.7
What's happening? (Answer below)
Answer
Look at CURRENT-OFFSET: It's exactly 1000 for all partitions. This means:
- Consumers connected and got initial offsets
- Offsets haven't moved since
Possible causes:
- Processing blocked - The first message throws an exception and the consumer is stuck retrying:
  [Code listing: Java, 2 lines]
- Commits failing silently - The consumer processes records but can't commit:
  [Code listing: Java, 2 lines]
- Wrong deserializer - Messages fail to deserialize:
  [Code listing: Java, 2 lines]
- Consumer paused - Deliberately paused and never resumed:
  [Code listing: Java]
[Code listing: Bash, 10 lines]
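The telltale sign in this scenario, committed offsets that stay put while the log end offset keeps moving, can be detected automatically by comparing two snapshots. The following is a minimal sketch under stated assumptions: `StalledPartitionDetector` is a hypothetical name, and the three maps (keyed by partition number) are assumed to come from two consecutive polls of whatever tool supplies your offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Flags partitions whose committed offset did not move although lag remains (hypothetical helper). */
final class StalledPartitionDetector {

    /**
     * @param previousCommitted committed offsets from the earlier snapshot
     * @param currentCommitted  committed offsets from the later snapshot
     * @param logEndOffsets     log end offsets at the time of the later snapshot
     * @return partition numbers, in ascending order, that are stuck with unread messages
     */
    static List<Integer> stalled(Map<Integer, Long> previousCommitted,
                                 Map<Integer, Long> currentCommitted,
                                 Map<Integer, Long> logEndOffsets) {
        List<Integer> result = new ArrayList<>();
        // TreeMap gives a deterministic, sorted iteration order.
        for (Map.Entry<Integer, Long> entry : new TreeMap<>(currentCommitted).entrySet()) {
            Integer partition = entry.getKey();
            Long previous = previousCommitted.get(partition);
            Long end = logEndOffsets.get(partition);
            if (previous == null || end == null) {
                continue; // not enough data to judge this partition
            }
            long current = entry.getValue();
            boolean offsetUnchanged = previous.longValue() == current;
            boolean hasBacklog = end.longValue() > current;
            if (offsetUnchanged && hasBacklog) {
                result.add(partition);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The scenario above: every partition sits at offset 1000 while the log end is 50000.
        Map<Integer, Long> before = Map.of(0, 1000L, 1, 1000L, 2, 1000L);
        Map<Integer, Long> after = Map.of(0, 1000L, 1, 1000L, 2, 1000L);
        Map<Integer, Long> end = Map.of(0, 50000L, 1, 50000L, 2, 50000L);
        System.out.println(stalled(before, after, end)); // [0, 1, 2]
    }
}
```

An idle partition (offset unchanged but no backlog) is deliberately not reported, which keeps this check quiet on low-traffic topics.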
Exercises
Exercise 1: Build a Lag Dashboard
Create a Grafana dashboard that shows:
- Total lag per consumer group (gauge)
- Lag trend over time (graph)
- Lag by partition heatmap
- Alert thresholds visually marked
Exercise 2: Implement Custom Health Check
Build a Spring Boot health indicator that:
- Checks all assigned partitions
- Verifies consumer is within N messages of head
- Reports partition-level details
- Returns DOWN if any partition has critical lag
Exercise 3: Create Alerting Rules
Design Prometheus alerting rules for:
- Consumer stopped (no commits in 10 minutes)
- Producer backpressure (buffer utilization > 80%)
- Broker disk usage > 75%
- Request latency P99 > 500ms
Exercise 4: Build Capacity Report
Create a scheduled job that:
- Calculates current disk usage per topic
- Projects when retention will be exceeded
- Reports partition distribution across brokers
- Warns on imbalanced partitions
Exercise 5: Operational Runbook
Write a runbook for:
- Consumer repeatedly rebalancing
- Include diagnostic commands
- Decision tree for root causes
- Remediation steps for each cause
Interview Questions
Q1: "How would you detect a slow consumer before it impacts users?"
- Lag velocity monitoring - Alert not just on high lag, but on growing lag:
  deriv(kafka_consumer_lag[15m]) > threshold
  This catches problems before lag becomes critical.
- Processing rate comparison - Compare consumer throughput to producer:
  rate(consumed) < rate(produced) * 0.8
  If we are consuming less than 80% of the production rate, we're falling behind.
- Commit frequency - Monitor commits per minute:
  rate(commits) < expected_rate
  A sudden drop indicates a processing slowdown.
- End-to-end latency - Measure time from produce to consume complete:
  - Embed timestamp in message
  - Measure on consumption
  - Alert on P99 latency increase
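To make the lag velocity idea concrete: Prometheus's `deriv()` fits a straight line through the samples in the window and reports its slope. The sketch below does the same least-squares fit in plain Java. `LagTrend`, its method names, and the sample values are hypothetical, and the alert threshold is something you would choose per consumer group.

```java
/** Estimates how fast consumer lag is growing via a least-squares line fit (hypothetical helper). */
final class LagTrend {

    /**
     * @param timesSeconds sample timestamps in seconds
     * @param lagValues    lag observed at each timestamp
     * @return slope of the fitted line, in messages of lag per second
     */
    static double slopePerSecond(double[] timesSeconds, double[] lagValues) {
        if (timesSeconds.length != lagValues.length || timesSeconds.length < 2) {
            throw new IllegalArgumentException("Need at least two paired samples");
        }
        int n = timesSeconds.length;
        double meanT = 0.0;
        double meanLag = 0.0;
        for (int i = 0; i < n; i++) {
            meanT += timesSeconds[i] / n;
            meanLag += lagValues[i] / n;
        }
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            double dt = timesSeconds[i] - meanT;
            numerator += dt * (lagValues[i] - meanLag);
            denominator += dt * dt;
        }
        if (denominator == 0.0) {
            throw new IllegalArgumentException("Timestamps must not all be identical");
        }
        return numerator / denominator;
    }

    /** True when lag grows faster than the given threshold (messages per second). */
    static boolean fallingBehind(double[] timesSeconds, double[] lagValues, double threshold) {
        return slopePerSecond(timesSeconds, lagValues) > threshold;
    }

    public static void main(String[] args) {
        double[] t = {0, 60, 120, 180};          // one sample per minute
        double[] lag = {100, 700, 1300, 1900};   // lag grows by 600 per minute
        System.out.println(slopePerSecond(t, lag));      // 10.0
        System.out.println(fallingBehind(t, lag, 5.0));  // true
    }
}
```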
Q2: "Your monitoring shows UnderReplicatedPartitions > 0 but no offline partitions. What do you check?"
- Which broker's replicas are behind:
  [Code listing: Bash]
  If they all belong to one broker, that broker has issues.
- Broker resource constraints:
  - Disk I/O: Follower can't write fast enough
  - Network: Can't fetch from leader fast enough
  - CPU: If compressed, decompression bottleneck
- Replication lag metric:
  kafka_server_replica_fetcher_max_lag
  Shows how far the slowest follower is behind its leader, in messages.
- Network between brokers:
  - Latency spikes
  - Packet loss
  - Bandwidth saturation
- Leader distribution:
  - One broker leader-heavy = unbalanced fetch load
  - Run preferred replica election if needed.
If it's transient (< 5 minutes) during a rolling restart or deployment, it's usually okay. If it's sustained, there is a real issue.
Q3: "How do you size a Kafka cluster for a new use case?"
- Expected message throughput (msg/sec)
- Average message size
- Peak vs. average ratio
- Retention period
- Replication factor
- Consumer count
Daily volume  = throughput × msg_size × 86400
Total storage = daily × retention × RF × 1.5 (safety margin)
Network in    = throughput × msg_size × RF
Network out   = throughput × msg_size × consumers
Partitions    = max(
                  target_throughput / 10MB_per_partition,
                  consumer_parallelism
                )
- Start with 3 brokers (minimum for HA)
- Each broker: 4000 partitions max
- Consider: disk IOPS, network bandwidth, memory for page cache
- Provision 30% overhead for spikes
- Plan for 2x growth in 12 months
- Set up monitoring from day 1 to validate assumptions
Then iterate based on actual metrics.
Q4: "What metrics would you look at during a Kafka outage?"
- OfflinePartitionsCount: Any data unavailable?
- ActiveControllerCount: Is there a controller?
- Broker up/down status
- Which topics/partitions affected?
- Which consumer groups impacted?
- UnderReplicatedPartitions trend
- Recent deployments or changes
- Broker logs (OutOfMemory, disk full, network)
- ISR shrink/expand rate (replication issues)
- Request latency by type (Produce, Fetch)
- Producer error rates (data loss risk)
- Consumer lag (freshness impact)
- Downstream service health
Q5: "How would you implement end-to-end latency monitoring for Kafka?"
[Code listing: Java, 8 lines]
[Code listing: Java, 7 lines]
- Propagate trace context in headers
- Spans for produce, broker, consume, process
- Visualize in Jaeger/Zipkin
- P50, P95, P99 end-to-end latency
- Breakdown: network vs. queue time vs. processing
- Per-topic and per-consumer-group
The challenge is clock synchronization - use NTP or embed offsets rather than absolute times when possible.
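The measurement side of this answer can be sketched without any Kafka dependency: subtract the timestamp the producer embedded from the time processing finished, then summarize the samples as percentiles. `EndToEndLatency` and its method names are hypothetical, the nearest-rank percentile is one of several common definitions, and clamping negative values to zero is a simplification for clock skew rather than a fix for it.

```java
import java.util.Arrays;

/** Computes end-to-end latency samples and percentiles from embedded timestamps (hypothetical helper). */
final class EndToEndLatency {

    /** Latency of one record; negative results caused by clock skew are clamped to zero. */
    static long latencyMillis(long producedAtEpochMs, long processedAtEpochMs) {
        return Math.max(0L, processedAtEpochMs - producedAtEpochMs);
    }

    /** Nearest-rank percentile, for example pct = 99 for P99. */
    static long percentile(long[] samplesMillis, double pct) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("No samples");
        }
        if (pct <= 0 || pct > 100) {
            throw new IllegalArgumentException("pct must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] samples = {
            latencyMillis(1_000, 1_120),
            latencyMillis(2_000, 2_080),
            latencyMillis(3_000, 3_200),
            latencyMillis(4_000, 4_095)
        };
        System.out.println(percentile(samples, 50)); // 95
        System.out.println(percentile(samples, 99)); // 200
    }
}
```

In a real service these samples would normally be recorded into a metrics timer or histogram so that the percentiles are computed by the monitoring backend.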
Summary & Key Takeaways
The Monitoring Hierarchy
| Priority | Question | Metrics |
|---|---|---|
| 1. Availability | Is data accessible? | OfflinePartitions, ActiveController |
| 2. Durability | Will data survive? | UnderReplicated, ISR changes |
| 3. Freshness | Is data current? | Consumer lag, processing rate |
| 4. Throughput | Is capacity sufficient? | Bytes in/out, request rate |
| 5. Latency | Is it fast enough? | Request times, E2E latency |
Essential Takeaways
- Consumer lag is king - A healthy-looking consumer can be hours behind. Always monitor lag.
- Alert on symptoms, not causes - Page on customer impact (offline partitions, growing lag), not internal metrics (ISR shrinks).
- Use dedicated lag tools - Burrow or Kafka Lag Exporter provide better visibility than application metrics alone.
- Build dashboard hierarchy - Executive overview → Service view → Deep-dive troubleshooting.
- Runbooks save incidents - Document diagnostic steps and remediation for common issues before they happen.
- Capacity planning is continuous - Monitor trends and project forward; don't wait for alerts.
Quick Reference
| Tool | Purpose |
|---|---|
| JMX Exporter | Broker metrics → Prometheus |
| kafka-lag-exporter | Consumer lag → Prometheus |
| Burrow | Consumer lag + health analysis |
| Micrometer | Spring app metrics → Prometheus |
| Grafana | Visualization and alerting |

| Key Commands | |
|---|---|
| Consumer group status | kafka-consumer-groups.sh --describe |
| Under-replicated partitions | kafka-topics.sh --describe --under-replicated-partitions |
| Topic details | kafka-topics.sh --describe |
| Broker API check | kafka-broker-api-versions.sh |