Monitoring & Operations

At a Glance

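┌──────────────────────┬────────────────────────────────────────────────────────────────────┐
│ Aspect               │ Details                                                            │
├──────────────────────┼────────────────────────────────────────────────────────────────────┤
│ Goal                 │ Production-ready monitoring and operational excellence             │
│ Metrics Exposure     │ JMX (native), Prometheus (via exporters), Micrometer (Spring)      │
│ Key Broker Metrics   │ UnderReplicatedPartitions, ActiveControllerCount, RequestQueueSize │
│ Key Consumer Metrics │ ConsumerLag, records-lag-max, commit-rate                          │
│ Key Producer Metrics │ record-send-rate, record-error-rate, batch-size-avg                │
│ Alerting Strategy    │ Symptoms first, then causes; page on customer impact               │
│ Prerequisites        │ Parts 1-13 (cluster, producers, consumers)                         │
└──────────────────────┴────────────────────────────────────────────────────────────────────┘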

What You'll Learn

  • Essential broker, producer, and consumer metrics
  • Setting up Prometheus + Grafana monitoring stack
  • Consumer lag monitoring with dedicated tools
  • Alerting strategies that reduce noise
  • Operational runbooks for common issues
  • Capacity planning and performance tuning
  • Spring Boot Kafka metrics integration

Production Story: The Silent 2-Hour Lag

Wednesday, 3:00 PM. Black Friday preparation.

"Why are customers seeing stale inventory?" The e-commerce team was confused. Their real-time inventory system showed items in stock that had sold out hours ago.

The investigation revealed that a consumer processing inventory updates was 2 hours behind. The consumer was running, heartbeating, and not throwing errors. From the application's perspective, everything was fine.

The root cause? A downstream database had slowed down under increased load. Each message took 500ms to process instead of 50ms. The consumer kept processing, just 10x slower. No errors, no restarts, just steadily falling behind.

The missing piece: consumer lag monitoring. The team had metrics for errors, restarts, and throughput, but not for how far behind consumers were from the latest produced messages.

After adding lag monitoring with Burrow and proper alerting, they caught similar issues within minutes, not hours. The lesson: a healthy-looking consumer can be silently drowning in lag.

Mental Model: The Observability Triangle

                    ┌─────────────────────────────────────────┐
                    │         KAFKA OBSERVABILITY             │
                    └─────────────────────────────────────────┘

    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
    │     METRICS     │    │      LOGS       │    │     TRACES      │
    │                 │    │                 │    │                 │
    │  • Counters     │    │  • Broker logs  │    │  • End-to-end   │
    │  • Gauges       │    │  • App logs     │    │  • Correlation  │
    │  • Histograms   │    │  • Audit logs   │    │  • Latency      │
    └────────┬────────┘    └────────┬────────┘    └────────┬────────┘
             │                      │                      │
             └──────────────────────┼──────────────────────┘
                                    │
                    ┌───────────────┴───────────────┐
                    │                               │
                    ▼                               ▼
           ┌───────────────┐              ┌───────────────┐
           │   DASHBOARDS  │              │    ALERTS     │
           │               │              │               │
           │  • Grafana    │              │  • PagerDuty  │
           │  • Datadog    │              │  • OpsGenie   │
           │  • Kibana     │              │  • Slack      │
           └───────────────┘              └───────────────┘

The Metrics Pipeline

┌──────────────────────────────────────────────────────────────────────────┐
│                        METRICS COLLECTION FLOW                           │
└──────────────────────────────────────────────────────────────────────────┘

  KAFKA CLUSTER                      EXPORTERS                    STORAGE
┌─────────────────┐              ┌─────────────────┐         ┌─────────────┐
│                 │              │                 │         │             │
│  Broker 1 (JMX) │─────────────▶│  JMX Exporter   │────────▶│             │
│  :9999          │              │  :7071          │         │  Prometheus │
│                 │              │                 │         │             │
├─────────────────┤              ├─────────────────┤         │             │
│                 │              │                 │         │  (scrapes   │
│  Broker 2 (JMX) │─────────────▶│  JMX Exporter   │────────▶│   every     │
│  :9999          │              │  :7071          │         │   15-30s)   │
│                 │              │                 │         │             │
├─────────────────┤              ├─────────────────┤         │             │
│                 │              │                 │         │             │
│  Broker 3 (JMX) │─────────────▶│  JMX Exporter   │────────▶│             │
│  :9999          │              │  :7071          │         │             │
└─────────────────┘              └─────────────────┘         └──────┬──────┘
                                                                    │
  APPLICATIONS                                                      │
┌─────────────────┐                                                 │
│  Spring Boot    │                                                 │
│  App            │─────────────────────────────────────────────────┤
│  :8080/actuator │                                                 │
│  /prometheus    │                                                 │
└─────────────────┘                                                 │
                                                                    ▼
  LAG MONITORING                                              ┌─────────────┐
┌─────────────────┐                                           │   Grafana   │
│                 │                                           │             │
│  Burrow /       │───────────────────────────────────────────│  Dashboards │
│  Kafka Lag      │                                           │  & Alerts   │
│  Exporter       │                                           │             │
└─────────────────┘                                           └─────────────┘

Deep Dive: Broker Metrics

Critical Broker Metrics

These metrics indicate cluster health. If any of them is unhealthy, everything built on the cluster is at risk.

┌──────────────────────────────────────────────────────────────────────────┐
│                    CRITICAL BROKER METRICS                               │
├────────────────────────────────┬─────────────────────────────────────────┤
│ Metric                         │ What It Means                           │
├────────────────────────────────┼─────────────────────────────────────────┤
│ UnderReplicatedPartitions      │ Followers falling behind leaders        │
│                                │ ⚠️ >0 for extended time = replication   │
│                                │    issues or broker problems            │
├────────────────────────────────┼─────────────────────────────────────────┤
│ OfflinePartitionsCount         │ Partitions with no leader               │
│                                │ 🚨 >0 = DATA UNAVAILABLE                │
├────────────────────────────────┼─────────────────────────────────────────┤
│ ActiveControllerCount          │ Number of active controllers            │
│                                │ ✅ Should be exactly 1 cluster-wide      │
├────────────────────────────────┼─────────────────────────────────────────┤
│ UncleanLeaderElectionsPerSec   │ Elections from out-of-sync replicas     │
│                                │ 🚨 >0 = POTENTIAL DATA LOSS             │
├────────────────────────────────┼─────────────────────────────────────────┤
│ IsrShrinksPerSec /             │ Rate of ISR changes                     │
│ IsrExpandsPerSec               │ ⚠️ High rate = unstable replication     │
└────────────────────────────────┴─────────────────────────────────────────┘

JMX Exporter Configuration

[Code listing: YAML, 55 lines]

Broker Startup with JMX Exporter

[Code listing: BASH, 5 lines]

Deep Dive: Producer Metrics

Spring Kafka Producer Metrics

[Code listing: JAVA, 32 lines]

Critical Producer Metrics

┌──────────────────────────────────────────────────────────────────────────┐
│                     PRODUCER METRICS TO MONITOR                          │
├────────────────────────────────┬─────────────────────────────────────────┤
│ Metric                         │ What It Means                           │
├────────────────────────────────┼─────────────────────────────────────────┤
│ record-send-rate               │ Records sent per second                 │
│                                │ 📊 Baseline for throughput               │
├────────────────────────────────┼─────────────────────────────────────────┤
│ record-error-rate              │ Failed sends per second                 │
│                                │ 🚨 >0 sustained = problem               │
├────────────────────────────────┼─────────────────────────────────────────┤
│ record-retry-rate              │ Retries per second                      │
│                                │ ⚠️ High rate = network/broker issues    │
├────────────────────────────────┼─────────────────────────────────────────┤
│ request-latency-avg            │ Average request time to broker          │
│                                │ 📊 Monitor for degradation               │
├────────────────────────────────┼─────────────────────────────────────────┤
│ batch-size-avg                 │ Average batch size in bytes             │
│                                │ 📊 Tune linger.ms/batch.size             │
├────────────────────────────────┼─────────────────────────────────────────┤
│ buffer-available-bytes         │ Free buffer memory                      │
│                                │ 🚨 Near 0 = producer blocked            │
├────────────────────────────────┼─────────────────────────────────────────┤
│ waiting-threads                │ Threads blocked on buffer               │
│                                │ 🚨 >0 = buffer exhaustion               │
└────────────────────────────────┴─────────────────────────────────────────┘
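
As a rough illustration of how the last two rows combine into a backpressure signal, here is a small sketch. The class name `ProducerBufferCheck`, the 80% warning threshold, and passing the gauge values in as plain numbers are assumptions made for this example; the inputs are meant to correspond to the producer's buffer-available-bytes, buffer-total-bytes, and waiting-threads gauges.

```java
// Hypothetical helper: turns three producer gauges into a backpressure status.
final class ProducerBufferCheck {
    enum Status { OK, WARNING, CRITICAL }

    private ProducerBufferCheck() {}

    // Fraction of the producer buffer currently in use, clamped to [0, 1].
    static double utilization(double availableBytes, double totalBytes) {
        if (totalBytes <= 0) {
            throw new IllegalArgumentException("totalBytes must be positive");
        }
        double used = 1.0 - (availableBytes / totalBytes);
        return Math.min(1.0, Math.max(0.0, used));
    }

    // Any waiting thread means send() is already blocking, so that case wins.
    // The 0.80 threshold is an assumed value, not a Kafka default.
    static Status classify(double availableBytes, double totalBytes, int waitingThreads) {
        if (waitingThreads > 0) {
            return Status.CRITICAL;
        }
        return utilization(availableBytes, totalBytes) > 0.80 ? Status.WARNING : Status.OK;
    }
}
```

With a 32 MB buffer and 8 MB free, `utilization` is 0.75 and `classify` returns `OK`; once any thread is waiting on the buffer the result is `CRITICAL` regardless of the remaining space.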

Custom Producer Metrics Service

[Code listing: JAVA, 88 lines]

Deep Dive: Consumer Metrics

The Most Important Metric: Consumer Lag

┌──────────────────────────────────────────────────────────────────────────┐
│                         CONSUMER LAG EXPLAINED                           │
└──────────────────────────────────────────────────────────────────────────┘

  Topic: orders, Partition: 0

  Offsets:     0   1   2   3   4   5   6   7   8   9   10  11  12
              ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
  Messages:   │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ ✓ │ ░ │ ░ │ ░ │ ░ │ ░ │   │
              └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
                                        ▲                       ▲
                                        │                       │
                              Consumer Offset = 7        Log End Offset = 12
                              (last committed)           (latest produced)

                              └───────────────┬───────────────┘
                                              │
                                     CONSUMER LAG = 5
                                     (messages behind)

  ✓ = Processed and committed
  ░ = Produced but not yet consumed
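
The subtraction in the diagram can be expressed directly in code. The following is a minimal sketch rather than the article's own monitoring service: the class name `LagMath` and the plain maps keyed by partition number are assumptions made for illustration, and a real service would fetch both sets of offsets from the cluster. Treating a partition with no committed offset as starting from 0 is also a simplification.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: computes consumer lag from two offset snapshots.
final class LagMath {
    private LagMath() {}

    // Lag for one partition: log end offset minus committed offset, never negative.
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    // Per-partition lag, keyed by partition number and sorted for stable output.
    // Partitions without a committed offset are counted from offset 0 (simplification).
    static Map<Integer, Long> lagByPartition(Map<Integer, Long> logEndOffsets,
                                             Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> result = new TreeMap<>();
        for (Map.Entry<Integer, Long> entry : logEndOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            result.put(entry.getKey(), lag(entry.getValue(), committed));
        }
        return result;
    }

    // Sum of lag across all partitions of a topic or group.
    static long totalLag(Map<Integer, Long> lagByPartition) {
        return lagByPartition.values().stream().mapToLong(Long::longValue).sum();
    }
}
```

For the partition in the diagram, `LagMath.lag(12, 7)` gives 5, matching the five unconsumed messages at offsets 7 through 11.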

Spring Kafka Consumer Metrics

[Code listing: JAVA, 42 lines]

Critical Consumer Metrics

┌──────────────────────────────────────────────────────────────────────────┐
│                     CONSUMER METRICS TO MONITOR                          │
├────────────────────────────────┬─────────────────────────────────────────┤
│ Metric                         │ What It Means                           │
├────────────────────────────────┼─────────────────────────────────────────┤
│ records-lag-max                │ Maximum lag across all partitions       │
│                                │ 🚨 Growing = falling behind             │
├────────────────────────────────┼─────────────────────────────────────────┤
│ records-consumed-rate          │ Records consumed per second             │
│                                │ 📊 Compare with producer rate            │
├────────────────────────────────┼─────────────────────────────────────────┤
│ fetch-latency-avg              │ Time to fetch from broker               │
│                                │ 📊 Network/broker health                 │
├────────────────────────────────┼─────────────────────────────────────────┤
│ commit-rate                    │ Commits per second                      │
│                                │ 📊 Should match consumption pattern      │
├────────────────────────────────┼─────────────────────────────────────────┤
│ rebalance-rate-per-hour        │ Rebalances per hour                     │
│                                │ ⚠️ >1-2/hour = instability              │
├────────────────────────────────┼─────────────────────────────────────────┤
│ last-poll-seconds-ago          │ Time since last poll                    │
│                                │ 🚨 > max.poll.interval.ms (as seconds)  │
│                                │    = consumer is removed from the group │
└────────────────────────────────┴─────────────────────────────────────────┘

Custom Lag Monitoring Service

[Code listing: JAVA, 94 lines]

Deep Dive: Dedicated Lag Monitoring Tools

Burrow Setup

Burrow is LinkedIn's open-source consumer lag monitoring tool.

[Code listing: YAML, 52 lines]

Burrow API Usage

[Code listing: JAVA, 72 lines]

Kafka Lag Exporter (Prometheus-native)

[Code listing: YAML, 11 lines]

[Code listing: HOCON, 30 lines]

Deep Dive: Spring Boot Actuator Integration

Complete Metrics Configuration

[Code listing: YAML, 34 lines]

Health Indicator for Kafka

[Code listing: JAVA, 93 lines]

Deep Dive: Alerting Strategy

The Alerting Pyramid

┌──────────────────────────────────────────────────────────────────────────┐
│                        ALERTING PYRAMID                                  │
└──────────────────────────────────────────────────────────────────────────┘

                              ┌─────────┐
                              │  PAGE   │  ← Customer impact
                              │ (Wake)  │    Immediate action needed
                              └────┬────┘
                                   │
                         ┌─────────┴─────────┐
                         │      TICKET       │  ← Degradation
                         │   (Next shift)    │    Needs attention soon
                         └─────────┬─────────┘
                                   │
                    ┌──────────────┴──────────────┐
                    │           SLACK             │  ← Warning signs
                    │     (Informational)         │    Worth knowing
                    └──────────────┬──────────────┘
                                   │
           ┌───────────────────────┴───────────────────────┐
           │                 DASHBOARD                     │  ← Everything
           │              (Always visible)                 │    For investigation
           └───────────────────────────────────────────────┘
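
One way to read the pyramid is as a routing function from observed state to notification channel. The sketch below assumes a particular mapping (offline partitions page, under-replicated partitions open a ticket, ISR shrinks go to Slack); that mapping and the name `BrokerAlertRouter` are illustrative choices, not a prescribed policy.

```java
// Hypothetical router: picks the highest tier of the pyramid that applies.
final class BrokerAlertRouter {
    enum Channel { PAGE, TICKET, SLACK, DASHBOARD }

    private BrokerAlertRouter() {}

    static Channel route(int offlinePartitions,
                         int underReplicatedPartitions,
                         double isrShrinksPerSec) {
        // Data is unavailable: customer impact, wake someone up.
        if (offlinePartitions > 0) {
            return Channel.PAGE;
        }
        // Durability is degraded but data is still served: next-shift work.
        if (underReplicatedPartitions > 0) {
            return Channel.TICKET;
        }
        // Replication is churning: worth knowing, not worth a ticket yet.
        if (isrShrinksPerSec > 0.0) {
            return Channel.SLACK;
        }
        // Nothing actionable: the value still shows up on dashboards.
        return Channel.DASHBOARD;
    }
}
```

The checks run from the top of the pyramid downward, so a cluster with both offline and under-replicated partitions produces a single page rather than a page plus a ticket.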

Prometheus Alerting Rules

[Code listing: YAML, 102 lines]

AlertManager Configuration

[Code listing: YAML, 55 lines]

Deep Dive: Grafana Dashboards

Key Dashboard Panels (JSON)

[Code listing: JSON, 81 lines]

Deep Dive: Operational Runbooks

Runbook: High Consumer Lag

┌──────────────────────────────────────────────────────────────────────────┐
│              RUNBOOK: HIGH CONSUMER LAG                                  │
└──────────────────────────────────────────────────────────────────────────┘

SYMPTOMS:
- Consumer lag metric > threshold (e.g., 100,000 messages)
- Processing latency increasing
- Data freshness complaints

DIAGNOSIS STEPS:

1. Check if consumer is running
   $ kafka-consumer-groups.sh --bootstrap-server <broker> \
       --describe --group <group-id>

   Look for:
   - CONSUMER-ID column: Should show active consumers
   - LAG column: Per-partition lag
   - LOG-END-OFFSET vs CURRENT-OFFSET: Gap = lag

2. Check consumer health
   $ curl http://<app>:8080/actuator/health

   Look for:
   - Kafka health indicator status
   - Thread pool exhaustion
   - Memory pressure

3. Check processing rate
   $ curl http://<app>:8080/actuator/prometheus | grep kafka_consumer

   Look for:
   - records-consumed-rate: Should be > 0
   - poll-rate: Should be consistent
   - commit-rate: Should be > 0

4. Check for rebalancing
   $ kafka-consumer-groups.sh --bootstrap-server <broker> \
       --describe --group <group-id> --state

   Look for:
   - State: Should be "Stable", not "PreparingRebalance"

5. Check downstream dependencies
   - Database response times
   - External API latency
   - Queue depths

REMEDIATION:

If consumer stopped:
- Check logs for exceptions
- Restart consumer application
- Verify connectivity to Kafka

If processing slow:
- Increase consumer concurrency (more partitions/consumers)
- Optimize processing logic
- Scale downstream services
- Consider batch processing

If continuous rebalancing:
- Increase session.timeout.ms
- Increase max.poll.interval.ms
- Reduce max.poll.records
- Check for consumer crashes

If producer rate spike:
- Scale consumer instances
- Enable consumer batching
- Consider backpressure

ESCALATION:
- If lag continues growing after 30 minutes: Page on-call
- If data loss suspected: Notify data team
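
When lag is high, a quick estimate of how long the backlog will take to clear helps decide between waiting and scaling. A minimal sketch, assuming steady consume and produce rates (the class and method names are hypothetical):

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper: estimates how long a consumer needs to clear its backlog.
final class LagDrainEstimate {
    private LagDrainEstimate() {}

    // Returns the estimated drain time, or empty when the consumer is not
    // gaining on the producer (lag will stay flat or keep growing).
    static Optional<Duration> timeToDrain(long lagMessages,
                                          double consumeRatePerSec,
                                          double produceRatePerSec) {
        if (lagMessages <= 0) {
            return Optional.of(Duration.ZERO);
        }
        double netRatePerSec = consumeRatePerSec - produceRatePerSec;
        if (netRatePerSec <= 0) {
            return Optional.empty();
        }
        long seconds = (long) Math.ceil(lagMessages / netRatePerSec);
        return Optional.of(Duration.ofSeconds(seconds));
    }
}
```

With 100,000 messages of lag, 1,500 msg/s consumed and 1,000 msg/s produced, the net gain is 500 msg/s and the estimate is 200 seconds. An empty result is the signal that restarting alone will not help and capacity has to change.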

Runbook: Under-Replicated Partitions

┌──────────────────────────────────────────────────────────────────────────┐
│              RUNBOOK: UNDER-REPLICATED PARTITIONS                        │
└──────────────────────────────────────────────────────────────────────────┘

SYMPTOMS:
- UnderReplicatedPartitions metric > 0
- IsrShrinksPerSec increasing
- Produce latency may increase (with acks=all)

DIAGNOSIS STEPS:

1. Identify affected partitions
   $ kafka-topics.sh --bootstrap-server <broker> \
       --describe --under-replicated-partitions

2. Check broker health
   $ kafka-broker-api-versions.sh --bootstrap-server <broker>

   For each broker, check:
   - Is broker responding?
   - Response time normal?

3. Check broker logs
   $ tail -f /var/log/kafka/server.log | grep -E "(ERROR|WARN)"

   Look for:
   - Disk full errors
   - Network timeouts
   - OutOfMemory errors

4. Check disk space
   $ df -h /var/kafka-logs

   Should have > 20% free

5. Check replication lag
   $ kafka-replica-verification.sh --broker-list <brokers> \
       --topics-include ".*"

6. Check network between brokers
   $ ping <other-broker>
   $ nc -zv <other-broker> 9092

REMEDIATION:

If disk full:
- Delete old log segments (if retention allows)
- Expand disk
- Move partitions to other brokers

If broker down:
- Restart broker
- Check for hardware issues
- Replace broker if needed

If network issues:
- Check firewall rules
- Verify DNS resolution
- Contact network team

If broker overloaded:
- Reassign partitions to spread load
- Add more brokers
- Increase broker resources

VERIFICATION:
- UnderReplicatedPartitions should return to 0
- IsrExpandsPerSec should spike (replicas catching up)
- All partitions should show full ISR

ESCALATION:
- If affects > 10% of partitions: Page on-call
- If not resolved in 15 minutes: Page platform lead

Deep Dive: Capacity Planning

Sizing Formula

┌──────────────────────────────────────────────────────────────────────────┐
│                     CAPACITY PLANNING FORMULAS                           │
└──────────────────────────────────────────────────────────────────────────┘

DISK CAPACITY:

  Daily data = (messages/second) × (avg message size) × 86400 seconds

  Total disk = Daily data × retention days × replication factor × 1.2 (overhead)

  Example:
    1000 msg/s × 1KB × 86400 = 86.4 GB/day
    86.4 GB × 7 days × 3 replicas × 1.2 = 2.18 TB total


NETWORK BANDWIDTH:

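  Inbound = messages/second × avg message size × replication factor
            (producer writes plus follower replication, cluster-wide)
  Outbound = messages/second × avg message size × consumer group count
            (each consumer group reads the full stream)

  Example:
    Inbound: 1000 msg/s × 1KB × 3 = 3 MB/s
    Outbound: 1000 msg/s × 1KB × 5 consumer groups = 5 MB/s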


PARTITION COUNT:

  partitions = max(
    throughput_required / throughput_per_partition,
    consumer_instances
  )

  Rule of thumb: 1 partition can handle ~10 MB/s

  Example:
    50 MB/s throughput needed, 10 consumers
    partitions = max(50/10, 10) = 10 partitions


BROKER COUNT:

  brokers = max(
    total_partitions × replication_factor / partitions_per_broker,
    total_disk_needed / disk_per_broker,
    total_network / network_per_broker
  )

  Rule of thumb: 4000 partitions per broker max
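
The formulas above translate directly into code. This sketch assumes decimal units (1 KB = 1000 bytes, matching the 86.4 GB/day example) and uses hypothetical names throughout; the per-broker limits are inputs you would supply from your own hardware.

```java
// Hypothetical calculator for the sizing formulas; all sizes are in bytes.
final class CapacityPlanner {
    private static final int SECONDS_PER_DAY = 86_400;

    private CapacityPlanner() {}

    // Daily data = messages/second x average message size x 86400.
    static double dailyBytes(double messagesPerSec, double avgMessageBytes) {
        return messagesPerSec * avgMessageBytes * SECONDS_PER_DAY;
    }

    // Total disk = daily data x retention days x replication factor x overhead.
    static double totalDiskBytes(double dailyBytes, int retentionDays,
                                 int replicationFactor, double overheadFactor) {
        return dailyBytes * retentionDays * replicationFactor * overheadFactor;
    }

    // Partitions = max(required throughput / per-partition throughput, consumer instances).
    static int partitions(double requiredMBps, double perPartitionMBps, int consumerInstances) {
        int byThroughput = (int) Math.ceil(requiredMBps / perPartitionMBps);
        return Math.max(byThroughput, consumerInstances);
    }

    // Brokers = the largest of the partition, disk, and network constraints.
    static int brokers(int totalPartitions, int replicationFactor, int partitionsPerBroker,
                       double totalDiskBytes, double diskBytesPerBroker,
                       double totalNetworkBytesPerSec, double networkBytesPerSecPerBroker) {
        int byPartitions = (int) Math.ceil(
                (double) totalPartitions * replicationFactor / partitionsPerBroker);
        int byDisk = (int) Math.ceil(totalDiskBytes / diskBytesPerBroker);
        int byNetwork = (int) Math.ceil(totalNetworkBytesPerSec / networkBytesPerSecPerBroker);
        return Math.max(byPartitions, Math.max(byDisk, byNetwork));
    }
}
```

Plugging in the worked example, `dailyBytes(1000, 1000)` is 86.4 GB, `totalDiskBytes(..., 7, 3, 1.2)` is about 2.18 TB, and `partitions(50, 10, 10)` is 10. With an assumed 1 TB of usable disk per broker, disk is the binding constraint and `brokers` returns 3.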

Capacity Monitoring

[Code listing: JAVA, 57 lines]

Common Mistakes

1. Alerting on Causes, Not Symptoms

❌ WRONG: Alert on every metric deviation
   - Alert: "BytesIn increased 50%"
   - Alert: "Request latency up 20%"
   - Result: Alert fatigue, ignored alerts

✅ RIGHT: Alert on customer impact
   - Alert: "Consumer lag > threshold" (freshness impact)
   - Alert: "Offline partitions" (availability impact)
   - Alert: "Producer error rate > 1%" (data loss risk)

2. Missing Consumer Lag Monitoring

❌ WRONG: Only monitor application health
   - Consumer health: UP
   - No errors in logs
   - Result: 2 hours behind, nobody knows

✅ RIGHT: Dedicated lag monitoring
   - Burrow or Kafka Lag Exporter
   - Per-partition lag visibility
   - Alerts on growing lag, not just high lag

3. Too Many Dashboards, No Hierarchy

❌ WRONG: One giant dashboard with everything
   - 50 panels
   - Takes 30 seconds to load
   - Nobody knows where to look

✅ RIGHT: Dashboard hierarchy
   - L1: Executive overview (4-6 panels)
   - L2: Service-level view (topic/group focused)
   - L3: Deep-dive troubleshooting

4. Not Monitoring the Monitoring

❌ WRONG: Assume monitoring always works
   - JMX exporter crashes
   - Prometheus storage full
   - Nobody knows until outage

✅ RIGHT: Meta-monitoring
   - Alert on missing metrics
   - Prometheus storage alerts
   - Exporter health checks
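
A simple form of "alert on missing metrics" is a staleness check over the last successful scrape time of each target. The sketch below assumes those timestamps are already available as a map; the class name and the idea of keying by target name are placeholders for whatever your monitoring stack exposes.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical meta-monitoring check: which targets have gone quiet?
final class StaleTargetCheck {
    private StaleTargetCheck() {}

    // Returns the names of targets whose last successful scrape is older
    // than maxAge, sorted alphabetically for stable alert text.
    static List<String> staleTargets(Map<String, Instant> lastSuccessfulScrape,
                                     Instant now,
                                     Duration maxAge) {
        return lastSuccessfulScrape.entrySet().stream()
                .filter(e -> Duration.between(e.getValue(), now).compareTo(maxAge) > 0)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

A threshold of a few scrape intervals avoids flagging a target for a single missed scrape while still catching a crashed exporter within minutes.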

Debug This: The Invisible Consumer

A consumer group shows as "Stable" with active members, but messages are piling up:

$ kafka-consumer-groups.sh --bootstrap-server <broker> \
    --describe --group order-processor

GROUP           TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-processor orders    0          1000            50000           49000
order-processor orders    1          1000            50000           49000
order-processor orders    2          1000            50000           49000

CONSUMER-ID                                    HOST
order-consumer-1-xxx                           /10.0.0.5
order-consumer-2-xxx                           /10.0.0.6
order-consumer-3-xxx                           /10.0.0.7

What's happening?

Answer

The Diagnosis:

Look at CURRENT-OFFSET: It's exactly 1000 for all partitions. This means:

  1. Consumers connected and got initial offsets
  2. Offsets haven't moved since

Possible causes:

  1. Processing blocked - First message throws exception, consumer stuck retrying:
    [Code listing: JAVA, 2 lines]
  2. Commits failing silently - Consumer processes but can't commit:
    [Code listing: JAVA, 2 lines]
  3. Wrong deserializer - Messages fail to deserialize:
    [Code listing: JAVA, 2 lines]
  4. Consumer paused - Deliberately paused and never resumed:
    [Code listing: JAVA]

How to investigate:

[Code listing: BASH, 10 lines]

The fix depends on cause, but often: the consumer is stuck on a poison message. Add a dead letter queue or skip mechanism.
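
The "offsets haven't moved" symptom can be detected automatically by comparing two snapshots of the same group taken some time apart. A minimal sketch, assuming the snapshots are supplied as maps from partition number to offsets (the `StalledPartitionDetector` and `Snapshot` types are hypothetical):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical detector: finds partitions where commits stopped but a backlog exists.
final class StalledPartitionDetector {
    // Offsets of one partition at one point in time.
    static final class Snapshot {
        final long committedOffset;
        final long logEndOffset;

        Snapshot(long committedOffset, long logEndOffset) {
            this.committedOffset = committedOffset;
            this.logEndOffset = logEndOffset;
        }
    }

    private StalledPartitionDetector() {}

    // A partition is reported when its committed offset did not change between
    // the two snapshots and it still has unconsumed messages in the later one.
    static List<Integer> stalled(Map<Integer, Snapshot> earlier, Map<Integer, Snapshot> later) {
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, Snapshot> entry : later.entrySet()) {
            Snapshot before = earlier.get(entry.getKey());
            Snapshot after = entry.getValue();
            if (before == null) {
                continue; // partition not seen earlier, nothing to compare
            }
            boolean noProgress = after.committedOffset == before.committedOffset;
            boolean hasBacklog = after.logEndOffset > after.committedOffset;
            if (noProgress && hasBacklog) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

An idle partition with no backlog is deliberately not reported, which keeps quiet topics from triggering the check. The interval between snapshots should be longer than the slowest expected commit interval.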

Exercises

Exercise 1: Build a Lag Dashboard

Create a Grafana dashboard that shows:

  • Total lag per consumer group (gauge)
  • Lag trend over time (graph)
  • Lag by partition heatmap
  • Alert thresholds visually marked

Exercise 2: Implement Custom Health Check

Build a Spring Boot health indicator that:

  • Checks all assigned partitions
  • Verifies consumer is within N messages of head
  • Reports partition-level details
  • Returns DOWN if any partition has critical lag

Exercise 3: Create Alerting Rules

Design Prometheus alerting rules for:

  • Consumer stopped (no commits in 10 minutes)
  • Producer backpressure (buffer utilization > 80%)
  • Broker disk usage > 75%
  • Request latency P99 > 500ms

Exercise 4: Build Capacity Report

Create a scheduled job that:

  • Calculates current disk usage per topic
  • Projects when retention will be exceeded
  • Reports partition distribution across brokers
  • Warns on imbalanced partitions

Exercise 5: Operational Runbook

Write a runbook for:

  • Consumer repeatedly rebalancing
  • Include diagnostic commands
  • Decision tree for root causes
  • Remediation steps for each cause

Interview Questions

Q1: "How would you detect a slow consumer before it impacts users?"

What they're looking for: Understanding of proactive monitoring
Strong answer: "I'd implement multiple detection layers:
  1. Lag velocity monitoring - Alert not just on high lag, but on growing lag:
    deriv(kafka_consumer_lag[15m]) > threshold
    

    This catches problems before lag becomes critical.

  2. Processing rate comparison - Compare consumer throughput to producer:
    rate(consumed) < rate(produced) * 0.8
    

    If consuming less than 80% of production rate, we're falling behind.

  3. Commit frequency - Monitor commits per minute:
    rate(commits) < expected_rate
    

    Sudden drop indicates processing slowdown.

  4. End-to-end latency - Measure time from produce to consume complete:
    • Embed timestamp in message
    • Measure on consumption
    • Alert on P99 latency increase
The key is detecting the trend before it becomes a problem, not waiting for absolute thresholds."
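
For intuition about what the lag velocity check computes: Prometheus's `deriv` fits a simple linear regression over the samples in the range and returns its slope per second. The sketch below does the same calculation by hand in Java; `LagTrend` and its method names are made up for this example.

```java
// Hypothetical helper: least-squares slope of consumer lag over time.
final class LagTrend {
    private LagTrend() {}

    // Returns the fitted slope in messages per second.
    // Positive means lag is growing, negative means the consumer is catching up.
    static double slopePerSecond(long[] epochSeconds, long[] lag) {
        if (epochSeconds.length != lag.length || lag.length < 2) {
            throw new IllegalArgumentException("need at least two paired samples");
        }
        int n = lag.length;
        double meanTime = 0.0;
        double meanLag = 0.0;
        for (int i = 0; i < n; i++) {
            meanTime += epochSeconds[i];
            meanLag += lag[i];
        }
        meanTime /= n;
        meanLag /= n;

        double covariance = 0.0;
        double timeVariance = 0.0;
        for (int i = 0; i < n; i++) {
            double dt = epochSeconds[i] - meanTime;
            covariance += dt * (lag[i] - meanLag);
            timeVariance += dt * dt;
        }
        if (timeVariance == 0.0) {
            throw new IllegalArgumentException("timestamps must not all be equal");
        }
        return covariance / timeVariance;
    }

    // True when lag grows faster than the given threshold (messages per second).
    static boolean isGrowing(double slopePerSecond, double thresholdPerSecond) {
        return slopePerSecond > thresholdPerSecond;
    }
}
```

Samples of 1000, 1600, 2200, and 2800 messages taken one minute apart give a slope of 10 messages per second. Fitting a line rather than subtracting the first sample from the last makes the result less sensitive to a single noisy scrape.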

Q2: "Your monitoring shows UnderReplicatedPartitions > 0 but no offline partitions. What do you check?"

What they're looking for: Systematic troubleshooting approach
Strong answer: "Under-replicated without offline means leaders are healthy but followers are behind. I'd check:
  1. Which broker's replicas are behind:
    [Code listing: BASH]

    If all from one broker = that broker has issues.

  2. Broker resource constraints:
    • Disk I/O: Follower can't write fast enough
    • Network: Can't fetch from leader fast enough
    • CPU: If compressed, decompression bottleneck
  3. Replication lag metric:
    kafka_server_replica_fetcher_max_lag
    

    Shows how far behind the slowest follower is, in messages.

  4. Network between brokers:
    • Latency spikes
    • Packet loss
    • Bandwidth saturation
  5. Leader distribution:
    • One broker leader-heavy = unbalanced fetch load
    • Run preferred replica election if needed.

If it's transient (< 5 minutes) during rolling restart or deployment, usually okay. Sustained means real issue."
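
Step 1 of that answer, finding which broker's replicas are behind, amounts to counting brokers that appear in a partition's replica list but not in its ISR. A sketch under the assumption that the replica and ISR lists have already been parsed (for example from `kafka-topics.sh --describe` output); the types are hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical analysis: which brokers hold the out-of-sync replicas?
final class IsrAnalysis {
    // Replica assignment and current in-sync set for one partition, as broker ids.
    static final class PartitionState {
        final List<Integer> replicas;
        final List<Integer> isr;

        PartitionState(List<Integer> replicas, List<Integer> isr) {
            this.replicas = replicas;
            this.isr = isr;
        }
    }

    private IsrAnalysis() {}

    // For each broker id, counts the partitions where it is an assigned
    // replica but missing from the ISR. Brokers with no such partitions are omitted.
    static Map<Integer, Integer> outOfSyncCountByBroker(List<PartitionState> partitions) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (PartitionState partition : partitions) {
            for (Integer broker : partition.replicas) {
                if (!partition.isr.contains(broker)) {
                    counts.merge(broker, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

If the resulting map has a single key, that broker is the place to look first. Counts spread across several brokers point instead at something shared, such as the network or an overloaded leader.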

Q3: "How do you size a Kafka cluster for a new use case?"

What they're looking for: Capacity planning methodology
Strong answer: "I'd gather requirements and calculate:
Input requirements:
  • Expected message throughput (msg/sec)
  • Average message size
  • Peak vs. average ratio
  • Retention period
  • Replication factor
  • Consumer group count
Calculations:
Daily volume = throughput × msg_size × 86400
Total storage = daily × retention × RF × 1.5 (safety margin)

Network in = throughput × msg_size × RF
Network out = throughput × msg_size × consumer_groups

Partitions = max(
  target_throughput / 10MB_per_partition,
  consumer_parallelism
)
Broker sizing:
  • Start with 3 brokers (minimum for HA)
  • Each broker: 4000 partitions max
  • Consider: disk IOPS, network bandwidth, memory for page cache
I'd also:
  • Provision 30% overhead for spikes
  • Plan for 2x growth in 12 months
  • Set up monitoring from day 1 to validate assumptions

Then iterate based on actual metrics."

Q4: "What metrics would you look at during a Kafka outage?"

What they're looking for: Incident response prioritization
Strong answer: "In order of priority:
Tier 1 - Availability (first 30 seconds):
  • OfflinePartitionsCount: Any data unavailable?
  • ActiveControllerCount: Is there a controller?
  • Broker up/down status
Tier 2 - Scope (next minute):
  • Which topics/partitions affected?
  • Which consumer groups impacted?
  • UnderReplicatedPartitions trend
Tier 3 - Root cause (next 5 minutes):
  • Recent deployments or changes
  • Broker logs (OutOfMemory, disk full, network)
  • ISR shrink/expand rate (replication issues)
  • Request latency by type (Produce, Fetch)
Tier 4 - Impact assessment:
  • Producer error rates (data loss risk)
  • Consumer lag (freshness impact)
  • Downstream service health
I'd also check the timing - did it start at a round time (cron job), deployment time (change), or random (hardware/network)?"

Q5: "How would you implement end-to-end latency monitoring for Kafka?"

What they're looking for: Understanding of distributed tracing
Strong answer: "End-to-end latency requires correlation from producer to final consumer processing:
Option 1: Message headers
[Code listing: JAVA, 8 lines]
Option 2: Interceptors
[Code listing: JAVA, 7 lines]
Option 3: Distributed tracing (OpenTelemetry)
  • Propagate trace context in headers
  • Spans for produce, broker, consume, process
  • Visualize in Jaeger/Zipkin
Metrics to track:
  • P50, P95, P99 end-to-end latency
  • Breakdown: network vs. queue time vs. processing
  • Per-topic and per-consumer-group

The challenge is clock synchronization - use NTP or embed offsets rather than absolute times when possible."
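
The measurement side of Option 1 can be sketched without any Kafka dependency: subtract the embedded produce timestamp from the completion time, then summarize with nearest-rank percentiles. The names below are illustrative, and clamping negative values to zero is an assumed way to absorb small clock skew, not a substitute for synchronized clocks.

```java
import java.util.Arrays;

// Hypothetical helper: end-to-end latency samples and their percentiles.
final class EndToEndLatency {
    private EndToEndLatency() {}

    // Milliseconds between the produce timestamp carried with the message
    // and the moment processing finished. Negative results (consumer clock
    // behind producer clock) are clamped to 0.
    static long latencyMs(long producedAtEpochMs, long processedAtEpochMs) {
        return Math.max(0L, processedAtEpochMs - producedAtEpochMs);
    }

    // Nearest-rank percentile for p in (0, 100]; the input array is not modified.
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

In a running service these samples would normally go into a histogram or timer in the metrics library rather than an in-memory array; the array form is only meant to make the percentile definition concrete.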

Summary & Key Takeaways

The Monitoring Hierarchy

┌──────────────────────────────────────────────────────────────────────────┐
│                    KAFKA MONITORING PRIORITIES                           │
├────────────────────────────────┬─────────────────────────────────────────┤
│ Priority                       │ Metrics                                 │
├────────────────────────────────┼─────────────────────────────────────────┤
│ 1. Availability                │ OfflinePartitions, ActiveController     │
│    (Is data accessible?)       │                                         │
├────────────────────────────────┼─────────────────────────────────────────┤
│ 2. Durability                  │ UnderReplicated, ISR changes            │
│    (Will data survive?)        │                                         │
├────────────────────────────────┼─────────────────────────────────────────┤
│ 3. Freshness                   │ Consumer lag, processing rate           │
│    (Is data current?)          │                                         │
├────────────────────────────────┼─────────────────────────────────────────┤
│ 4. Throughput                  │ Bytes in/out, request rate              │
│    (Is capacity sufficient?)   │                                         │
├────────────────────────────────┼─────────────────────────────────────────┤
│ 5. Latency                     │ Request times, E2E latency              │
│    (Is it fast enough?)        │                                         │
└────────────────────────────────┴─────────────────────────────────────────┘

Essential Takeaways

  1. Consumer lag is king - A healthy-looking consumer can be hours behind. Always monitor lag.
  2. Alert on symptoms, not causes - Page on customer impact (offline partitions, growing lag), not internal metrics (ISR shrinks).
  3. Use dedicated lag tools - Burrow or Kafka Lag Exporter provide better visibility than application metrics alone.
  4. Build dashboard hierarchy - Executive overview → Service view → Deep-dive troubleshooting.
  5. Runbooks save incidents - Document diagnostic steps and remediation for common issues before they happen.
  6. Capacity planning is continuous - Monitor trends and project forward, don't wait for alerts.

Quick Reference

┌──────────────────────────────────────────────────────────────────────────┐
│                      MONITORING QUICK REFERENCE                          │
├────────────────────────────────┬─────────────────────────────────────────┤
│ Tool                           │ Purpose                                 │
├────────────────────────────────┼─────────────────────────────────────────┤
│ JMX Exporter                   │ Broker metrics → Prometheus             │
│ kafka-lag-exporter             │ Consumer lag → Prometheus               │
│ Burrow                         │ Consumer lag + health analysis          │
│ Micrometer                     │ Spring app metrics → Prometheus         │
│ Grafana                        │ Visualization and alerting              │
├────────────────────────────────┼─────────────────────────────────────────┤
│ Key Commands                   │                                         │
├────────────────────────────────┼─────────────────────────────────────────┤
│ Consumer group status          │ kafka-consumer-groups.sh --describe     │
│ Under-replicated partitions    │ kafka-topics.sh --describe              │
│                                │   --under-replicated-partitions         │
│ Topic details                  │ kafka-topics.sh --describe              │
│ Broker API check               │ kafka-broker-api-versions.sh            │
└────────────────────────────────┴─────────────────────────────────────────┘

Series Navigation

Series Overview: Kafka Compendium Series