Cheatsheet & Decision Guide

At a Glance

This is your quick reference for everything covered in the Kafka Compendium series. Bookmark this page for:

  • Configuration cheatsheets
  • CLI command reference
  • Decision trees for common choices
  • Troubleshooting guides
  • Spring Kafka quick reference
  • Production checklist

Producer Configuration Cheatsheet

Essential Producer Settings

[Code listing: properties, 28 lines]

Producer Profiles

┌──────────────────────────────────────────────────────────────────────────┐
│                    PRODUCER CONFIGURATION PROFILES                       │
├─────────────────────┬─────────────────────┬──────────────────────────────┤
│ Setting             │ High Throughput     │ High Reliability             │
├─────────────────────┼─────────────────────┼──────────────────────────────┤
│ acks                │ 1                   │ all                          │
│ batch.size          │ 65536 (64KB)        │ 16384 (16KB)                 │
│ linger.ms           │ 20-100              │ 0-5                          │
│ compression.type    │ lz4 or zstd         │ lz4                          │
│ buffer.memory       │ 67108864 (64MB)     │ 33554432 (32MB)              │
│ enable.idempotence  │ false (optional)    │ true                         │
│ retries             │ 3                   │ MAX_INT                      │
└─────────────────────┴─────────────────────┴──────────────────────────────┘
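The two profiles in the table can be written down as plain `java.util.Properties` objects. This is a minimal sketch, not a complete producer setup: the class and method names are made up for illustration, single values are picked from the table's ranges (`linger.ms` 20 and 5, `lz4` for both), and `bootstrap.servers` and the serializers are left out, so they would have to be added before the properties are handed to a `KafkaProducer`.

```java
import java.util.Properties;

/** Hypothetical helper: the two producer profiles from the table as Properties. */
final class ProducerProfiles {

    // Throughput-oriented: larger batches, longer linger, leader-only acks.
    static Properties highThroughput() {
        Properties p = new Properties();
        p.setProperty("acks", "1");
        p.setProperty("batch.size", "65536");        // 64 KB
        p.setProperty("linger.ms", "20");            // table range: 20-100
        p.setProperty("compression.type", "lz4");    // zstd is the other option listed
        p.setProperty("buffer.memory", "67108864");  // 64 MB
        p.setProperty("enable.idempotence", "false");
        p.setProperty("retries", "3");
        return p;
    }

    // Reliability-oriented: all in-sync replicas acknowledge, idempotent retries.
    static Properties highReliability() {
        Properties p = new Properties();
        p.setProperty("acks", "all");
        p.setProperty("batch.size", "16384");        // 16 KB
        p.setProperty("linger.ms", "5");             // table range: 0-5
        p.setProperty("compression.type", "lz4");
        p.setProperty("buffer.memory", "33554432");  // 32 MB
        p.setProperty("enable.idempotence", "true");
        p.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        return p;
    }

    public static void main(String[] args) {
        // Print the reliability profile as key=value lines (order is not guaranteed).
        highReliability().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Keeping the profiles in code makes it easy to start from one of them and override a single setting per application.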

Consumer Configuration Cheatsheet

Essential Consumer Settings

[Code listing: properties, 33 lines]

Consumer Timeout Decision

┌──────────────────────────────────────────────────────────────────────────┐
│                    CONSUMER TIMEOUT GUIDE                                │
└──────────────────────────────────────────────────────────────────────────┘

  session.timeout.ms = 45000 (45s)
  ├── Heartbeat-based detection
  ├── Consumer marked dead if no heartbeat
  └── Triggers rebalance

  heartbeat.interval.ms = 3000 (3s)
  ├── Should be at most 1/3 of session.timeout.ms
  └── Background thread, not affected by processing

  max.poll.interval.ms = 300000 (5min)
  ├── Processing-based detection
  ├── Consumer marked dead if poll() is not called within this interval
  └── Set based on max processing time

  RULE: max.poll.interval.ms > max_processing_time_per_batch
        heartbeat.interval.ms ≤ session.timeout.ms / 3
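The timeout rules can be turned into a small sanity check. The sketch below is illustrative only: `ConsumerTimeoutCheck` and its parameters are hypothetical names, the heartbeat rule is read as "at most one third of the session timeout", and the worst-case batch processing time is something you would have to measure yourself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: checks consumer timeouts against the rules of thumb above. */
final class ConsumerTimeoutCheck {

    static List<String> check(long sessionTimeoutMs, long heartbeatIntervalMs,
                              long maxPollIntervalMs, long maxBatchProcessingMs) {
        List<String> problems = new ArrayList<>();
        // Several heartbeats should fit into one session timeout window.
        if (heartbeatIntervalMs * 3 > sessionTimeoutMs) {
            problems.add("heartbeat.interval.ms should be at most 1/3 of session.timeout.ms");
        }
        // poll() has to be called again before max.poll.interval.ms elapses.
        if (maxPollIntervalMs <= maxBatchProcessingMs) {
            problems.add("max.poll.interval.ms must exceed the slowest batch processing time");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Values from the guide, with an assumed worst-case batch time of 2 minutes.
        System.out.println(check(45_000, 3_000, 300_000, 120_000)); // prints []
    }
}
```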

Kafka Streams Configuration Cheatsheet

[Code listing: properties, 29 lines]

CLI Commands Reference

Topic Management

[Code listing: bash, 26 lines]

Consumer Groups

[Code listing: bash, 32 lines]

Console Producer/Consumer

[Code listing: bash, 17 lines]

ACLs

[Code listing: bash, 22 lines]

Decision Trees

When to Use Kafka vs Alternatives

┌──────────────────────────────────────────────────────────────────────────┐
│                    MESSAGING SYSTEM DECISION                             │
└──────────────────────────────────────────────────────────────────────────┘

Start: What's your primary use case?

├── Point-to-point messaging with routing?
│   └── Consider: RabbitMQ, ActiveMQ
│
├── Real-time event streaming with replay?
│   └── Consider: KAFKA ✓
│
├── Pub/sub with message filtering?
│   ├── Need replay/persistence? → KAFKA ✓
│   └── Fire-and-forget ok? → Redis Pub/Sub, SNS
│
├── Stream processing with state?
│   └── Consider: Kafka Streams, Flink, Spark
│
└── Simple task queue?
    └── Consider: Redis, SQS, Celery

Partition Count Decision

┌──────────────────────────────────────────────────────────────────────────┐
│                    HOW MANY PARTITIONS?                                  │
└──────────────────────────────────────────────────────────────────────────┘

Step 1: Determine parallelism requirement
        max_consumers = max(current_consumers, expected_consumers)

Step 2: Calculate throughput requirement
        partitions_for_throughput = target_throughput / throughput_per_partition
        (Rule of thumb: 10 MB/s per partition)

Step 3: Consider ordering requirements
        If strict global ordering needed: partitions = 1
        If per-key ordering is enough: partitions can be higher
        (records with the same key go to the same partition)

Step 4: Final calculation
        partitions = max(max_consumers, partitions_for_throughput)

Step 5: Add growth buffer
        final_partitions = partitions × 1.5

CONSTRAINTS:
• Rule of thumb: at most ~4000 partitions per broker
• More partitions = longer recovery time
• Can only increase, not decrease (increasing changes the key-to-partition mapping)
• Start conservative, increase later
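Steps 1, 2, 4 and 5 are plain arithmetic, so they can be sketched as one function. `PartitionPlanner` is a hypothetical name, the 10 MB/s per-partition figure is the rule of thumb from Step 2 passed in as a parameter, and the ordering consideration from Step 3 is not modeled here.

```java
/** Hypothetical helper: partition count estimate following the steps above. */
final class PartitionPlanner {

    static int partitions(int currentConsumers, int expectedConsumers,
                          double targetMBps, double perPartitionMBps,
                          double growthFactor) {
        if (perPartitionMBps <= 0) {
            throw new IllegalArgumentException("perPartitionMBps must be positive");
        }
        // Step 1: parallelism requirement.
        int maxConsumers = Math.max(currentConsumers, expectedConsumers);
        // Step 2: throughput requirement, rounded up to whole partitions.
        int forThroughput = (int) Math.ceil(targetMBps / perPartitionMBps);
        // Step 4: take the larger of the two.
        int base = Math.max(maxConsumers, forThroughput);
        // Step 5: growth buffer, rounded up.
        return (int) Math.ceil(base * growthFactor);
    }

    public static void main(String[] args) {
        // 6 consumers today, 8 expected, 120 MB/s target, 10 MB/s per partition.
        System.out.println(partitions(6, 8, 120.0, 10.0, 1.5)); // prints 18
    }
}
```

In this example throughput dominates: 12 partitions cover 120 MB/s, and the 1.5 growth buffer brings the result to 18.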

Delivery Guarantee Decision

┌──────────────────────────────────────────────────────────────────────────┐
│                    DELIVERY GUARANTEE CHOICE                             │
└──────────────────────────────────────────────────────────────────────────┘

                    ┌─────────────────────┐
                    │ Can you lose        │
                    │ messages?           │
                    └──────────┬──────────┘
                               │
             ┌─────────────────┴─────────────────┐
             │ YES                               │ NO
             ▼                                   ▼
    ┌─────────────────┐               ┌─────────────────┐
    │ AT-MOST-ONCE    │               │ Can consumer    │
    │                 │               │ handle          │
    │ acks=0          │               │ duplicates?     │
    │ Fast, may lose  │               └────────┬────────┘
    └─────────────────┘                        │
                                 ┌─────────────┴─────────────┐
                                 │ YES                       │ NO
                                 ▼                           ▼
                        ┌─────────────────┐        ┌─────────────────┐
                        │ AT-LEAST-ONCE   │        │ EXACTLY-ONCE    │
                        │                 │        │                 │
                        │ acks=all        │        │ Transactions +  │
                        │ retries=MAX     │        │ Idempotent      │
                        │ May duplicate   │        │ consumer        │
                        └─────────────────┘        └─────────────────┘
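The decision tree above boils down to two yes/no questions. A minimal sketch, with hypothetical class and method names and hint strings taken from the boxes in the diagram:

```java
/** Hypothetical helper: the delivery guarantee decision tree as code. */
final class DeliveryGuaranteeChooser {

    enum Guarantee { AT_MOST_ONCE, AT_LEAST_ONCE, EXACTLY_ONCE }

    static Guarantee choose(boolean canLoseMessages, boolean consumerHandlesDuplicates) {
        if (canLoseMessages) {
            return Guarantee.AT_MOST_ONCE;
        }
        return consumerHandlesDuplicates ? Guarantee.AT_LEAST_ONCE : Guarantee.EXACTLY_ONCE;
    }

    // Settings named in the diagram for each outcome.
    static String hint(Guarantee guarantee) {
        switch (guarantee) {
            case AT_MOST_ONCE:  return "acks=0";
            case AT_LEAST_ONCE: return "acks=all, retries=" + Integer.MAX_VALUE;
            default:            return "transactions + idempotent consumer";
        }
    }

    public static void main(String[] args) {
        Guarantee g = choose(false, true);
        System.out.println(g + ": " + hint(g)); // prints AT_LEAST_ONCE: acks=all, retries=2147483647
    }
}
```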

KStream vs KTable Decision

┌──────────────────────────────────────────────────────────────────────────┐
│                    KSTREAM VS KTABLE                                     │
└──────────────────────────────────────────────────────────────────────────┘

                    ┌─────────────────────┐
                    │ Do you need every   │
                    │ record/event?       │
                    └──────────┬──────────┘
                               │
             ┌─────────────────┴─────────────────┐
             │ YES                               │ NO (only latest)
             ▼                                   ▼
    ┌─────────────────┐               ┌─────────────────┐
    │    KSTREAM      │               │    KTABLE       │
    │                 │               │                 │
    │ • Event log     │               │ • Current state │
    │ • Transactions  │               │ • Reference data│
    │ • Actions       │               │ • Aggregations  │
    │ • Notifications │               │ • Lookups       │
    └─────────────────┘               └─────────────────┘
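The difference between "every record" and "only latest" can be illustrated without Kafka Streams at all. The sketch below uses plain collections to mimic the two views of the same keyed events; it is an analogy, not the Kafka Streams API, and the class name and sample data are invented for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical demo: stream view vs table view of the same keyed events. */
final class StreamVsTableDemo {

    static Map.Entry<String, String> event(String key, String value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Made-up sample data: user -> city updates, in arrival order.
    static List<Map.Entry<String, String>> sample() {
        return Arrays.asList(
                event("alice", "Paris"), event("bob", "Rome"), event("alice", "Berlin"));
    }

    // KStream-like view: every event is kept as its own record.
    static List<String> asStream(List<Map.Entry<String, String>> events) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : events) {
            out.add(e.getKey() + "=" + e.getValue());
        }
        return out;
    }

    // KTable-like view: a later event for a key overwrites the earlier one.
    static Map<String, String> asTable(List<Map.Entry<String, String>> events) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : events) {
            latest.put(e.getKey(), e.getValue());
        }
        return latest;
    }

    public static void main(String[] args) {
        System.out.println(asStream(sample())); // prints [alice=Paris, bob=Rome, alice=Berlin]
        System.out.println(asTable(sample()));  // prints {alice=Berlin, bob=Rome}
    }
}
```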

Spring Kafka Quick Reference

Basic Configuration

[Code listing: yaml, 19 lines]

Common Annotations

[Code listing: java, 34 lines]

Sending Messages

[Code listing: java, 25 lines]

Troubleshooting Guide

Consumer Lag Growing

┌──────────────────────────────────────────────────────────────────────────┐
│                    CONSUMER LAG GROWING                                  │
└──────────────────────────────────────────────────────────────────────────┘

1. Check if consumer is running
   $ kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
       --describe --group my-group
   Look for: CONSUMER-ID column ("-" means no active member owns the partition)

2. Check consumer health
   $ curl http://app:8080/actuator/health/kafka
   (requires a custom Kafka HealthIndicator; Spring Boot does not ship one)

3. Check processing rate
   - Compare records-consumed-rate vs production rate
   - If consumption < production → need more consumers or faster processing

4. Check for rebalancing
   - State should be "Stable" (add --state to the --describe command above)
   - Frequent rebalances → unstable consumers

5. Check downstream dependencies
   - Database slow?
   - External API latency?

SOLUTIONS:
• Add more consumers (up to partition count)
• Increase max.poll.records
• Optimize processing logic
• Scale downstream services
• Consider async processing
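Step 3 and the first solution can be combined into a rough sizing estimate. This is a sketch under simplifying assumptions: rates are steady, every consumer processes at the same measured rate, and load is spread evenly across partitions. `LagScaling` and its parameter names are hypothetical.

```java
/** Hypothetical helper: rough consumer sizing from production and consumption rates. */
final class LagScaling {

    // Lag grows whenever the group consumes more slowly than producers write.
    static boolean lagGrowing(double producedPerSec, double consumedPerSec) {
        return consumedPerSec < producedPerSec;
    }

    // Consumers needed to keep up, capped at the partition count because
    // extra instances beyond that would sit idle.
    static int consumersNeeded(double producedPerSec, double perConsumerPerSec,
                               int partitionCount) {
        if (perConsumerPerSec <= 0) {
            throw new IllegalArgumentException("perConsumerPerSec must be positive");
        }
        int needed = (int) Math.ceil(producedPerSec / perConsumerPerSec);
        return Math.min(Math.max(needed, 1), partitionCount);
    }

    public static void main(String[] args) {
        // 5000 records/s produced, each consumer handles about 800 records/s.
        System.out.println(consumersNeeded(5000, 800, 12)); // prints 7
        System.out.println(consumersNeeded(5000, 800, 4));  // prints 4
    }
}
```

When the result is capped by the partition count, as in the second call, adding consumers alone will not clear the lag; faster processing or more partitions would be needed.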

Consumer Keeps Rebalancing

┌──────────────────────────────────────────────────────────────────────────┐
│                    FREQUENT REBALANCES                                   │
└──────────────────────────────────────────────────────────────────────────┘

SYMPTOMS:
• Consumer group state flips between "PreparingRebalance" and "Stable"
• Processing stops during rebalances
• Logs show "Revoking previously assigned partitions"

CAUSES & FIXES:

1. Processing too slow (max.poll.interval.ms exceeded)
   FIX: Increase max.poll.interval.ms
        OR reduce max.poll.records
        OR optimize processing

2. Consumer crashes/restarts
   FIX: Fix application stability
        Consider static membership (group.instance.id)

3. Session timeout too aggressive
   FIX: Increase session.timeout.ms
        Ensure heartbeat.interval.ms ≤ session.timeout.ms / 3

4. JVM GC pauses
   FIX: Tune GC settings
        Increase heap
        Use G1GC

5. Network issues
   FIX: Check network stability
        Increase request.timeout.ms
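For cause 1, a quick way to pick a `max.poll.records` value is to budget the poll interval against the worst-case time per record. The sketch assumes records are processed one after another and that you have measured a worst-case per-record time; `PollBudget` and the safety factor are illustrative choices, not Kafka settings.

```java
/** Hypothetical helper: largest max.poll.records that fits the poll interval. */
final class PollBudget {

    static int maxSafePollRecords(long maxPollIntervalMs, long worstCasePerRecordMs,
                                  double safetyFactor) {
        if (worstCasePerRecordMs <= 0) {
            throw new IllegalArgumentException("worstCasePerRecordMs must be positive");
        }
        // Only spend part of the interval so GC pauses or slow calls do not
        // push the consumer past max.poll.interval.ms.
        long budgetMs = (long) (maxPollIntervalMs * safetyFactor);
        long records = Math.max(1, budgetMs / worstCasePerRecordMs);
        return (int) Math.min(Integer.MAX_VALUE, records);
    }

    public static void main(String[] args) {
        // 5 minute poll interval, 1 second worst case per record, use half the budget.
        System.out.println(maxSafePollRecords(300_000, 1_000, 0.5)); // prints 150
    }
}
```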

Producer Send Failures

┌──────────────────────────────────────────────────────────────────────────┐
│                    PRODUCER SEND FAILURES                                │
└──────────────────────────────────────────────────────────────────────────┘

TimeoutException:
• Check broker connectivity
• Increase delivery.timeout.ms
• Check broker load

RecordTooLargeException:
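• Record exceeds producer max.request.size, broker message.max.bytes,
  or topic max.message.bytes (each about 1MB by default)
• Raise the limits consistently on producer, broker, and topic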
• OR compress messages
• OR reduce message size

BufferExhaustedException:
• buffer.memory full and send() blocked longer than max.block.ms
• Increase buffer.memory
• OR slow down production
• Check for broker issues blocking sends

NotLeaderOrFollowerException:
• Partition leader changed
• Retriable: the producer refreshes metadata and retries (needs retries > 0)
• If persistent, check cluster health

AuthorizationException:
• Check ACLs
• Verify credentials

Production Checklist

Before Going Live

┌──────────────────────────────────────────────────────────────────────────┐
│                    PRODUCTION READINESS CHECKLIST                        │
└──────────────────────────────────────────────────────────────────────────┘

INFRASTRUCTURE:
□ Minimum 3 brokers for HA
□ Topics have replication.factor >= 3
□ min.insync.replicas = 2
□ unclean.leader.election.enable = false
□ Dedicated disks for Kafka data (log.dirs), separate from OS and application logs
□ Monitoring (JMX/Prometheus) enabled

PRODUCER:
□ acks = all
□ enable.idempotence = true
□ retries = MAX_INT (or high number)
□ delivery.timeout.ms >= linger.ms + request.timeout.ms, and above worst-case send latency
□ Error handling for send failures
□ Metrics collection

CONSUMER:
□ enable.auto.commit = false (manual control)
□ Appropriate max.poll.interval.ms
□ Error handling (retry + DLQ)
□ Idempotent processing
□ Consumer lag monitoring
□ Graceful shutdown handling

SECURITY:
□ SSL/TLS enabled
□ SASL authentication configured
□ ACLs for topics and groups
□ Encryption at rest (if needed)

OPERATIONS:
□ Monitoring dashboards
□ Alerting on key metrics
□ Runbooks for common issues
□ Backup/DR strategy
□ Capacity planning
□ Schema Registry (if using Avro/Protobuf)
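Some checklist items are mechanical enough to verify automatically, for example in a deployment pipeline. The sketch below checks a handful of them against plain string maps. The class name and the idea of passing settings as three maps are assumptions for illustration, `replication.factor` is used as a hypothetical map key for the topic's replication factor, and `min.insync.replicas` is read as "at least 2".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: checks a few checklist items against config maps. */
final class ReadinessCheck {

    static List<String> violations(Map<String, String> topic,
                                   Map<String, String> producer,
                                   Map<String, String> consumer) {
        List<String> out = new ArrayList<>();
        if (intOf(topic, "replication.factor") < 3) {
            out.add("replication.factor should be >= 3");
        }
        if (intOf(topic, "min.insync.replicas") < 2) {
            out.add("min.insync.replicas should be at least 2");
        }
        expect(topic, "unclean.leader.election.enable", "false", out);
        expect(producer, "acks", "all", out);
        expect(producer, "enable.idempotence", "true", out);
        expect(consumer, "enable.auto.commit", "false", out);
        return out;
    }

    // Missing numeric settings count as 0, so they are reported as violations.
    private static int intOf(Map<String, String> cfg, String key) {
        return Integer.parseInt(cfg.getOrDefault(key, "0"));
    }

    private static void expect(Map<String, String> cfg, String key,
                               String wanted, List<String> out) {
        if (!wanted.equals(cfg.get(key))) {
            out.add(key + " should be " + wanted);
        }
    }

    public static void main(String[] args) {
        // With nothing configured, all six checks fail.
        Map<String, String> empty = new HashMap<>();
        System.out.println(violations(empty, empty, empty).size()); // prints 6
    }
}
```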

Metrics to Monitor

┌──────────────────────────────────────────────────────────────────────────┐
│                    KEY METRICS                                           │
├────────────────────────────────────┬─────────────────────────────────────┤
│ BROKER                             │ ALERT WHEN                          │
├────────────────────────────────────┼─────────────────────────────────────┤
│ OfflinePartitionsCount             │ > 0                                 │
│ UnderReplicatedPartitions          │ > 0 for > 5 min                     │
│ ActiveControllerCount              │ != 1 (summed across the cluster)    │
│ IsrShrinksPerSec                   │ Sustained high rate                 │
│ RequestQueueSize                   │ Growing                             │
├────────────────────────────────────┼─────────────────────────────────────┤
│ PRODUCER                           │                                     │
├────────────────────────────────────┼─────────────────────────────────────┤
│ record-error-rate                  │ > 0 sustained                       │
│ record-retry-rate                  │ High rate                           │
│ buffer-available-bytes             │ Near 0                              │
│ request-latency-avg                │ Increasing                          │
├────────────────────────────────────┼─────────────────────────────────────┤
│ CONSUMER                           │                                     │
├────────────────────────────────────┼─────────────────────────────────────┤
│ records-lag-max                    │ Growing                             │
│ records-consumed-rate              │ Dropping                            │
│ rebalance-rate-per-hour            │ > 1-2/hour                          │
│ commit-rate                        │ Dropping to 0                       │
└────────────────────────────────────┴─────────────────────────────────────┘

Quick Formulas

┌──────────────────────────────────────────────────────────────────────────┐
│                    CAPACITY PLANNING FORMULAS                            │
└──────────────────────────────────────────────────────────────────────────┘

DISK SPACE:
daily_data = messages_per_second × avg_message_size × 86400
total_disk = daily_data × retention_days × replication_factor × 1.2

Example: 1000 msg/s × 1KB × 86400 × 7 days × 3 RF × 1.2 = ~2.2 TB

NETWORK:
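inbound  = messages_per_second × avg_message_size × replication_factor
           (producer writes plus follower replication, cluster-wide)
outbound = messages_per_second × avg_message_size × consumer_group_count
           (each consumer group reads the full stream; instances within a group share it)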

PARTITIONS:
partitions = max(
    target_throughput ÷ 10MB_per_partition,
    max_consumer_instances
)

CONSUMER INSTANCES:
max_useful_instances = partition_count
(More instances than partitions = some idle)
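The disk and network formulas translate directly into code. The sketch below is one way to write them down: `CapacityPlanner` is a hypothetical name, 1 KB is taken as 1000 bytes, the 1.2 headroom factor is passed in as a parameter, and the outbound estimate assumes each consumer group reads the full stream.

```java
import java.util.Locale;

/** Hypothetical helper: the capacity planning formulas above as functions. */
final class CapacityPlanner {

    static final double SECONDS_PER_DAY = 86_400;

    static double dailyBytes(double messagesPerSecond, double avgMessageBytes) {
        return messagesPerSecond * avgMessageBytes * SECONDS_PER_DAY;
    }

    static double totalDiskBytes(double messagesPerSecond, double avgMessageBytes,
                                 int retentionDays, int replicationFactor,
                                 double headroom) {
        return dailyBytes(messagesPerSecond, avgMessageBytes)
                * retentionDays * replicationFactor * headroom;
    }

    // Cluster-wide write traffic: producer writes plus replication copies.
    static double inboundBytesPerSecond(double messagesPerSecond, double avgMessageBytes,
                                        int replicationFactor) {
        return messagesPerSecond * avgMessageBytes * replicationFactor;
    }

    // Read traffic, assuming every consumer group reads all messages once.
    static double outboundBytesPerSecond(double messagesPerSecond, double avgMessageBytes,
                                         int consumerGroups) {
        return messagesPerSecond * avgMessageBytes * consumerGroups;
    }

    public static void main(String[] args) {
        // The worked example: 1000 msg/s, 1 KB each, 7 days, RF 3, 20% headroom.
        double bytes = totalDiskBytes(1000, 1000, 7, 3, 1.2);
        System.out.printf(Locale.ROOT, "%.2f TB%n", bytes / 1e12); // prints 2.18 TB
    }
}
```

The result matches the ~2.2 TB figure in the example once rounded to one decimal place.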

Series Index

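┌──────┬────────────────────────────────┬──────────────────────────────┐
│ Part │ Title                          │ Key Topics                   │
├──────┼────────────────────────────────┼──────────────────────────────┤
│ 0    │ Introduction & Series Overview │ Kafka evolution, use cases   │
│ 1    │ Core Concepts & Architecture   │ Topics, partitions, brokers  │
│ 2    │ Storage Engine Deep Dive       │ Log segments, indexes        │
│ 3    │ Replication & Fault Tolerance  │ ISR, leaders, followers      │
│ 4    │ Cluster Coordination           │ KRaft vs ZooKeeper           │
│ 5    │ Producer Internals             │ Batching, compression        │
│ 6    │ Delivery Guarantees            │ acks, idempotence            │
│ 7    │ Advanced Producer Patterns     │ Partitioners, transactions   │
│ 8    │ Consumer Internals             │ Poll loop, fetch             │
│ 9    │ Consumer Groups                │ Rebalancing, assignment      │
│ 10   │ Offset Management              │ Commit strategies            │
│ 11   │ Exactly-Once Semantics         │ EOS, transactions            │
│ 12   │ Schema Registry                │ Avro, compatibility          │
│ 13   │ Security                       │ Authentication, ACLs         │
│ 14   │ Monitoring & Operations        │ Metrics, alerting            │
│ 15   │ Streams Fundamentals           │ DSL, topologies              │
│ 16   │ State Stores                   │ RocksDB, queries             │
│ 17   │ Windowing & Joins              │ Time semantics               │
│ 18   │ Event-Driven Patterns          │ Sourcing, CQRS, Saga         │
│ 19   │ Error Handling                 │ Retry, DLQ                   │
│ 20   │ Testing                        │ Unit, integration            │
│ 21   │ Cheatsheet                     │ This page!                   │
└──────┴────────────────────────────────┴──────────────────────────────┘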

Series Navigation

Series Overview: Kafka Compendium Series

Thank you for following the Kafka Compendium series! If you found this helpful, consider sharing it with your team. For questions or feedback, reach out via the comments.