Cheatsheet & Decision Guide

At a Glance

This is your quick reference for everything covered in the Kafka Compendium series. Bookmark this page for:

Configuration cheatsheets
CLI command reference
Decision trees for common choices
Troubleshooting guides
Spring Kafka quick reference
Production checklist

Producer Configuration Cheatsheet

Essential Producer Settings


Producer Profiles

┌──────────────────────────────────────────────────────────────────────────┐
│                    PRODUCER CONFIGURATION PROFILES                       │
├─────────────────────┬─────────────────────┬──────────────────────────────┤
│ Setting             │ High Throughput     │ High Reliability             │
├─────────────────────┼─────────────────────┼──────────────────────────────┤
│ acks                │ 1                   │ all                          │
│ batch.size          │ 65536 (64KB)        │ 16384 (16KB)                 │
│ linger.ms           │ 20-100              │ 0-5                          │
│ compression.type    │ lz4 or zstd         │ lz4                          │
│ buffer.memory       │ 67108864 (64MB)     │ 33554432 (32MB)              │
│ enable.idempotence  │ false (optional)    │ true                         │
│ retries             │ 3                   │ MAX_INT                      │
└─────────────────────┴─────────────────────┴──────────────────────────────┘
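
The two profiles in the table can be captured as plain `java.util.Properties` objects. The following is a minimal sketch, not a definitive setup: `ProducerProfiles` and its method names are hypothetical, and where the table gives a range (`linger.ms`) or a choice (`compression.type`) one value was picked arbitrarily.

```java
import java.util.Properties;

// Hypothetical helper: ProducerProfiles is not part of Kafka. It only collects
// the values from the profile table into java.util.Properties objects.
final class ProducerProfiles {

    // "High Throughput" column: bigger batches, some linger, leader-only acks.
    static Properties highThroughput() {
        Properties p = new Properties();
        p.setProperty("acks", "1");
        p.setProperty("batch.size", "65536");         // 64KB
        p.setProperty("linger.ms", "20");             // table range is 20-100; 20 chosen arbitrarily
        p.setProperty("compression.type", "lz4");     // table allows lz4 or zstd
        p.setProperty("buffer.memory", "67108864");   // 64MB
        p.setProperty("enable.idempotence", "false");
        p.setProperty("retries", "3");
        return p;
    }

    // "High Reliability" column: all in-sync replicas must ack, idempotence on.
    static Properties highReliability() {
        Properties p = new Properties();
        p.setProperty("acks", "all");
        p.setProperty("batch.size", "16384");         // 16KB
        p.setProperty("linger.ms", "5");              // table range is 0-5; 5 chosen arbitrarily
        p.setProperty("compression.type", "lz4");
        p.setProperty("buffer.memory", "33554432");   // 32MB
        p.setProperty("enable.idempotence", "true");
        p.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        return p;
    }

    public static void main(String[] args) {
        // Print the reliability profile as key=value lines (order is not guaranteed).
        highReliability().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Before either object can be handed to a real producer it would still need `bootstrap.servers`, `key.serializer`, and `value.serializer`, which the table does not cover.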

Consumer Configuration Cheatsheet

Essential Consumer Settings


Consumer Timeout Decision

┌──────────────────────────────────────────────────────────────────────────┐
│                    CONSUMER TIMEOUT GUIDE                                │
└──────────────────────────────────────────────────────────────────────────┘

  session.timeout.ms = 45000 (45s)
  ├── Heartbeat-based detection
  ├── Consumer marked dead if no heartbeat
  └── Triggers rebalance

  heartbeat.interval.ms = 3000 (3s)
  ├── Should be no more than 1/3 of session.timeout.ms
  └── Background thread, not affected by processing

  max.poll.interval.ms = 300000 (5min)
  ├── Processing-based detection
  ├── Consumer marked dead if poll() not called
  └── Set based on max processing time

  RULE: max.poll.interval.ms > max_processing_time_per_batch
        heartbeat.interval.ms ≤ session.timeout.ms / 3
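
The two rules can be checked mechanically before a consumer is deployed. The sketch below is illustrative only: `ConsumerTimeoutCheck` is a hypothetical helper, and it treats the 1/3 ratio as an upper bound on the heartbeat interval (an assumption; the 3s and 45s values listed above already sit well inside it).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Kafka API: validates the timeout rules above.
final class ConsumerTimeoutCheck {

    // Returns a human-readable list of rule violations (empty when all rules hold).
    static List<String> validate(long sessionTimeoutMs,
                                 long heartbeatIntervalMs,
                                 long maxPollIntervalMs,
                                 long maxBatchProcessingMs) {
        List<String> problems = new ArrayList<>();
        // Assumption: heartbeat interval should be at most a third of the session timeout.
        if (heartbeatIntervalMs * 3 > sessionTimeoutMs) {
            problems.add("heartbeat.interval.ms should be at most 1/3 of session.timeout.ms");
        }
        // poll() must be called again before max.poll.interval.ms elapses.
        if (maxPollIntervalMs <= maxBatchProcessingMs) {
            problems.add("max.poll.interval.ms must exceed the worst-case batch processing time");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Values from the guide, with an assumed worst-case batch time of 2 minutes.
        System.out.println(validate(45_000, 3_000, 300_000, 120_000));
    }
}
```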

Kafka Streams Configuration Cheatsheet


CLI Commands Reference

Topic Management


Consumer Groups


Console Producer/Consumer


ACLs


Decision Trees

When to Use Kafka vs Alternatives

┌──────────────────────────────────────────────────────────────────────────┐
│                    MESSAGING SYSTEM DECISION                             │
└──────────────────────────────────────────────────────────────────────────┘

Start: What's your primary use case?

├── Point-to-point messaging with routing?
│   └── Consider: RabbitMQ, ActiveMQ
│
├── Real-time event streaming with replay?
│   └── Consider: KAFKA ✓
│
├── Pub/sub with message filtering?
│   ├── Need replay/persistence? → KAFKA ✓
│   └── Fire-and-forget ok? → Redis Pub/Sub, SNS
│
├── Stream processing with state?
│   └── Consider: Kafka Streams, Flink, Spark
│
└── Simple task queue?
    └── Consider: Redis, SQS, Celery

Partition Count Decision

┌──────────────────────────────────────────────────────────────────────────┐
│                    HOW MANY PARTITIONS?                                  │
└──────────────────────────────────────────────────────────────────────────┘

Step 1: Determine parallelism requirement
        max_consumers = max(current_consumers, expected_consumers)

Step 2: Calculate throughput requirement
        partitions_for_throughput = target_throughput / throughput_per_partition
        (Rule of thumb: 10 MB/s per partition)

Step 3: Consider ordering requirements
        If strict global ordering needed: partitions = 1
        If ordering only per key: partitions can be higher (a key always maps to one partition)

Step 4: Final calculation
        partitions = max(max_consumers, partitions_for_throughput)

Step 5: Add growth buffer
        final_partitions = partitions × 1.5

CONSTRAINTS:
• Max ~4000 partitions per broker
• More partitions = longer recovery time
• Can only increase, not decrease (increasing changes the key-to-partition mapping)
• Start conservative, increase later
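
Steps 1, 2, 4, and 5 are plain arithmetic and can be sketched as a single function. `PartitionPlanner` is a hypothetical name, and rounding each intermediate result up to a whole partition is an assumption the steps above do not spell out.

```java
// Hypothetical helper, not a Kafka API: applies the partition-count steps above.
final class PartitionPlanner {

    static int partitions(int currentConsumers,
                          int expectedConsumers,
                          double targetThroughputMBps,
                          double perPartitionMBps,
                          double growthFactor) {
        // Step 1: parallelism requirement.
        int maxConsumers = Math.max(currentConsumers, expectedConsumers);
        // Step 2: throughput requirement, rounded up (assumption).
        int forThroughput = (int) Math.ceil(targetThroughputMBps / perPartitionMBps);
        // Step 4: take the larger of the two requirements.
        int base = Math.max(maxConsumers, forThroughput);
        // Step 5: growth buffer, rounded up (assumption).
        return (int) Math.ceil(base * growthFactor);
    }

    public static void main(String[] args) {
        // 6 consumers today, 12 expected, 200 MB/s target, 10 MB/s per partition, 1.5x buffer.
        System.out.println(partitions(6, 12, 200.0, 10.0, 1.5)); // prints 30
    }
}
```

In the example the throughput requirement (20 partitions) dominates the consumer requirement (12), and the 1.5 buffer brings the result to 30.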

Delivery Guarantee Decision

┌──────────────────────────────────────────────────────────────────────────┐
│                    DELIVERY GUARANTEE CHOICE                             │
└──────────────────────────────────────────────────────────────────────────┘

                    ┌─────────────────────┐
                    │ Can you lose        │
                    │ messages?           │
                    └──────────┬──────────┘
                               │
             ┌─────────────────┴─────────────────┐
             │ YES                               │ NO
             ▼                                   ▼
    ┌─────────────────┐               ┌─────────────────┐
    │ AT-MOST-ONCE    │               │ Can consumer    │
    │                 │               │ handle          │
    │ acks=0          │               │ duplicates?     │
    │ Fast, may lose  │               └────────┬────────┘
    └─────────────────┘                        │
                                 ┌─────────────┴─────────────┐
                                 │ YES                       │ NO
                                 ▼                           ▼
                        ┌─────────────────┐        ┌─────────────────┐
                        │ AT-LEAST-ONCE   │        │ EXACTLY-ONCE    │
                        │                 │        │                 │
                        │ acks=all        │        │ Transactions +  │
                        │ retries=MAX     │        │ Idempotent      │
                        │ May duplicate   │        │ consumer        │
                        └─────────────────┘        └─────────────────┘
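
The same tree, written as code for readers who prefer to see the branching explicitly. This is a sketch only: `DeliveryGuaranteeChooser` and its enum are hypothetical names, and the hint strings simply repeat the box contents from the diagram.

```java
// Hypothetical helper, not a Kafka API: encodes the decision tree above.
final class DeliveryGuaranteeChooser {

    enum Guarantee { AT_MOST_ONCE, AT_LEAST_ONCE, EXACTLY_ONCE }

    static Guarantee choose(boolean canLoseMessages, boolean consumerHandlesDuplicates) {
        if (canLoseMessages) {
            return Guarantee.AT_MOST_ONCE;
        }
        return consumerHandlesDuplicates ? Guarantee.AT_LEAST_ONCE : Guarantee.EXACTLY_ONCE;
    }

    // Repeats the settings named in each leaf of the diagram.
    static String hint(Guarantee guarantee) {
        switch (guarantee) {
            case AT_MOST_ONCE:
                return "acks=0";
            case AT_LEAST_ONCE:
                return "acks=all, retries=MAX";
            default:
                return "transactions + idempotent consumer";
        }
    }

    public static void main(String[] args) {
        Guarantee g = choose(false, true);
        System.out.println(g + ": " + hint(g)); // prints AT_LEAST_ONCE: acks=all, retries=MAX
    }
}
```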

KStream vs KTable Decision

┌──────────────────────────────────────────────────────────────────────────┐
│                    KSTREAM VS KTABLE                                     │
└──────────────────────────────────────────────────────────────────────────┘

                    ┌─────────────────────┐
                    │ Do you need every   │
                    │ record/event?       │
                    └──────────┬──────────┘
                               │
             ┌─────────────────┴─────────────────┐
             │ YES                               │ NO (only latest)
             ▼                                   ▼
    ┌─────────────────┐               ┌─────────────────┐
    │    KSTREAM      │               │    KTABLE       │
    │                 │               │                 │
    │ • Event log     │               │ • Current state │
    │ • Transactions  │               │ • Reference data│
    │ • Actions       │               │ • Aggregations  │
    │ • Notifications │               │ • Lookups       │
    └─────────────────┘               └─────────────────┘

Spring Kafka Quick Reference

Basic Configuration


Common Annotations


Sending Messages


Troubleshooting Guide

Consumer Lag Growing

┌──────────────────────────────────────────────────────────────────────────┐
│                    CONSUMER LAG GROWING                                  │
└──────────────────────────────────────────────────────────────────────────┘

1. Check if consumer is running
   $ kafka-consumer-groups.sh --describe --group my-group
   Look for: CONSUMER-ID column

2. Check consumer health
   $ curl http://app:8080/actuator/health/kafka
   (assumes the app registers a Kafka health indicator under this path)

3. Check processing rate
   - Compare records-consumed-rate vs production rate
   - If consumption < production → need more consumers or faster processing

4. Check for rebalancing
   - State should be "Stable"
   - Frequent rebalances → unstable consumers

5. Check downstream dependencies
   - Database slow?
   - External API latency?

SOLUTIONS:
• Add more consumers (up to partition count)
• Tune max.poll.records (larger batches help only if processing still fits within max.poll.interval.ms)
• Optimize processing logic
• Scale downstream services
• Consider async processing
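
Step 3 compares consumption against production. A rough way to turn that comparison into a number is to estimate how long the backlog would take to drain at current rates. The sketch below is an illustration under simple assumptions (constant rates, lag measured in records); `LagEstimator` is a hypothetical name, not a Kafka API.

```java
import java.util.OptionalDouble;

// Hypothetical helper: estimates backlog drain time from lag and observed rates.
final class LagEstimator {

    // Returns empty when consumption does not outpace production,
    // i.e. the lag will not shrink at the current rates.
    static OptionalDouble secondsToCatchUp(long currentLag,
                                           double producedPerSec,
                                           double consumedPerSec) {
        if (consumedPerSec <= producedPerSec) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(currentLag / (consumedPerSec - producedPerSec));
    }

    public static void main(String[] args) {
        // 120,000 records behind, producing 1,000/s, consuming 1,500/s.
        System.out.println(secondsToCatchUp(120_000, 1_000.0, 1_500.0).getAsDouble()); // prints 240.0
    }
}
```

An empty result corresponds to the "consumption < production" branch above: adding consumers or speeding up processing is required before the lag can recover.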

Consumer Keeps Rebalancing

┌──────────────────────────────────────────────────────────────────────────┐
│                    FREQUENT REBALANCES                                   │
└──────────────────────────────────────────────────────────────────────────┘

SYMPTOMS:
• Consumer group state flips between "PreparingRebalance" and "Stable"
• Processing stops during rebalances
• Logs show "Revoking previously assigned partitions"

CAUSES & FIXES:

1. Processing too slow (max.poll.interval.ms exceeded)
   FIX: Increase max.poll.interval.ms
        OR reduce max.poll.records
        OR optimize processing

2. Consumer crashes/restarts
   FIX: Fix application stability
        Consider static membership (group.instance.id)

3. Session timeout too aggressive
   FIX: Increase session.timeout.ms
        Ensure heartbeat.interval.ms ≤ session.timeout.ms / 3

4. JVM GC pauses
   FIX: Tune GC settings
        Increase heap
        Use G1GC

5. Network issues
   FIX: Check network stability
        Increase request.timeout.ms

Producer Send Failures

┌──────────────────────────────────────────────────────────────────────────┐
│                    PRODUCER SEND FAILURES                                │
└──────────────────────────────────────────────────────────────────────────┘

TimeoutException:
• Check broker connectivity
• Increase delivery.timeout.ms
• Check broker load

RecordTooLargeException:
• Message exceeds producer max.request.size or topic max.message.bytes (both ~1MB by default)
• Increase max.request.size and max.message.bytes (or broker-wide message.max.bytes)
• OR compress messages
• OR reduce message size

BufferExhaustedException:
• buffer.memory full
• Increase buffer.memory
• OR slow down production
• Check for broker issues blocking sends

NotLeaderOrFollowerException:
• Partition leader changed
• Retriable: the producer refreshes metadata and retries automatically (retries > 0)
• If persistent, check cluster health

AuthorizationException:
• Check ACLs
• Verify credentials

Production Checklist

Before Going Live

┌──────────────────────────────────────────────────────────────────────────┐
│                    PRODUCTION READINESS CHECKLIST                        │
└──────────────────────────────────────────────────────────────────────────┘

INFRASTRUCTURE:
□ Minimum 3 brokers for HA
□ Topics have replication.factor >= 3
□ min.insync.replicas = 2
□ unclean.leader.election.enable = false
□ Separate disks for Kafka data (log.dirs) and OS/application logs
□ Monitoring (JMX/Prometheus) enabled

PRODUCER:
□ acks = all
□ enable.idempotence = true
□ retries = MAX_INT (or high number)
□ delivery.timeout.ms >= linger.ms + request.timeout.ms, and above expected latency
□ Error handling for send failures
□ Metrics collection

CONSUMER:
□ enable.auto.commit = false (manual control)
□ Appropriate max.poll.interval.ms
□ Error handling (retry + DLQ)
□ Idempotent processing
□ Consumer lag monitoring
□ Graceful shutdown handling

SECURITY:
□ SSL/TLS enabled
□ SASL authentication configured
□ ACLs for topics and groups
□ Encryption at rest (if needed)

OPERATIONS:
□ Monitoring dashboards
□ Alerting on key metrics
□ Runbooks for common issues
□ Backup/DR strategy
□ Capacity planning
□ Schema Registry (if using Avro/Protobuf)
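
A few of the checklist items are concrete enough to verify in code before deployment. The following is a minimal sketch covering only the PRODUCER and INFRASTRUCTURE items that have explicit values; `ReadinessCheck` is a hypothetical helper, and treating an unset property as a gap (rather than relying on client defaults) is an assumption made here for strictness.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not a Kafka API: flags checklist items that are not met.
final class ReadinessCheck {

    static List<String> producerGaps(Properties producer) {
        List<String> gaps = new ArrayList<>();
        String acks = producer.getProperty("acks");
        // "-1" is the numeric spelling of "all".
        if (!"all".equals(acks) && !"-1".equals(acks)) {
            gaps.add("acks should be all");
        }
        if (!"true".equals(producer.getProperty("enable.idempotence"))) {
            gaps.add("enable.idempotence should be true");
        }
        return gaps;
    }

    static List<String> clusterGaps(int brokers, int replicationFactor, int minInsyncReplicas) {
        List<String> gaps = new ArrayList<>();
        if (brokers < 3) {
            gaps.add("need at least 3 brokers for HA");
        }
        if (replicationFactor < 3) {
            gaps.add("replication.factor should be >= 3");
        }
        if (minInsyncReplicas < 2) {
            gaps.add("min.insync.replicas should be 2");
        }
        return gaps;
    }

    public static void main(String[] args) {
        Properties producer = new Properties();
        producer.setProperty("acks", "1");
        System.out.println(producerGaps(producer));   // two gaps: acks and idempotence
        System.out.println(clusterGaps(3, 3, 2));     // prints []
    }
}
```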

Metrics to Monitor

┌──────────────────────────────────────────────────────────────────────────┐
│                    KEY METRICS                                           │
├────────────────────────────────────┬─────────────────────────────────────┤
│ BROKER                             │ ALERT WHEN                          │
├────────────────────────────────────┼─────────────────────────────────────┤
│ OfflinePartitionsCount             │ > 0                                 │
│ UnderReplicatedPartitions          │ > 0 for > 5 min                     │
│ ActiveControllerCount              │ != 1                                │
│ IsrShrinksPerSec                   │ Sustained high rate                 │
│ RequestQueueSize                   │ Growing                             │
├────────────────────────────────────┼─────────────────────────────────────┤
│ PRODUCER                           │                                     │
├────────────────────────────────────┼─────────────────────────────────────┤
│ record-error-rate                  │ > 0 sustained                       │
│ record-retry-rate                  │ High rate                           │
│ buffer-available-bytes             │ Near 0                              │
│ request-latency-avg                │ Increasing                          │
├────────────────────────────────────┼─────────────────────────────────────┤
│ CONSUMER                           │                                     │
├────────────────────────────────────┼─────────────────────────────────────┤
│ records-lag-max                    │ Growing                             │
│ records-consumed-rate              │ Dropping                            │
│ rebalance-rate-per-hour            │ > 1-2/hour                          │
│ commit-rate                        │ Dropping to 0                       │
└────────────────────────────────────┴─────────────────────────────────────┘

Quick Formulas

┌──────────────────────────────────────────────────────────────────────────┐
│                    CAPACITY PLANNING FORMULAS                            │
└──────────────────────────────────────────────────────────────────────────┘

DISK SPACE:
daily_data = messages_per_second × avg_message_size × 86400
total_disk = daily_data × retention_days × replication_factor × 1.2

Example: 1000 msg/s × 1KB × 86400 × 7 days × 3 RF × 1.2 = ~2.2 TB

NETWORK:
inbound = messages_per_second × avg_message_size × replication_factor
outbound = messages_per_second × avg_message_size × consumer_group_count

PARTITIONS:
partitions = max(
    target_throughput ÷ 10MB_per_partition,
    max_consumer_instances
)

CONSUMER INSTANCES:
max_useful_instances = partition_count
(More instances than partitions = some idle)
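
The disk and network formulas translate directly into code. This is a sketch under stated assumptions: `CapacityPlanner` is a hypothetical name, 1KB is taken as 1024 bytes, and the consumer multiplier in the outbound formula is read as the number of consumer groups, since instances within one group share the stream between them.

```java
// Hypothetical helper, not a Kafka API: applies the disk and network formulas above.
final class CapacityPlanner {

    static final double HEADROOM = 1.2;          // the 1.2 factor from the disk formula
    static final int SECONDS_PER_DAY = 86_400;

    static double dailyBytes(double messagesPerSecond, double avgMessageBytes) {
        return messagesPerSecond * avgMessageBytes * SECONDS_PER_DAY;
    }

    static double totalDiskBytes(double messagesPerSecond, double avgMessageBytes,
                                 int retentionDays, int replicationFactor) {
        return dailyBytes(messagesPerSecond, avgMessageBytes)
                * retentionDays * replicationFactor * HEADROOM;
    }

    static double inboundBytesPerSec(double messagesPerSecond, double avgMessageBytes,
                                     int replicationFactor) {
        return messagesPerSecond * avgMessageBytes * replicationFactor;
    }

    // Assumption: each consumer group reads the full stream once.
    static double outboundBytesPerSec(double messagesPerSecond, double avgMessageBytes,
                                      int consumerGroups) {
        return messagesPerSecond * avgMessageBytes * consumerGroups;
    }

    public static void main(String[] args) {
        // The worked example: 1000 msg/s, 1KB messages, 7 days retention, RF 3.
        double disk = totalDiskBytes(1000, 1024, 7, 3);
        System.out.printf("disk: %.2f TB%n", disk / 1e12);
        System.out.printf("inbound: %.0f bytes/s%n", inboundBytesPerSec(1000, 1024, 3));
        System.out.printf("outbound: %.0f bytes/s%n", outboundBytesPerSec(1000, 1024, 2));
    }
}
```

With these inputs the disk estimate comes out at roughly 2.2 TB, matching the worked example.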

Series Index

Part	Title	Key Topics
0	Introduction & Series Overview	Kafka evolution, use cases
1	Core Concepts & Architecture	Topics, partitions, brokers
2	Storage Engine Deep Dive	Log segments, indexes
3	Replication & Fault Tolerance	ISR, leaders, followers
4	Cluster Coordination	KRaft vs ZooKeeper
5	Producer Internals	Batching, compression
6	Delivery Guarantees	acks, idempotence
7	Advanced Producer Patterns	Partitioners, transactions
8	Consumer Internals	Poll loop, fetch
9	Consumer Groups	Rebalancing, assignment
10	Offset Management	Commit strategies
11	Exactly-Once Semantics	EOS, transactions
12	Schema Registry	Avro, compatibility
13	Security	Authentication, ACLs
14	Monitoring & Operations	Metrics, alerting
15	Streams Fundamentals	DSL, topologies
16	State Stores	RocksDB, queries
17	Windowing & Joins	Time semantics
18	Event-Driven Patterns	Sourcing, CQRS, Saga
19	Error Handling	Retry, DLQ
20	Testing	Unit, integration
21	Cheatsheet	This page!

Previous: Part 20: Testing Kafka Applications

Series Overview: Kafka Compendium Series

Thank you for following the Kafka Compendium series! If you found this helpful, consider sharing it with your team. For questions or feedback, reach out via the comments.