Cheatsheet & Decision Guide
At a Glance
This is your quick reference for everything covered in the Kafka Compendium series. Bookmark this page for:
- Configuration cheatsheets
- CLI command reference
- Decision trees for common choices
- Troubleshooting guides
- Spring Kafka quick reference
- Production checklist
Producer Configuration Cheatsheet
Essential Producer Settings
[Code listing: properties, 28 lines]
Producer Profiles
| Setting | High Throughput | High Reliability |
|---|---|---|
| acks | 1 | all |
| batch.size | 65536 (64 KB) | 16384 (16 KB) |
| linger.ms | 20-100 | 0-5 |
| compression.type | lz4 or zstd | lz4 |
| buffer.memory | 67108864 (64 MB) | 33554432 (32 MB) |
| enable.idempotence | false (optional) | true |
| retries | 3 | MAX_INT |
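As a rough sketch, the two profiles can be expressed as plain `java.util.Properties` objects that are later handed to a producer. The class name `ProducerProfiles` is hypothetical, and where the table gives a range (`linger.ms`) the value chosen here is an assumption, not a recommendation.

```java
import java.util.Properties;

// Hypothetical helper: builds the two producer profiles from the table.
// Only string keys are used, so no Kafka client dependency is needed here.
final class ProducerProfiles {
    private ProducerProfiles() {}

    /** Throughput-oriented profile. linger.ms=20 is the low end of the 20-100 range (assumption). */
    static Properties highThroughput() {
        Properties p = new Properties();
        p.setProperty("acks", "1");
        p.setProperty("batch.size", "65536");        // 64 KB
        p.setProperty("linger.ms", "20");
        p.setProperty("compression.type", "lz4");
        p.setProperty("buffer.memory", "67108864");  // 64 MB
        p.setProperty("enable.idempotence", "false");
        p.setProperty("retries", "3");
        return p;
    }

    /** Reliability-oriented profile. linger.ms=5 is the high end of the 0-5 range (assumption). */
    static Properties highReliability() {
        Properties p = new Properties();
        p.setProperty("acks", "all");
        p.setProperty("batch.size", "16384");        // 16 KB
        p.setProperty("linger.ms", "5");
        p.setProperty("compression.type", "lz4");
        p.setProperty("buffer.memory", "33554432");  // 32 MB
        p.setProperty("enable.idempotence", "true");
        p.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        return p;
    }
}
```

Either object can then be extended with `bootstrap.servers` and serializer settings before being passed to a producer constructor.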
Consumer Configuration Cheatsheet
Essential Consumer Settings
[Code listing: properties, 33 lines]
Consumer Timeout Decision
session.timeout.ms = 45000 (45s)
- Heartbeat-based detection
- Consumer marked dead if no heartbeat arrives within this window
- Triggers a rebalance

heartbeat.interval.ms = 3000 (3s)
- Should be no more than 1/3 of session.timeout.ms
- Sent from a background thread, so not affected by processing time

max.poll.interval.ms = 300000 (5min)
- Processing-based detection
- Consumer marked dead if poll() is not called within this window
- Set based on the maximum processing time per batch

RULE:
- max.poll.interval.ms > max_processing_time_per_batch
- heartbeat.interval.ms ≤ session.timeout.ms / 3
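The timeout rules can be checked mechanically before deployment. The sketch below is a minimal illustration, not part of any Kafka API: the class `ConsumerTimeoutCheck` and its parameters are hypothetical, and it treats the 1/3 heartbeat guidance as an upper bound (an assumption that the example values of 45000 ms and 3000 ms satisfy).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical validator for the consumer timeout rules described above.
final class ConsumerTimeoutCheck {
    private ConsumerTimeoutCheck() {}

    /**
     * Returns a list of human-readable problems; an empty list means the
     * settings are consistent with the rules.
     */
    static List<String> validate(long sessionTimeoutMs,
                                 long heartbeatIntervalMs,
                                 long maxPollIntervalMs,
                                 long maxBatchProcessingMs) {
        List<String> problems = new ArrayList<>();
        // Heartbeats must fit several times into the session window.
        if (heartbeatIntervalMs * 3 > sessionTimeoutMs) {
            problems.add("heartbeat.interval.ms should be at most session.timeout.ms / 3");
        }
        // The poll interval must exceed the slowest expected batch.
        if (maxPollIntervalMs <= maxBatchProcessingMs) {
            problems.add("max.poll.interval.ms must exceed the longest batch processing time");
        }
        return problems;
    }
}
```

For example, `ConsumerTimeoutCheck.validate(45000, 3000, 300000, 60000)` returns an empty list, while a batch that can take longer than `max.poll.interval.ms` is reported as a problem.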
Kafka Streams Configuration Cheatsheet
[Code listing: properties, 29 lines]
CLI Commands Reference
Topic Management
[Code listing: bash, 26 lines]
Consumer Groups
[Code listing: bash, 32 lines]
Console Producer/Consumer
[Code listing: bash, 17 lines]
ACLs
[Code listing: bash, 22 lines]
Decision Trees
When to Use Kafka vs Alternatives
Start: What's your primary use case?
- Point-to-point messaging with routing?
  - Consider: RabbitMQ, ActiveMQ
- Real-time event streaming with replay?
  - Consider: KAFKA ✓
- Pub/sub with message filtering?
  - Need replay/persistence? → KAFKA ✓
  - Fire-and-forget ok? → Redis Pub/Sub, SNS
- Stream processing with state?
  - Consider: Kafka Streams, Flink, Spark
- Simple task queue?
  - Consider: Redis, SQS, Celery
Partition Count Decision
Step 1: Determine parallelism requirement
- max_consumers = max(current_consumers, expected_consumers)

Step 2: Calculate throughput requirement
- partitions_for_throughput = target_throughput / throughput_per_partition
- (Rule of thumb: 10 MB/s per partition)

Step 3: Consider ordering requirements
- If strict ordering needed: partitions = 1 (per ordering key)
- If ordering by key: partitions can be higher

Step 4: Final calculation
- partitions = max(max_consumers, partitions_for_throughput)

Step 5: Add growth buffer
- final_partitions = partitions × 1.5

CONSTRAINTS:
- Max ~4000 partitions per broker
- More partitions = longer recovery time
- Can only increase, not decrease
- Start conservative, increase later
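Steps 1, 2, 4, and 5 reduce to a few lines of arithmetic. The sketch below is one possible reading of them: the class `PartitionPlanner` is hypothetical, throughput is expressed in MB/s, and fractional results are rounded up (an assumption, since the steps do not say how to round). Step 3 (ordering) is a design constraint and is not modeled.

```java
// Hypothetical calculator for the partition-count steps above.
final class PartitionPlanner {
    // Rule-of-thumb throughput per partition, from Step 2.
    static final double MB_PER_SEC_PER_PARTITION = 10.0;
    // Growth buffer from Step 5.
    static final double GROWTH_FACTOR = 1.5;

    private PartitionPlanner() {}

    static int recommend(int currentConsumers, int expectedConsumers, double targetMbPerSec) {
        // Step 1: parallelism requirement.
        int maxConsumers = Math.max(currentConsumers, expectedConsumers);
        // Step 2: partitions needed for throughput, rounded up.
        int forThroughput = (int) Math.ceil(targetMbPerSec / MB_PER_SEC_PER_PARTITION);
        // Step 4: take the larger of the two requirements.
        int base = Math.max(maxConsumers, forThroughput);
        // Step 5: add the growth buffer, rounded up.
        return (int) Math.ceil(base * GROWTH_FACTOR);
    }
}
```

With 6 current consumers, 12 expected consumers, and a 200 MB/s target, throughput dominates (20 partitions) and the buffered result is 30.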
Delivery Guarantee Decision
Can you lose messages?
- YES → AT-MOST-ONCE
  - acks=0
  - Fast, may lose messages
- NO → Can the consumer handle duplicates?
  - YES → AT-LEAST-ONCE
    - acks=all
    - retries=MAX
    - May duplicate
  - NO → EXACTLY-ONCE
    - Transactions + idempotent consumer
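The decision above has only two inputs, so it can be captured as a small function. This is an illustrative sketch: the enum `DeliveryGuarantee` and its `choose` method are hypothetical names, not part of the Kafka client.

```java
// Hypothetical encoding of the delivery-guarantee decision.
enum DeliveryGuarantee {
    AT_MOST_ONCE,   // acks=0: fast, may lose messages
    AT_LEAST_ONCE,  // acks=all with high retries: may duplicate
    EXACTLY_ONCE;   // transactions plus an idempotent consumer

    static DeliveryGuarantee choose(boolean canLoseMessages, boolean consumerHandlesDuplicates) {
        if (canLoseMessages) {
            // Loss is acceptable, so the cheapest option is enough.
            return AT_MOST_ONCE;
        }
        // Loss is not acceptable: duplicates decide between the other two.
        return consumerHandlesDuplicates ? AT_LEAST_ONCE : EXACTLY_ONCE;
    }
}
```

Note that the second question is only asked when messages cannot be lost, which is why `canLoseMessages` is checked first.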
KStream vs KTable Decision
Do you need every record/event?
- YES → KSTREAM
  - Event log
  - Transactions
  - Actions
  - Notifications
- NO (only latest) → KTABLE
  - Current state
  - Reference data
  - Aggregations
  - Lookups
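The difference between the two views can be illustrated without the Kafka Streams API at all. The sketch below is only an analogy using standard collections: a list stands in for the stream (every record kept) and a map stands in for the table (latest value per key). The class `StreamVsTableDemo` and the sample records are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Analogy only: plain collections standing in for KStream and KTable semantics.
final class StreamVsTableDemo {
    private StreamVsTableDemo() {}

    /** Stream view: every record is kept, in arrival order. */
    static List<String> asStream(String[][] records) {
        List<String> events = new ArrayList<>();
        for (String[] r : records) {
            events.add(r[0] + "=" + r[1]);
        }
        return events;
    }

    /** Table view: a later record for the same key overwrites the earlier one. */
    static Map<String, String> asTable(String[][] records) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] r : records) {
            latest.put(r[0], r[1]);
        }
        return latest;
    }
}
```

Given the records `alice=10`, `bob=5`, `alice=25`, the stream view holds all three events, while the table view holds two entries with `alice` mapped to `25`.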
Spring Kafka Quick Reference
Basic Configuration
[Code listing: YAML, 19 lines]
Common Annotations
[Code listing: Java, 34 lines]
Sending Messages
[Code listing: Java, 25 lines]
Troubleshooting Guide
Consumer Lag Growing
1. Check whether the consumer is running
   $ kafka-consumer-groups.sh --bootstrap-server <broker:port> --describe --group my-group
   Look for: CONSUMER-ID column
2. Check consumer health
   $ curl http://app:8080/actuator/health/kafka
3. Check processing rate
   - Compare records-consumed-rate vs production rate
   - If consumption < production → need more consumers or faster processing
4. Check for rebalancing
   - State should be "Stable"
   - Frequent rebalances → unstable consumers
5. Check downstream dependencies
   - Database slow?
   - External API latency?

SOLUTIONS:
- Add more consumers (up to partition count)
- Increase max.poll.records
- Optimize processing logic
- Scale downstream services
- Consider async processing
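The rate comparison in step 3 and the "up to partition count" limit can be turned into a quick estimate. This is a back-of-the-envelope sketch under stated assumptions: the class `LagEstimator` is hypothetical, rates are in records per second, and each consumer is assumed to process at the same steady rate.

```java
// Hypothetical helper for reasoning about consumer lag and scaling.
final class LagEstimator {
    private LagEstimator() {}

    /** Lag for one partition: records written but not yet committed as consumed. */
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0, logEndOffset - committedOffset);
    }

    /**
     * Consumers needed to match the production rate, capped at the partition
     * count because extra instances beyond that would sit idle.
     */
    static int consumersNeeded(double productionRate, double perConsumerRate, int partitionCount) {
        if (perConsumerRate <= 0) {
            throw new IllegalArgumentException("perConsumerRate must be positive");
        }
        int needed = (int) Math.ceil(productionRate / perConsumerRate);
        return Math.min(Math.max(needed, 1), partitionCount);
    }

    /** True if a fully scaled-out group (one consumer per partition) can keep up. */
    static boolean canKeepUp(double productionRate, double perConsumerRate, int partitionCount) {
        return perConsumerRate * partitionCount >= productionRate;
    }
}
```

When `canKeepUp` returns false, adding consumers alone will not clear the lag; faster processing or more partitions are the remaining options.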
Consumer Keeps Rebalancing
SYMPTOMS:
- Consumer group state flips between "PreparingRebalance" and "Stable"
- Processing stops during rebalances
- Logs show "Revoking previously assigned partitions"

CAUSES & FIXES:
1. Processing too slow (max.poll.interval.ms exceeded)
   FIX: Increase max.poll.interval.ms, reduce max.poll.records, or optimize processing
2. Consumer crashes/restarts
   FIX: Fix application stability; consider static membership (group.instance.id)
3. Session timeout too aggressive
   FIX: Increase session.timeout.ms; keep heartbeat.interval.ms ≤ session.timeout.ms / 3
4. JVM GC pauses
   FIX: Tune GC settings, increase heap, use G1GC
5. Network issues
   FIX: Check network stability; increase request.timeout.ms
Producer Send Failures
TimeoutException:
- Check broker connectivity
- Increase delivery.timeout.ms
- Check broker load

RecordTooLargeException:
- Record exceeds the producer's max.request.size or the topic's max.message.bytes (both default to about 1 MB)
- Increase the broker's message.max.bytes (or the topic's max.message.bytes) together with the producer's max.request.size
- OR compress messages
- OR reduce message size

BufferExhaustedException:
- buffer.memory full
- Increase buffer.memory
- OR slow down production
- Check for broker issues blocking sends

NotLeaderOrFollowerException:
- Partition leader changed
- Retriable: the producer refreshes metadata and retries automatically while retries > 0
- If persistent, check cluster health

AuthorizationException:
- Check ACLs
- Verify credentials
Production Checklist
Before Going Live
INFRASTRUCTURE:
□ Minimum 3 brokers for HA
□ Topics have replication.factor >= 3
□ min.insync.replicas = 2
□ unclean.leader.election.enable = false
□ Separate disks for Kafka data (log.dirs) and application logs
□ Monitoring (JMX/Prometheus) enabled

PRODUCER:
□ acks = all
□ enable.idempotence = true
□ retries = MAX_INT (or high number)
□ delivery.timeout.ms > expected latency
□ Error handling for send failures
□ Metrics collection

CONSUMER:
□ enable.auto.commit = false (manual control)
□ Appropriate max.poll.interval.ms
□ Error handling (retry + DLQ)
□ Idempotent processing
□ Consumer lag monitoring
□ Graceful shutdown handling

SECURITY:
□ SSL/TLS enabled
□ SASL authentication configured
□ ACLs for topics and groups
□ Encryption at rest (if needed)

OPERATIONS:
□ Monitoring dashboards
□ Alerting on key metrics
□ Runbooks for common issues
□ Backup/DR strategy
□ Capacity planning
□ Schema Registry (if using Avro/Protobuf)
Metrics to Monitor
| Component | Metric | Alert When |
|---|---|---|
| Broker | OfflinePartitionsCount | > 0 |
| Broker | UnderReplicatedPartitions | > 0 for > 5 min |
| Broker | ActiveControllerCount | != 1 |
| Broker | IsrShrinksPerSec | Sustained high rate |
| Broker | RequestQueueSize | Growing |
| Producer | record-error-rate | > 0 sustained |
| Producer | record-retry-rate | High rate |
| Producer | buffer-available-bytes | Near 0 |
| Producer | request-latency-avg | Increasing |
| Consumer | records-lag-max | Growing |
| Consumer | records-consumed-rate | Dropping |
| Consumer | rebalance-rate-per-hour | > 1-2/hour |
| Consumer | commit-rate | Dropping to 0 |
Quick Formulas
DISK SPACE:
- daily_data = messages_per_second × avg_message_size × 86400
- total_disk = daily_data × retention_days × replication_factor × 1.2
- Example: 1000 msg/s × 1 KB × 86400 × 7 days × 3 RF × 1.2 = ~2.2 TB

NETWORK:
- inbound = messages_per_second × avg_message_size × replication_factor
- outbound = messages_per_second × avg_message_size × consumer_count

PARTITIONS:
- partitions = max(target_throughput ÷ 10MB_per_partition, max_consumer_instances)

CONSUMER INSTANCES:
- max_useful_instances = partition_count
- (More instances than partitions = some idle)
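The disk and network formulas translate directly into code. The following is a minimal sketch with a hypothetical class `CapacityPlanner`; sizes are in bytes, and 1 KB is taken as 1000 bytes (an assumption), which makes the worked example come out at about 2.18 × 10^12 bytes, consistent with the ~2.2 TB figure.

```java
// Hypothetical calculator for the capacity planning formulas above.
final class CapacityPlanner {
    static final double SECONDS_PER_DAY = 86_400;
    // 20% headroom, the 1.2 factor in the disk formula.
    static final double HEADROOM = 1.2;

    private CapacityPlanner() {}

    /** total_disk = daily_data × retention_days × replication_factor × 1.2 */
    static double totalDiskBytes(double messagesPerSecond, double avgMessageBytes,
                                 int retentionDays, int replicationFactor) {
        double dailyData = messagesPerSecond * avgMessageBytes * SECONDS_PER_DAY;
        return dailyData * retentionDays * replicationFactor * HEADROOM;
    }

    /** inbound = messages_per_second × avg_message_size × replication_factor */
    static double inboundBytesPerSecond(double messagesPerSecond, double avgMessageBytes,
                                        int replicationFactor) {
        return messagesPerSecond * avgMessageBytes * replicationFactor;
    }

    /** outbound = messages_per_second × avg_message_size × consumer_count */
    static double outboundBytesPerSecond(double messagesPerSecond, double avgMessageBytes,
                                         int consumerCount) {
        return messagesPerSecond * avgMessageBytes * consumerCount;
    }
}
```

For the example workload, `CapacityPlanner.totalDiskBytes(1000, 1000, 7, 3)` reproduces the disk estimate, and the same inputs give 3 MB/s of inbound traffic at a replication factor of 3.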
Series Index
| Part | Title | Key Topics |
|---|---|---|
| 0 | Introduction & Series Overview | Kafka evolution, use cases |
| 1 | Core Concepts & Architecture | Topics, partitions, brokers |
| 2 | Storage Engine Deep Dive | Log segments, indexes |
| 3 | Replication & Fault Tolerance | ISR, leaders, followers |
| 4 | Cluster Coordination | KRaft vs ZooKeeper |
| 5 | Producer Internals | Batching, compression |
| 6 | Delivery Guarantees | acks, idempotence |
| 7 | Advanced Producer Patterns | Partitioners, transactions |
| 8 | Consumer Internals | Poll loop, fetch |
| 9 | Consumer Groups | Rebalancing, assignment |
| 10 | Offset Management | Commit strategies |
| 11 | Exactly-Once Semantics | EOS, transactions |
| 12 | Schema Registry | Avro, compatibility |
| 13 | Security | Authentication, ACLs |
| 14 | Monitoring & Operations | Metrics, alerting |
| 15 | Streams Fundamentals | DSL, topologies |
| 16 | State Stores | RocksDB, queries |
| 17 | Windowing & Joins | Time semantics |
| 18 | Event-Driven Patterns | Sourcing, CQRS, Saga |
| 19 | Error Handling | Retry, DLQ |
| 20 | Testing | Unit, integration |
| 21 | Cheatsheet | This page! |
Thank you for following the Kafka Compendium series! If you found this helpful, consider sharing it with your team. For questions or feedback, reach out via the comments.