Devops
Delivery Guarantees & Idempotence
At a Glance
| Aspect | Details |
|---|---|
| Topic | acks configuration, retries, idempotent producer, exactly-once |
| Complexity | Intermediate-Advanced |
| Prerequisites | Part 5 (Producer Internals) |
| Time | 90 minutes |
| Spring Kafka | Retry templates, idempotence configuration |
What You'll Learn
After completing this article, you will be able to:
- Explain the trade-offs between acks=0, acks=1, and acks=all
- Configure retries without causing message reordering
- Enable idempotent producer to prevent duplicates during retries
- Understand how
max.in.flight.requests.per.connectionaffects ordering - Implement Spring Kafka producers with exactly-once semantics
Production Story: The Duplicate Payment Processing
The Incident
It was the end of the month, and our payment processing system was under heavy load. We started receiving customer complaints: "I was charged twice for the same order!" Checking the database confirmed it - some payments were indeed processed twice.
The Investigation
JAVA(20 lines)CodeLoading syntax highlighter...
The logs showed what happened:
14:32:45.123 INFO - Payment sent for order ORD-12345 14:32:46.234 INFO - Payment sent for order ORD-12345 14:32:46.235 WARN - Duplicate payment detected in consumer!
The sequence of events:
┌─────────────────────────────────────────────────────────────────────┐ │ THE DUPLICATE SCENARIO │ ├─────────────────────────────────────────────────────────────────────┤ │ │ │ 14:32:45.100 Producer sends payment for ORD-12345 │ │ │ │ │ ▼ │ │ ┌─────────────┐ ┌─────────────┐ │ │ │ Producer │ ────────────────► │ Broker │ │ │ │ │ ProduceRequest │ │ │ │ └─────────────┘ └──────┬──────┘ │ │ │ │ │ 14:32:45.200 Broker writes to log, prepares ACK │ │ │ │ │ 14:32:45.300 Network hiccup! ACK lost │ │ │ ▼ │ │ ┌─────────────┐ X ┌─────────────┐ │ │ │ Producer │ ◄───────────────── │ Broker │ │ │ │ (waiting) │ ACK LOST! │ (sent ACK) │ │ │ └─────────────┘ └─────────────┘ │ │ │ │ │ 14:32:55.100 Producer times out (10s), thinks send failed │ │ │ │ │ ▼ │ │ Application catches timeout, retries manually │ │ │ │ │ ▼ │ │ 14:32:55.200 Producer sends SAME payment AGAIN │ │ │ │ │ ▼ │ │ Broker writes second copy - DUPLICATE! │ │ │ │ Result: Two identical payment events in the topic │ │ Consumer processes both = customer charged twice │ │ │ └─────────────────────────────────────────────────────────────────────┘
The Root Cause
Two problems:
- Manual retry without idempotence: Application-level retry after timeout creates duplicates
- No idempotent producer: Kafka doesn't know the second send is a retry
The Fix
JAVA(43 lines)CodeLoading syntax highlighter...
After the fix: Zero duplicate payments.
Mental Model: Acknowledgment Levels
┌─────────────────────────────────────────────────────────────────────────┐ │ ACKNOWLEDGMENT LEVELS (acks) │ ├─────────────────────────────────────────────────────────────────────────┤ │ │ │ acks=0 (Fire and Forget) │ │ ───────────────────────── │ │ │ │ ┌─────────┐ Message ┌─────────┐ │ │ │Producer │ ─────────────►│ Broker │ Leader │ │ │ │ │(Leader) │ │ │ │ Done! │ │ ??? │ May or may not have received │ │ └─────────┘ └─────────┘ │ │ │ │ • No acknowledgment waited │ │ • Highest throughput, lowest latency │ │ • NO DURABILITY GUARANTEE - data may be lost │ │ • Use case: Metrics, logs where loss is acceptable │ │ │ ├─────────────────────────────────────────────────────────────────────────┤ │ │ │ acks=1 (Leader Acknowledgment) │ │ ────────────────────────────── │ │ │ │ ┌─────────┐ Message ┌─────────┐ │ │ │Producer │ ─────────────►│ Broker │ Leader │ │ │ │ │(Leader) │ writes to local log │ │ │ │ ACK │ ████ │ │ │ │ Done! │ ◄──────────── │ │ │ │ └─────────┘ └─────────┘ │ │ │ │ │ │ Async replication (may not complete) │ │ ▼ │ │ ┌─────────┐ │ │ │Follower │ May not have the message yet │ │ │ │ │ │ └─────────┘ │ │ │ │ • Leader acknowledges after writing to its log │ │ • Good throughput, reasonable latency │ │ • DATA CAN BE LOST if leader fails before replication │ │ • Use case: Acceptable for most non-critical data │ │ │ ├─────────────────────────────────────────────────────────────────────────┤ │ │ │ acks=all (Full ISR Acknowledgment) │ │ ────────────────────────────────── │ │ │ │ ┌─────────┐ Message ┌─────────┐ │ │ │Producer │ ─────────────►│ Broker │ Leader │ │ │ │ │(Leader) │ writes to local log │ │ │ │ │ ████ │ │ │ │ │ └────┬────┘ │ │ │ │ │ Replicate to ISR │ │ │ │ ▼ │ │ │ │ ┌─────────┐ ┌─────────┐ │ │ │ │ │Follower1│ │Follower2│ All ISR write │ │ │ │ │ ████ │ │ ████ │ │ │ │ │ └────┬────┘ └────┬────┘ │ │ │ │ │ │ │ │ │ │ └─────┬─────┘ │ │ │ │ ACK │ │ │ │ Done! │ ◄────────────────────────┘ │ │ └─────────┘ │ │ │ │ • All replicas in ISR must acknowledge │ │ • Lowest throughput, highest latency │ │ • STRONGEST DURABILITY - survives leader failure │ │ • Required for: Financial data, critical events │ │ │ │ IMPORTANT: Combine with min.insync.replicas for true durability │ │ acks=all + min.insync.replicas=2 = at least 2 copies before ACK │ │ │ └─────────────────────────────────────────────────────────────────────────┘
Trade-off Summary
┌──────────┬─────────────┬─────────────┬─────────────┬──────────────────┐ │ acks │ Throughput │ Latency │ Durability │ Data Loss? │ ├──────────┼─────────────┼─────────────┼─────────────┼──────────────────┤ │ 0 │ Highest │ Lowest │ None │ Yes (any failure)│ ├──────────┼─────────────┼─────────────┼─────────────┼──────────────────┤ │ 1 │ High │ Low │ Leader │ Yes (leader fail)│ ├──────────┼─────────────┼─────────────┼─────────────┼──────────────────┤ │ all │ Medium │ Higher │ ISR │ No (if ISR >= 2) │ └──────────┴─────────────┴─────────────┴─────────────┴──────────────────┘
Deep Dive
1. The acks Parameter in Detail
JAVA(49 lines)CodeLoading syntax highlighter...
What acks=all Actually Means
Scenario: Topic with replication.factor=3, min.insync.replicas=2 ISR = {Broker1 (Leader), Broker2, Broker3} ┌─────────────────────────────────────────────────────────────────────┐ │ Producer sends with acks=all │ │ │ │ 1. Message arrives at Leader (Broker1) │ │ └── Written to leader's log │ │ │ │ 2. Leader waits for ISR replicas to fetch and write │ │ ├── Broker2 fetches, writes → ACKs to leader │ │ └── Broker3 fetches, writes → ACKs to leader │ │ │ │ 3. Leader has 3 confirmations (self + 2 followers) │ │ └── All ISR have the message │ │ │ │ 4. Leader sends ACK to producer │ │ └── Producer callback invoked with success │ │ │ └─────────────────────────────────────────────────────────────────────┘ What if Broker3 is slow (falls out of ISR)? ISR = {Broker1 (Leader), Broker2} ← Only 2 replicas in sync With min.insync.replicas=2: • acks=all waits for Broker1 + Broker2 (2 replicas) • Message acknowledged when 2 copies exist • Still durable even if one of them fails With min.insync.replicas=1: • acks=all waits for Broker1 only (ISR leader is enough) • DANGEROUS: Only 1 copy, leader failure = data loss
2. Retries and Ordering
JAVA(24 lines)CodeLoading syntax highlighter...
Solutions to Ordering Problem
JAVA(39 lines)CodeLoading syntax highlighter...
3. Idempotent Producer Deep Dive
HOW IDEMPOTENT PRODUCER WORKS: ┌─────────────────────────────────────────────────────────────────────┐ │ Producer Initialization │ ├─────────────────────────────────────────────────────────────────────┤ │ │ │ 1. Producer requests PID (Producer ID) from broker │ │ └── Unique ID for this producer instance │ │ │ │ 2. Producer maintains sequence number per partition │ │ └── Starts at 0, increments with each batch │ │ │ │ Producer State: │ │ ┌──────────────────────────────────────────┐ │ │ │ PID: 1234 │ │ │ │ Sequence Numbers: │ │ │ │ orders-0: 42 │ │ │ │ orders-1: 17 │ │ │ │ orders-2: 89 │ │ │ └──────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────┘ ┌─────────────────────────────────────────────────────────────────────┐ │ Sending Messages │ ├─────────────────────────────────────────────────────────────────────┤ │ │ │ Each batch includes: PID + Sequence Number │ │ │ │ ┌──────────────────────┐ │ │ │ Batch Header │ │ │ │ ├── PID: 1234 │ │ │ │ ├── Epoch: 0 │ │ │ │ └── Seq: 42 │ │ │ │ Messages: [m1, m2] │ │ │ └──────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────┘ ┌─────────────────────────────────────────────────────────────────────┐ │ Broker Processing │ ├─────────────────────────────────────────────────────────────────────┤ │ │ │ Broker tracks: Last sequence number per PID per partition │ │ │ │ ┌──────────────────────────────────────────────────────────┐ │ │ │ Broker State for orders-0: │ │ │ │ ┌────────────────────────────────────────────────────┐ │ │ │ │ │ PID 1234: last_seq=41 │ │ │ │ │ │ PID 5678: last_seq=99 │ │ │ │ │ └────────────────────────────────────────────────────┘ │ │ │ └──────────────────────────────────────────────────────────┘ │ │ │ │ When batch arrives with PID=1234, Seq=42: │ │ ├── Expected seq: 42 (last_seq + 1) │ │ ├── Received seq: 42 │ │ └── ✓ Accept, update last_seq=42 │ │ │ │ When batch arrives with PID=1234, Seq=42 (retry): │ │ ├── Expected seq: 43 (last_seq + 1) │ │ ├── Received seq: 42 (already seen!) │ │ └── ✓ Deduplicate (ACK but don't write again) │ │ │ │ When batch arrives with PID=1234, Seq=44 (gap): │ │ ├── Expected seq: 43 │ │ ├── Received seq: 44 │ │ └── ✗ OutOfOrderSequenceException (sequence gap) │ │ │ └─────────────────────────────────────────────────────────────────────┘
Idempotent Producer Limitations
JAVA(29 lines)CodeLoading syntax highlighter...
4. Configuring Delivery Timeout
JAVA(30 lines)CodeLoading syntax highlighter...
Timeout Visualization
DELIVERY TIMEOUT TIMELINE: delivery.timeout.ms = 30000ms (30s) request.timeout.ms = 10000ms (10s) retry.backoff.ms = 100ms ├── Request 1 ──┤ T=0 │ │ send() ├───────────────┤ │ 10s timeout │ │ │ T=10s │ FAILED! │ │ │ ├── Backoff ────┤ │ 100ms │ T=10.1s │ │ ├── Request 2 ──┤ │ │ │ 10s timeout │ │ │ T=20.1s │ FAILED! │ │ │ ├── Backoff ────┤ T=20.2s │ │ ├── Request 3 ──┤ │ │ T=30s ─────────────►│ DELIVERY │ │ TIMEOUT! │ │ │ TimeoutException thrown (not enough time for request 3 to complete) With successful send: T=0 send() ├── Request 1 ──┤ T=0.1s │ SUCCESS! │ │ │ ▼ Callback invoked with RecordMetadata
5. Spring Kafka Error Handling
JAVA(51 lines)CodeLoading syntax highlighter...
Per-Send Error Handling
JAVA(60 lines)CodeLoading syntax highlighter...
6. Choosing the Right Configuration
JAVA(65 lines)CodeLoading syntax highlighter...
Common Mistakes
Mistake 1: Manual Retries with Idempotent Producer
JAVA(23 lines)CodeLoading syntax highlighter...
Mistake 2: acks=all Without min.insync.replicas
JAVA(22 lines)CodeLoading syntax highlighter...
Mistake 3: Ignoring Idempotent Producer Requirements
JAVA(10 lines)CodeLoading syntax highlighter...
Mistake 4: Short Delivery Timeout with High Retries
JAVA(11 lines)CodeLoading syntax highlighter...
Mistake 5: Not Understanding "At Least Once"
JAVA(16 lines)CodeLoading syntax highlighter...
Debug This
Scenario: OutOfOrderSequenceException
Symptoms:
- Producer suddenly starts failing with
OutOfOrderSequenceException - Error message: "The broker received an out of order sequence number"
- Happens after network partition resolved
Investigation:
JAVA(12 lines)CodeLoading syntax highlighter...
BASH(5 lines)CodeLoading syntax highlighter...
Resolution:
JAVA(30 lines)CodeLoading syntax highlighter...
Exercises
Exercise 1: Durability Testing
Create a test that:
- Sets up a 3-broker Kafka cluster
- Produces messages with acks=all
- Kills the leader broker mid-send
- Verifies no messages are lost
- Compares behavior with acks=1
Exercise 2: Retry Behavior Analysis
Write a test that:
- Simulates network latency (using Toxiproxy or similar)
- Sends messages with different
delivery.timeout.msvalues - Measures actual retry counts
- Documents the relationship between timeout and retries
Exercise 3: Idempotent Producer Verification
Build a test that:
- Enables idempotent producer
- Uses
MockProduceror network simulation to force retries - Verifies broker receives each message exactly once
- Checks sequence numbers in ProduceRequest
Exercise 4: Error Classification
Create an error handler that:
- Categorizes all Kafka producer exceptions
- Implements appropriate handling for each:
- Retryable vs non-retryable
- Alerting requirements
- Dead letter routing
- Tracks error rates by type
Exercise 5: Durability Configuration Validator
Build a Spring Boot component that:
- On startup, validates producer config against topic config
- Verifies acks=all topics have min.insync.replicas >= 2
- Warns if idempotence is disabled for critical topics
- Fails startup if durability requirements not met
Interview Questions
Q1: Explain the difference between acks=1 and acks=all. When would you use each?
A: The difference is in durability guarantees:
acks=1 (Leader Acknowledgment):
- Producer waits only for leader to write
- If leader fails before replication, message is lost
- Faster than acks=all (one less network hop)
Use acks=1 for:
- Event streams where occasional loss is acceptable
- High-throughput scenarios where latency matters
- Non-critical data (metrics, logs)
acks=all (ISR Acknowledgment):
- Producer waits for all ISR replicas to write
- Message survives leader failure
- Requires
min.insync.replicas>= 2 for true durability
Use acks=all for:
- Financial transactions
- Order processing
- Any data where loss is unacceptable
- Audit logs, compliance data
Key insight: acks=all alone isn't enough. With
min.insync.replicas=1 (default), if only the leader is in ISR, acks=all behaves like acks=1.Q2: What is an idempotent producer and what problems does it solve?
A: An idempotent producer guarantees that retries don't create duplicate messages in the broker:
How it works:
- Producer gets unique PID (Producer ID) from broker
- Each batch includes PID + sequence number
- Broker tracks last sequence per PID per partition
- If duplicate batch arrives (same PID + seq), broker acknowledges but doesn't write
Problems it solves:
Duplicate messages from retries:
- Producer sends, network times out
- Broker actually received and wrote message
- Producer retries → would create duplicate without idempotence
Out-of-order messages:
- With max.in.flight > 1, batches can arrive out of order
- Idempotent producer allows broker to reorder
Requirements:
- acks=all (automatic)
- max.in.flight.requests.per.connection <= 5
- Retries = MAX_VALUE (automatic)
Limitations:
- Only within single producer session (new PID on restart)
- Per-partition guarantee (not cross-partition)
- For full exactly-once, need transactional producer
Q3: How does max.in.flight.requests.per.connection affect ordering, and why is 5 safe with idempotence?
A: This setting controls how many unacknowledged requests can be in flight:
Without idempotence:
- max.in.flight > 1 can cause reordering
- Batch A fails, Batch B succeeds, Batch A retried
- Final order: B, A (wrong!)
With idempotence (max.in.flight <= 5):
- Broker tracks sequence numbers
- Can buffer and reorder up to 5 out-of-order batches
- If Batch B arrives before Batch A, broker waits for A
- Guarantees proper ordering
Why exactly 5?:
- Trade-off between throughput and memory
- More in-flight = higher throughput (pipelining)
- But requires more broker memory for buffering
- 5 chosen as optimal balance
- Higher values would increase broker memory pressure
Practical implications:
- max.in.flight=1: Strict ordering, lower throughput
- max.in.flight=5 + idempotence: Both ordering and throughput
- max.in.flight>5 + idempotence: Not supported, error
Q4: What happens when delivery.timeout.ms is exceeded?
A: When delivery timeout expires, the send permanently fails:
Timeline:
- Producer sends message, starts delivery timeout timer
- Request fails (network, broker error, etc.)
- Producer retries automatically (with idempotence)
- Retries continue with backoff
- If timeout expires before successful ACK:
- TimeoutException thrown
- Callback invoked with exception
- Message NOT in topic (or might be - unknown state)
The "unknown state" problem:
- Timeout doesn't mean message wasn't written
- Broker might have written but ACK was lost
- Producer can't know for sure
- This is why idempotence matters - retry is safe
Best practices:
- Set delivery timeout based on acceptable latency
- Handle TimeoutException explicitly
- For critical data, save to dead letter queue
- Consider: is it better to retry or alert?
Configuration relationship:
delivery.timeout.ms >= linger.ms + request.timeout.ms Effective retry time = delivery.timeout.ms - linger.ms Number of retries ≈ (effective retry time) / (request.timeout + backoff)
Q5: In what scenarios would you NOT use acks=all?
A: Despite acks=all being the safest, there are valid reasons to use other settings:
Use acks=0 when:
- Data loss is completely acceptable
- Maximum throughput/minimum latency required
- Examples: Real-time metrics, clickstream data, non-critical logs
- Trade-off: ~10x throughput improvement over acks=all
Use acks=1 when:
- Occasional loss acceptable but not common case
- Need balance between durability and throughput
- Single-datacenter deployment (leader failure rare)
- Examples: Event streams, user activity tracking
- Trade-off: ~2-3x throughput improvement over acks=all
Specific scenarios favoring lower acks:
- High-frequency metrics: Losing 0.01% of metrics acceptable
- Real-time analytics: Old data becomes stale quickly anyway
- CDC replication: Source database has the source of truth
- Logging: Better to lose some logs than slow down application
When to always use acks=all:
- Financial transactions
- Order/inventory systems
- Compliance and audit data
- Any data that triggers downstream actions
- When replay/recovery would be costly
Summary
Key Takeaways
-
acks=0, 1, all provide different trade-offs between throughput and durability
-
acks=all requires min.insync.replicas >= 2 for true durability
-
Idempotent producer prevents duplicates during retries by tracking sequence numbers
-
max.in.flight.requests.per.connection <= 5 is safe with idempotence for ordering
-
delivery.timeout.ms is the total time allowed for a send including all retries
-
Retries are automatic with idempotent producer - don't implement manual retry
-
TimeoutException doesn't mean message wasn't written - state is unknown
-
Enable idempotence by default for most production use cases
Quick Reference
Delivery Guarantee Configurations
PROPERTIES(17 lines)CodeLoading syntax highlighter...
Error Types and Handling
| Error | Retryable | Action |
|---|---|---|
| TimeoutException | N/A | Dead letter / Alert |
| NotEnoughReplicasException | Yes | Wait for cluster recovery |
| RecordTooLargeException | No | Fix message size |
| SerializationException | No | Fix serialization |
| OutOfOrderSequenceException | No | Recreate producer |
| InvalidProducerEpochException | No | Recreate producer |
Configuration Constraints
With enable.idempotence=true: ├── acks MUST be "all" ├── retries MUST be > 0 (auto: MAX_VALUE) └── max.in.flight MUST be <= 5 With transactional.id set: ├── enable.idempotence MUST be true └── All idempotence constraints apply
Series Navigation
| Previous | Current | Next |
|---|---|---|
| Part 5: Producer Internals | Part 6: Delivery Guarantees | Part 7: Advanced Producer Patterns |
Series Overview
- Part 0: How to Use This Series
- Parts 1-4: Fundamentals
- Parts 5-7: Producers (Internals, Delivery Guarantees, Advanced Patterns)
- Parts 8-11: Consumers
- Parts 12-14: Operations
- Parts 15-17: Kafka Streams
- Parts 18-20: Patterns & Practices
- Part 21: Cheatsheet & Decision Guide