Delivery Guarantees & Idempotence

At a Glance

Aspect	Details
Topic	acks configuration, retries, idempotent producer, exactly-once
Complexity	Intermediate-Advanced
Prerequisites	Part 5 (Producer Internals)
Time	90 minutes
Spring Kafka	Retry templates, idempotence configuration

What You'll Learn

After completing this article, you will be able to:

Explain the trade-offs between acks=0, acks=1, and acks=all
Configure retries without causing message reordering
Enable idempotent producer to prevent duplicates during retries
Understand how max.in.flight.requests.per.connection affects ordering
Implement Spring Kafka producers with exactly-once semantics

Production Story: The Duplicate Payment Processing

The Incident

It was the end of the month, and our payment processing system was under heavy load. We started receiving customer complaints: "I was charged twice for the same order!" Checking the database confirmed it - some payments were indeed processed twice.

The Investigation

JAVA(20 lines)
Code
Loading syntax highlighter...

The logs showed what happened:

14:32:45.123 INFO  - Payment sent for order ORD-12345
14:32:46.234 INFO  - Payment sent for order ORD-12345
14:32:46.235 WARN  - Duplicate payment detected in consumer!

The sequence of events:

┌─────────────────────────────────────────────────────────────────────┐
│                    THE DUPLICATE SCENARIO                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  14:32:45.100  Producer sends payment for ORD-12345                 │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────────┐                    ┌─────────────┐                 │
│  │  Producer   │ ────────────────►  │   Broker    │                 │
│  │             │    ProduceRequest  │             │                 │
│  └─────────────┘                    └──────┬──────┘                 │
│                                            │                        │
│  14:32:45.200  Broker writes to log, prepares ACK                   │
│                                            │                        │
│  14:32:45.300  Network hiccup! ACK lost   │                         │
│                                            ▼                        │
│  ┌─────────────┐         X          ┌─────────────┐                 │
│  │  Producer   │ ◄───────────────── │   Broker    │                 │
│  │  (waiting)  │    ACK LOST!       │  (sent ACK) │                 │
│  └─────────────┘                    └─────────────┘                 │
│       │                                                             │
│  14:32:55.100  Producer times out (10s), thinks send failed         │
│       │                                                             │
│       ▼                                                             │
│  Application catches timeout, retries manually                      │
│       │                                                             │
│       ▼                                                             │
│  14:32:55.200  Producer sends SAME payment AGAIN                    │
│       │                                                             │
│       ▼                                                             │
│  Broker writes second copy - DUPLICATE!                             │
│                                                                     │
│  Result: Two identical payment events in the topic                  │
│          Consumer processes both = customer charged twice           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

The Root Cause

Two problems:

Manual retry without idempotence: Application-level retry after timeout creates duplicates
No idempotent producer: Kafka doesn't know the second send is a retry

The Fix

JAVA(43 lines)
Code
Loading syntax highlighter...

After the fix: Zero duplicate payments.

Mental Model: Acknowledgment Levels

┌─────────────────────────────────────────────────────────────────────────┐
│                    ACKNOWLEDGMENT LEVELS (acks)                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  acks=0 (Fire and Forget)                                               │
│  ─────────────────────────                                              │
│                                                                         │
│  ┌─────────┐    Message    ┌─────────┐                                  │
│  │Producer │ ─────────────►│ Broker  │  Leader                          │
│  │         │               │(Leader) │                                  │
│  │  Done!  │               │   ???   │  May or may not have received    │
│  └─────────┘               └─────────┘                                  │
│                                                                         │
│  • No acknowledgment waited                                             │
│  • Highest throughput, lowest latency                                   │
│  • NO DURABILITY GUARANTEE - data may be lost                           │
│  • Use case: Metrics, logs where loss is acceptable                     │
│                                                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  acks=1 (Leader Acknowledgment)                                         │
│  ──────────────────────────────                                         │
│                                                                         │
│  ┌─────────┐    Message    ┌─────────┐                                  │
│  │Producer │ ─────────────►│ Broker  │  Leader                          │
│  │         │               │(Leader) │  writes to local log             │
│  │         │    ACK        │  ████   │                                  │
│  │  Done!  │ ◄──────────── │         │                                  │
│  └─────────┘               └─────────┘                                  │
│                                 │                                       │
│                                 │ Async replication (may not complete)  │
│                                 ▼                                       │
│                            ┌─────────┐                                  │
│                            │Follower │  May not have the message yet    │
│                            │         │                                  │
│                            └─────────┘                                  │
│                                                                         │
│  • Leader acknowledges after writing to its log                         │
│  • Good throughput, reasonable latency                                  │
│  • DATA CAN BE LOST if leader fails before replication                  │
│  • Use case: Acceptable for most non-critical data                      │
│                                                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  acks=all (Full ISR Acknowledgment)                                     │
│  ──────────────────────────────────                                     │
│                                                                         │
│  ┌─────────┐    Message    ┌─────────┐                                  │
│  │Producer │ ─────────────►│ Broker  │  Leader                          │
│  │         │               │(Leader) │  writes to local log             │
│  │         │               │  ████   │                                  │
│  │         │               └────┬────┘                                  │
│  │         │                    │ Replicate to ISR                      │
│  │         │                    ▼                                       │
│  │         │               ┌─────────┐ ┌─────────┐                      │
│  │         │               │Follower1│ │Follower2│  All ISR write       │
│  │         │               │  ████   │ │  ████   │                      │
│  │         │               └────┬────┘ └────┬────┘                      │
│  │         │                    │           │                           │
│  │         │                    └─────┬─────┘                           │
│  │         │    ACK                   │                                 │
│  │  Done!  │ ◄────────────────────────┘                                 │
│  └─────────┘                                                            │
│                                                                         │
│  • All replicas in ISR must acknowledge                                 │
│  • Lowest throughput, highest latency                                   │
│  • STRONGEST DURABILITY - survives leader failure                       │
│  • Required for: Financial data, critical events                        │
│                                                                         │
│  IMPORTANT: Combine with min.insync.replicas for true durability        │
│  acks=all + min.insync.replicas=2 = at least 2 copies before ACK        │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Trade-off Summary

┌──────────┬─────────────┬─────────────┬─────────────┬──────────────────┐
│   acks   │ Throughput  │   Latency   │  Durability │   Data Loss?     │
├──────────┼─────────────┼─────────────┼─────────────┼──────────────────┤
│    0     │   Highest   │   Lowest    │    None     │ Yes (any failure)│
├──────────┼─────────────┼─────────────┼─────────────┼──────────────────┤
│    1     │    High     │    Low      │   Leader    │ Yes (leader fail)│
├──────────┼─────────────┼─────────────┼─────────────┼──────────────────┤
│   all    │   Medium    │   Higher    │    ISR      │ No (if ISR >= 2) │
└──────────┴─────────────┴─────────────┴─────────────┴──────────────────┘

Deep Dive

1. The acks Parameter in Detail

JAVA(49 lines)
Code
Loading syntax highlighter...

What acks=all Actually Means

Scenario: Topic with replication.factor=3, min.insync.replicas=2

ISR = {Broker1 (Leader), Broker2, Broker3}

┌─────────────────────────────────────────────────────────────────────┐
│  Producer sends with acks=all                                       │
│                                                                     │
│  1. Message arrives at Leader (Broker1)                             │
│     └── Written to leader's log                                     │
│                                                                     │
│  2. Leader waits for ISR replicas to fetch and write                │
│     ├── Broker2 fetches, writes → ACKs to leader                    │
│     └── Broker3 fetches, writes → ACKs to leader                    │
│                                                                     │
│  3. Leader has 3 confirmations (self + 2 followers)                 │
│     └── All ISR have the message                                    │
│                                                                     │
│  4. Leader sends ACK to producer                                    │
│     └── Producer callback invoked with success                      │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

What if Broker3 is slow (falls out of ISR)?

ISR = {Broker1 (Leader), Broker2}  ← Only 2 replicas in sync

With min.insync.replicas=2:
  • acks=all waits for Broker1 + Broker2 (2 replicas)
  • Message acknowledged when 2 copies exist
  • Still durable even if one of them fails

With min.insync.replicas=1:
  • acks=all waits for Broker1 only (ISR leader is enough)
  • DANGEROUS: Only 1 copy, leader failure = data loss

2. Retries and Ordering

JAVA(24 lines)
Code
Loading syntax highlighter...

Solutions to Ordering Problem

JAVA(39 lines)
Code
Loading syntax highlighter...

3. Idempotent Producer Deep Dive

HOW IDEMPOTENT PRODUCER WORKS:

┌─────────────────────────────────────────────────────────────────────┐
│  Producer Initialization                                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Producer requests PID (Producer ID) from broker                 │
│     └── Unique ID for this producer instance                        │
│                                                                     │
│  2. Producer maintains sequence number per partition                │
│     └── Starts at 0, increments with each batch                     │
│                                                                     │
│  Producer State:                                                    │
│  ┌──────────────────────────────────────────┐                       │
│  │  PID: 1234                               │                       │
│  │  Sequence Numbers:                       │                       │
│  │    orders-0: 42                          │                       │
│  │    orders-1: 17                          │                       │
│  │    orders-2: 89                          │                       │
│  └──────────────────────────────────────────┘                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│  Sending Messages                                                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Each batch includes: PID + Sequence Number                         │
│                                                                     │
│  ┌──────────────────────┐                                           │
│  │  Batch Header        │                                           │
│  │  ├── PID: 1234       │                                           │
│  │  ├── Epoch: 0        │                                           │
│  │  └── Seq: 42         │                                           │
│  │  Messages: [m1, m2]  │                                           │
│  └──────────────────────┘                                           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│  Broker Processing                                                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Broker tracks: Last sequence number per PID per partition          │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────┐       │
│  │  Broker State for orders-0:                              │       │
│  │  ┌────────────────────────────────────────────────────┐  │       │
│  │  │  PID 1234: last_seq=41                             │  │       │
│  │  │  PID 5678: last_seq=99                             │  │       │
│  │  └────────────────────────────────────────────────────┘  │       │
│  └──────────────────────────────────────────────────────────┘       │
│                                                                     │
│  When batch arrives with PID=1234, Seq=42:                          │
│  ├── Expected seq: 42 (last_seq + 1)                                │
│  ├── Received seq: 42                                               │
│  └── ✓ Accept, update last_seq=42                                   │
│                                                                     │
│  When batch arrives with PID=1234, Seq=42 (retry):                  │
│  ├── Expected seq: 43 (last_seq + 1)                                │
│  ├── Received seq: 42 (already seen!)                               │
│  └── ✓ Deduplicate (ACK but don't write again)                      │
│                                                                     │
│  When batch arrives with PID=1234, Seq=44 (gap):                    │
│  ├── Expected seq: 43                                               │
│  ├── Received seq: 44                                               │
│  └── ✗ OutOfOrderSequenceException (sequence gap)                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Idempotent Producer Limitations

JAVA(29 lines)
Code
Loading syntax highlighter...

4. Configuring Delivery Timeout

JAVA(30 lines)
Code
Loading syntax highlighter...

Timeout Visualization

DELIVERY TIMEOUT TIMELINE:

delivery.timeout.ms = 30000ms (30s)
request.timeout.ms = 10000ms (10s)
retry.backoff.ms = 100ms

                    ├── Request 1 ──┤
T=0                 │               │
send()              ├───────────────┤
                    │  10s timeout  │
                    │               │
T=10s               │   FAILED!     │
                    │               │
                    ├── Backoff ────┤
                    │    100ms      │
T=10.1s             │               │
                    ├── Request 2 ──┤
                    │               │
                    │  10s timeout  │
                    │               │
T=20.1s             │   FAILED!     │
                    │               │
                    ├── Backoff ────┤
T=20.2s             │               │
                    ├── Request 3 ──┤
                    │               │
T=30s ─────────────►│ DELIVERY      │
                    │ TIMEOUT!      │
                    │               │
                    TimeoutException thrown
                    (not enough time for request 3 to complete)


With successful send:

T=0      send()
         ├── Request 1 ──┤
T=0.1s   │   SUCCESS!    │
         │               │
         ▼
         Callback invoked with RecordMetadata

5. Spring Kafka Error Handling

JAVA(51 lines)
Code
Loading syntax highlighter...

Per-Send Error Handling

JAVA(60 lines)
Code
Loading syntax highlighter...

6. Choosing the Right Configuration

JAVA(65 lines)
Code
Loading syntax highlighter...

Common Mistakes

Mistake 1: Manual Retries with Idempotent Producer

JAVA(23 lines)
Code
Loading syntax highlighter...

Mistake 2: acks=all Without min.insync.replicas

JAVA(22 lines)
Code
Loading syntax highlighter...

Mistake 3: Ignoring Idempotent Producer Requirements

JAVA(10 lines)
Code
Loading syntax highlighter...

Mistake 4: Short Delivery Timeout with High Retries

JAVA(11 lines)
Code
Loading syntax highlighter...

Mistake 5: Not Understanding "At Least Once"

JAVA(16 lines)
Code
Loading syntax highlighter...

Debug This

Scenario: OutOfOrderSequenceException

Symptoms:

Producer suddenly starts failing with OutOfOrderSequenceException
Error message: "The broker received an out of order sequence number"
Happens after network partition resolved

Investigation:

JAVA(12 lines)
Code
Loading syntax highlighter...

BASH(5 lines)
Code
Loading syntax highlighter...

Resolution:

JAVA(30 lines)
Code
Loading syntax highlighter...

Exercises

Exercise 1: Durability Testing

Create a test that:

Sets up a 3-broker Kafka cluster
Produces messages with acks=all
Kills the leader broker mid-send
Verifies no messages are lost
Compares behavior with acks=1

Exercise 2: Retry Behavior Analysis

Write a test that:

Simulates network latency (using Toxiproxy or similar)
Sends messages with different delivery.timeout.ms values
Measures actual retry counts
Documents the relationship between timeout and retries

Exercise 3: Idempotent Producer Verification

Build a test that:

Enables idempotent producer
Uses MockProducer or network simulation to force retries
Verifies broker receives each message exactly once
Checks sequence numbers in ProduceRequest

Exercise 4: Error Classification

Create an error handler that:

Categorizes all Kafka producer exceptions
Implements appropriate handling for each:
- Retryable vs non-retryable
- Alerting requirements
- Dead letter routing
Tracks error rates by type

Exercise 5: Durability Configuration Validator

Build a Spring Boot component that:

On startup, validates producer config against topic config
Verifies acks=all topics have min.insync.replicas >= 2
Warns if idempotence is disabled for critical topics
Fails startup if durability requirements not met

Interview Questions

Q1: Explain the difference between acks=1 and acks=all. When would you use each?

A: The difference is in durability guarantees:

acks=1 (Leader Acknowledgment):

Producer waits only for leader to write
If leader fails before replication, message is lost
Faster than acks=all (one less network hop)

Use acks=1 for:

Event streams where occasional loss is acceptable
High-throughput scenarios where latency matters
Non-critical data (metrics, logs)

acks=all (ISR Acknowledgment):

Producer waits for all ISR replicas to write
Message survives leader failure
Requires min.insync.replicas >= 2 for true durability

Use acks=all for:

Financial transactions
Order processing
Any data where loss is unacceptable
Audit logs, compliance data

Key insight: acks=all alone isn't enough. With min.insync.replicas=1 (default), if only the leader is in ISR, acks=all behaves like acks=1.

Q2: What is an idempotent producer and what problems does it solve?

A: An idempotent producer guarantees that retries don't create duplicate messages in the broker:

How it works:

Producer gets unique PID (Producer ID) from broker
Each batch includes PID + sequence number
Broker tracks last sequence per PID per partition
If duplicate batch arrives (same PID + seq), broker acknowledges but doesn't write

Problems it solves:

Duplicate messages from retries:

Producer sends, network times out
Broker actually received and wrote message
Producer retries → would create duplicate without idempotence

Out-of-order messages:

With max.in.flight > 1, batches can arrive out of order
Idempotent producer allows broker to reorder

Requirements:

acks=all (automatic)
max.in.flight.requests.per.connection <= 5
Retries = MAX_VALUE (automatic)

Limitations:

Only within single producer session (new PID on restart)
Per-partition guarantee (not cross-partition)
For full exactly-once, need transactional producer

Q3: How does max.in.flight.requests.per.connection affect ordering, and why is 5 safe with idempotence?

A: This setting controls how many unacknowledged requests can be in flight:

Without idempotence:

max.in.flight > 1 can cause reordering
Batch A fails, Batch B succeeds, Batch A retried
Final order: B, A (wrong!)

With idempotence (max.in.flight <= 5):

Broker tracks sequence numbers
Can buffer and reorder up to 5 out-of-order batches
If Batch B arrives before Batch A, broker waits for A
Guarantees proper ordering

Why exactly 5?:

Trade-off between throughput and memory
More in-flight = higher throughput (pipelining)
But requires more broker memory for buffering
5 chosen as optimal balance
Higher values would increase broker memory pressure

Practical implications:

max.in.flight=1: Strict ordering, lower throughput
max.in.flight=5 + idempotence: Both ordering and throughput
max.in.flight>5 + idempotence: Not supported, error

Q4: What happens when delivery.timeout.ms is exceeded?

A: When delivery timeout expires, the send permanently fails:

Timeline:

Producer sends message, starts delivery timeout timer
Request fails (network, broker error, etc.)
Producer retries automatically (with idempotence)
Retries continue with backoff
If timeout expires before successful ACK:
- TimeoutException thrown
- Callback invoked with exception
- Message NOT in topic (or might be - unknown state)

The "unknown state" problem:

Timeout doesn't mean message wasn't written
Broker might have written but ACK was lost
Producer can't know for sure
This is why idempotence matters - retry is safe

Best practices:

Set delivery timeout based on acceptable latency
Handle TimeoutException explicitly
For critical data, save to dead letter queue
Consider: is it better to retry or alert?

Configuration relationship:

delivery.timeout.ms >= linger.ms + request.timeout.ms

Effective retry time = delivery.timeout.ms - linger.ms
Number of retries ≈ (effective retry time) / (request.timeout + backoff)

Q5: In what scenarios would you NOT use acks=all?

A: Despite acks=all being the safest, there are valid reasons to use other settings:

Use acks=0 when:

Data loss is completely acceptable
Maximum throughput/minimum latency required
Examples: Real-time metrics, clickstream data, non-critical logs
Trade-off: ~10x throughput improvement over acks=all

Use acks=1 when:

Occasional loss acceptable but not common case
Need balance between durability and throughput
Single-datacenter deployment (leader failure rare)
Examples: Event streams, user activity tracking
Trade-off: ~2-3x throughput improvement over acks=all

Specific scenarios favoring lower acks:

High-frequency metrics: Losing 0.01% of metrics acceptable
Real-time analytics: Old data becomes stale quickly anyway
CDC replication: Source database has the source of truth
Logging: Better to lose some logs than slow down application

When to always use acks=all:

Financial transactions
Order/inventory systems
Compliance and audit data
Any data that triggers downstream actions
When replay/recovery would be costly

Summary

Key Takeaways

acks=0, 1, all provide different trade-offs between throughput and durability
acks=all requires min.insync.replicas >= 2 for true durability
Idempotent producer prevents duplicates during retries by tracking sequence numbers
max.in.flight.requests.per.connection <= 5 is safe with idempotence for ordering
delivery.timeout.ms is the total time allowed for a send including all retries
Retries are automatic with idempotent producer - don't implement manual retry
TimeoutException doesn't mean message wasn't written - state is unknown
Enable idempotence by default for most production use cases

Quick Reference

Delivery Guarantee Configurations

PROPERTIES(17 lines)
Code
Loading syntax highlighter...

Error Types and Handling

Error	Retryable	Action
TimeoutException	N/A	Dead letter / Alert
NotEnoughReplicasException	Yes	Wait for cluster recovery
RecordTooLargeException	No	Fix message size
SerializationException	No	Fix serialization
OutOfOrderSequenceException	No	Recreate producer
InvalidProducerEpochException	No	Recreate producer

Configuration Constraints

With enable.idempotence=true:
  ├── acks MUST be "all"
  ├── retries MUST be > 0 (auto: MAX_VALUE)
  └── max.in.flight MUST be <= 5

With transactional.id set:
  ├── enable.idempotence MUST be true
  └── All idempotence constraints apply

Previous	Current	Next
Part 5: Producer Internals	Part 6: Delivery Guarantees	Part 7: Advanced Producer Patterns

Series Overview

Part 0: How to Use This Series
Parts 1-4: Fundamentals
Parts 5-7: Producers (Internals, Delivery Guarantees, Advanced Patterns)
Parts 8-11: Consumers
Parts 12-14: Operations
Parts 15-17: Kafka Streams
Parts 18-20: Patterns & Practices
Part 21: Cheatsheet & Decision Guide