Devops

Offset Management


At a Glance

AspectDetails
TopicAuto vs manual commit, commit strategies, seeking, consumer lag
ComplexityIntermediate
PrerequisitesParts 8-9 (Consumer Internals, Consumer Groups)
Time90 minutes
Spring KafkaAckMode options, manual acknowledgment

What You'll Learn

After completing this article, you will be able to:

  1. Choose between auto and manual offset commit strategies
  2. Implement different commit patterns (per-message, per-batch, async)
  3. Use seeking to replay or skip messages
  4. Monitor consumer lag and understand its implications
  5. Configure Spring Kafka AckMode for your use case

Production Story: The Silent Data Loss

The Incident

Our analytics pipeline was processing clickstream data. Everything looked fine in monitoring - no errors, healthy throughput. But one day, a data analyst noticed something alarming: "We're missing 15% of our click events from yesterday. The producer sent 10 million, but we only have 8.5 million in the data warehouse."

The Investigation

JAVA(21 lines)
Code
Loading syntax highlighter...

The timeline revealed the problem:

┌─────────────────────────────────────────────────────────────────────┐
│                    THE AUTO-COMMIT DATA LOSS                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Configuration:                                                     │
│  • enable.auto.commit = true                                        │
│  • auto.commit.interval.ms = 5000 (5 seconds)                       │
│  • max.poll.records = 1000                                          │
│  • Data warehouse insert time = 3-8 seconds                         │
│                                                                     │
│  Timeline:                                                          │
│  ─────────────────────────────────────────────────────────────────  │
│                                                                     │
│  T=0.0s   poll() returns 1000 records (offsets 0-999)               │
│  T=0.1s   Start batch insert to data warehouse                      │
│           ...                                                       │
│  T=5.0s   AUTO-COMMIT HAPPENS (offsets committed!)                  │
│           But insert still in progress...                           │
│           ...                                                       │
│  T=7.0s   Data warehouse insert fails (timeout)                     │
│           Exception thrown                                          │
│                                                                     │
│  BUT:                                                               │
│  • Offsets already committed at T=5s                                │
│  • Those 1000 records are "consumed" in Kafka's view                │
│  • They're NOT in the data warehouse                                │
│  • Consumer moves to next batch                                     │
│  • 1000 events LOST!                                                │
│                                                                     │
│  This happened multiple times yesterday:                            │
│  • ~15 failures × 1000 records = 15,000 lost                        │
│  • Out of 10,000,000 total = 0.15% loss rate                        │
│  • But concentrated in specific time windows = 15% loss visible     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

The Root Cause

Auto-commit happened BEFORE processing completed:

  1. poll() returns records
  2. Processing starts
  3. Auto-commit timer fires (5s) - commits offset
  4. Processing fails
  5. Records lost - offset already committed

The Fix

JAVA(45 lines)
Code
Loading syntax highlighter...

After the fix: Zero data loss, with at-most small duplicates on failure recovery.


Mental Model: Offset Lifecycle

┌─────────────────────────────────────────────────────────────────────────┐
│                         OFFSET POSITIONS                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Partition: orders-0                                                    │
│                                                                         │
│  Log:                                                                   │
│  ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐          │
│  │ 0  │ 1  │ 2  │ 3  │ 4  │ 5  │ 6  │ 7  │ 8  │ 9  │ 10 │ 11 │          │
│  └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘          │
│                              ▲         ▲         ▲              ▲       │
│                              │         │         │              │       │
│              Committed ──────┘         │         │              │       │
│              Offset: 4                 │         │              │       │
│                                        │         │              │       │
│              Current ──────────────────┘         │              │       │
│              Position: 7                         │              │       │
│              (Next record poll returns)          │              │       │
│                                                  │              │       │
│              In Progress ────────────────────────┘              │       │
│              Offset: 9                                          │       │
│              (Being processed)                                  │       │
│                                                                 │       │
│              Log End ───────────────────────────────────────────┘       │
│              Offset: 12                                                 │
│              (Next write position)                                      │
│                                                                         │
│  Consumer Lag = Log End Offset - Committed Offset = 12 - 4 = 8          │
│  (8 messages behind)                                                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘


WHAT HAPPENS ON CONSUMER RESTART:

Before Restart:
  Committed: 4, Current: 7, Processing: records 7,8,9

After Restart:
  Consumer starts from Committed Offset + 1 = 5
  Records 5,6,7,8,9 will be reprocessed!
  (Unless they were committed before crash)

┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  This is why "commit AFTER processing" matters!                    │
│                                                                    │
│  Auto-commit: May commit 9 before processing finishes              │
│               → If crash at 8, records 5-8 lost                    │
│                                                                    │
│  Manual commit after batch: Commit 9 after all processed           │
│               → If crash at 8, restart at 5, reprocess 5-9         │
│               → At-least-once delivery (possible duplicates)       │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

Commit Timing Impact

COMMIT TIMING SCENARIOS:

1. COMMIT BEFORE PROCESSING (Dangerous - at-most-once)
   ────────────────────────────────────────────────────
   poll() → commit() → process()

   If process() fails: DATA LOST (offset already committed)
   Guarantee: At-most-once (may lose data)


2. COMMIT AFTER PROCESSING (Safe - at-least-once)
   ────────────────────────────────────────────────────
   poll() → process() → commit()

   If process() fails: Reprocess on restart (offset not committed)
   Guarantee: At-least-once (may have duplicates)


3. AUTO-COMMIT (Risky - unpredictable)
   ────────────────────────────────────────────────────
   poll() → [auto-commit may fire anytime] → process()

   Commit timing unpredictable relative to processing
   Guarantee: Neither (random data loss or duplicates)


4. TRANSACTIONAL COMMIT (Complex - exactly-once)
   ────────────────────────────────────────────────────
   poll() → process() → produce() → commit-with-transaction()

   Consumer offset committed atomically with produced messages
   Guarantee: Exactly-once (with transactional producer)

Deep Dive

1. Auto vs Manual Commit

JAVA(34 lines)
Code
Loading syntax highlighter...

When to Use Each

AUTO-COMMIT:
  ✓ Simple, no code changes needed
  ✓ Good for non-critical data (metrics, logs)
  ✓ When some data loss is acceptable
  ✗ Can lose data on processing failure
  ✗ No control over commit timing

MANUAL COMMIT:
  ✓ Full control over when offsets committed
  ✓ At-least-once delivery guarantee
  ✓ Required for exactly-once processing
  ✗ More complex code
  ✗ Must handle commit failures

RECOMMENDATION:
  Development/Testing: Auto-commit is fine
  Production with data: Always manual commit

2. Manual Commit Strategies

JAVA(80 lines)
Code
Loading syntax highlighter...

Commit Strategy Comparison

┌─────────────────────┬─────────────┬─────────────┬─────────────────────┐
│ Strategy            │ Performance │ Data Safety │ Use Case            │
├─────────────────────┼─────────────┼─────────────┼─────────────────────┤
│ Per-message sync    │ Slowest     │ Safest      │ Financial txns      │
│                     │             │             │ Critical data       │
├─────────────────────┼─────────────┼─────────────┼─────────────────────┤
│ Per-batch sync      │ Medium      │ Safe        │ Most production     │
│                     │             │             │ use cases           │
├─────────────────────┼─────────────┼─────────────┼─────────────────────┤
│ Async commit        │ Fast        │ Less safe   │ High throughput,    │
│                     │             │             │ some loss OK        │
├─────────────────────┼─────────────┼─────────────┼─────────────────────┤
│ Periodic in batch   │ Fast        │ Balanced    │ Large batches,      │
│                     │             │             │ partial progress    │
├─────────────────────┼─────────────┼─────────────┼─────────────────────┤
│ Auto-commit         │ Fastest     │ Risky       │ Non-critical data   │
│                     │             │             │ Metrics, logs       │
└─────────────────────┴─────────────┴─────────────┴─────────────────────┘

3. Spring Kafka AckMode Options

JAVA(17 lines)
Code
Loading syntax highlighter...

AckMode Options

JAVA(46 lines)
Code
Loading syntax highlighter...

4. Seeking to Specific Offsets

JAVA(82 lines)
Code
Loading syntax highlighter...

5. Consumer Lag Monitoring

JAVA(50 lines)
Code
Loading syntax highlighter...

Understanding Consumer Lag

CONSUMER LAG VISUALIZATION:

Partition: orders-0
Time ────────────────────────────────────────────────────────────────►

Producer writes (end offset increases):
End:    100    200    350    500    700    1000    1200
         │      │      │      │      │       │       │
         ▼      ▼      ▼      ▼      ▼       ▼       ▼
Log:   ─────────────────────────────────────────────────────────────►

Consumer commits (committed offset increases):
Committed: 50   100    200    300    500    700     800
           │     │      │      │      │      │       │
           ▼     ▼      ▼      ▼      ▼      ▼       ▼
           ─────────────────────────────────────────────────────────►

Lag = End - Committed:
           50   100    150    200    200    300     400
           │     │      │      │      │      │       │
           │     │      │      │      │      │       └── LAG GROWING!
           │     │      │      │      │      │
           │     │      │      └──────┴──────── LAG STABLE
           │     │      │
           └─────┴──────┴── LAG GROWING INITIALLY

WHAT LAG TELLS YOU:
─────────────────────
• Lag = 0: Consumer caught up, processing in real-time
• Lag stable: Consumer keeping pace with producer
• Lag growing: Consumer falling behind (problem!)
• Lag decreasing: Consumer catching up (recovery)

CAUSES OF GROWING LAG:
──────────────────────
1. Consumer processing too slow
2. Not enough consumer instances
3. Consumer crashing/restarting
4. Producer rate increased
5. Downstream system slow (DB, API)

6. Offset Reset Policy

JAVA(27 lines)
Code
Loading syntax highlighter...

When Offset Reset Triggers

OFFSET RESET SCENARIOS:

1. NEW CONSUMER GROUP
   ─────────────────────
   Consumer group "new-group" subscribes to topic for first time
   No committed offset exists
   → auto.offset.reset applies

2. OFFSET OUT OF RANGE
   ─────────────────────
   Consumer group was offline for days
   Committed offset: 1000
   Current earliest offset: 5000 (old data deleted by retention)
   → Committed offset no longer valid
   → auto.offset.reset applies

3. PARTITION ADDED
   ─────────────────────
   Topic had 3 partitions, now has 6
   Consumer group has no offset for partitions 3,4,5
   → auto.offset.reset applies for new partitions


RESET BEHAVIOR:

┌──────────────────────────────────────────────────────────────────┐
│  auto.offset.reset = earliest                                    │
│                                                                  │
│  Partition:  [msg-0][msg-1][msg-2]...[msg-999][msg-1000]         │
│                ▲                                                 │
│                │                                                 │
│             Start here (beginning)                               │
│                                                                  │
│  Use case: Replay all data, batch processing                     │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  auto.offset.reset = latest                                      │
│                                                                  │
│  Partition:  [msg-0][msg-1][msg-2]...[msg-999][msg-1000]│        │
│                                                         ▲        │
│                                                         │        │
│                                         Start here (end)         │
│                                                                  │
│  Use case: Only new events matter, real-time processing          │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  auto.offset.reset = none                                        │
│                                                                  │
│  Partition:  [msg-0][msg-1][msg-2]...[msg-999][msg-1000]         │
│                                                                  │
│  → NoOffsetForPartitionException thrown                          │
│                                                                  │
│  Use case: Fail-fast, manual offset management                   │
└──────────────────────────────────────────────────────────────────┘

Common Mistakes

Mistake 1: Auto-Commit with Long Processing

JAVA(19 lines)
Code
Loading syntax highlighter...

Mistake 2: Committing Before Processing

JAVA(17 lines)
Code
Loading syntax highlighter...

Mistake 3: Ignoring Commit Failures

JAVA(16 lines)
Code
Loading syntax highlighter...

Mistake 4: Wrong Offset Reset for Use Case

JAVA(17 lines)
Code
Loading syntax highlighter...

Mistake 5: Not Handling Rebalance Commits

JAVA(19 lines)
Code
Loading syntax highlighter...

Debug This

Scenario: Messages Being Skipped

Symptoms:
  • Some messages never processed
  • Consumer lag shows 0 (caught up)
  • No errors in logs
  • Monitoring shows messages produced but not consumed
Investigation:
BASH(11 lines)
Code
Loading syntax highlighter...
JAVA(22 lines)
Code
Loading syntax highlighter...
Common Causes:
  1. Offset reset to latest: New group or out-of-range offset
  2. Seek called incorrectly: Skipping over records
  3. Auto-commit race: Offset committed before processing
  4. Consumer exception suppressed: Not propagating errors
Resolution:
JAVA(27 lines)
Code
Loading syntax highlighter...

Exercises

Exercise 1: Implement Exactly-Once Processing

Create a consumer that:

  1. Reads from input topic
  2. Processes and writes to output topic
  3. Uses transactions for exactly-once semantics
  4. Handles consumer restart gracefully

Exercise 2: Build Offset Recovery System

Implement a system that:

  1. Stores offsets in external database
  2. Recovers from database on startup
  3. Falls back to Kafka offsets if database unavailable
  4. Handles database failures gracefully

Exercise 3: Lag Alert System

Create a monitoring service that:

  1. Tracks consumer lag per partition
  2. Alerts when lag exceeds threshold
  3. Tracks lag trend (growing/shrinking)
  4. Provides estimated time to catch up

Exercise 4: Time-Based Replay

Build a feature that:

  1. Accepts a timestamp parameter
  2. Seeks all partitions to that timestamp
  3. Processes messages from that point
  4. Stops when reaching current time

Exercise 5: Commit Failure Recovery

Implement a consumer that:

  1. Uses async commits for performance
  2. Tracks uncommitted offsets
  3. Retries failed commits
  4. Handles commit failure on shutdown

Interview Questions

Q1: Explain the difference between auto-commit and manual commit.

A: The key differences are control and safety:
Auto-commit:
  • Offsets committed automatically based on timer
  • auto.commit.interval.ms controls frequency
  • No guarantee offset committed after processing
  • Simpler code but can lose or duplicate data
Manual commit:
  • Application explicitly commits offsets
  • Can commit sync or async
  • Full control over when offset is committed
  • More code but safer data handling
Risk example:
Auto-commit interval: 5s
Processing time: 10s

Timeline:
0s  - poll() returns records
5s  - auto-commit fires, offset committed
7s  - processing crashes
Result: Records marked as consumed but not processed = data loss
Best practice: Use manual commit in production for at-least-once delivery.

Q2: What happens when a consumer starts with no committed offset?

A: The behavior depends on auto.offset.reset configuration:
earliest (start from beginning):
  • Consumer reads from oldest available message
  • Good for: Data pipelines, batch processing, catching up
latest (start from end):
  • Consumer reads only new messages
  • Good for: Real-time processing, alerting
none (throw exception):
  • Consumer fails with NoOffsetForPartitionException
  • Good for: Explicit control, catching configuration errors
Scenarios that trigger reset:
  1. Brand new consumer group
  2. Committed offset < log start offset (data deleted)
  3. Committed offset > log end offset (rare, corruption)
  4. New partition added to topic

Q3: How do you handle exactly-once processing with Kafka consumers?

A: Exactly-once requires coordinating offset commits with processing output:
Option 1: Idempotent processing
JAVA(7 lines)
Code
Loading syntax highlighter...
Option 2: Transactional consume-transform-produce
JAVA(9 lines)
Code
Loading syntax highlighter...
Option 3: External transaction coordinator
JAVA(9 lines)
Code
Loading syntax highlighter...

Q4: How would you implement time-based message replay?

A: Use offsetsForTimes to seek to timestamp:
JAVA(22 lines)
Code
Loading syntax highlighter...
Considerations:
  • Timestamp must be message timestamp (not offset)
  • offsetsForTimes returns first message >= timestamp
  • Handle partitions with no messages at that time
  • May need to filter messages in certain time range

Q5: What's the relationship between consumer lag and business impact?

A: Consumer lag directly impacts data freshness and system behavior:
Measuring lag:
  • Lag = Log End Offset - Committed Offset
  • Represents messages not yet processed
  • Can be measured per partition or total
Business impact of high lag:
  1. Data staleness: Dashboards show old data
  2. Delayed alerts: Real-time notifications arrive late
  3. Inconsistency: Different consumers at different points
  4. Customer experience: Orders/events processed late
Interpreting lag trends:
Lag = 0: Perfect, real-time processing
Lag stable: Consumer keeping pace
Lag growing: Consumer falling behind - investigate!
Lag decreasing: Consumer catching up

Estimate time to catch up:
Time = Lag / (ConsumerRate - ProducerRate)
Actions for high lag:
  1. Add more consumers (if partitions allow)
  2. Optimize processing code
  3. Increase resources (CPU, memory)
  4. Reduce processing per message
  5. Add partitions (requires coordination)

Summary

Key Takeaways

  1. Auto-commit is risky for production - offset may commit before processing completes
  2. Manual commit after processing provides at-least-once delivery guarantee
  3. Commit strategy choice depends on throughput vs safety trade-offs
  4. Spring Kafka AckMode provides various commit behaviors - MANUAL most common for production
  5. Consumer lag is the key metric for consumer health - monitor and alert on it
  6. auto.offset.reset determines behavior for new or invalid offsets - choose based on use case
  7. Seeking allows replay or skip - powerful for recovery and debugging
  8. Rebalance commits are critical - always commit pending offsets before partition revocation

Quick Reference

Commit Configuration

PROPERTIES(9 lines)
Code
Loading syntax highlighter...

Spring Kafka AckMode

AckModeCommits WhenUse Case
RECORDAfter each listener callPer-message guarantee
BATCHAfter poll batch processedMost production
MANUALOn acknowledge()Complex logic
MANUAL_IMMEDIATEOn acknowledge(), syncNeed confirmation

Consumer Lag Commands

BASH(16 lines)
Code
Loading syntax highlighter...

Series Navigation

PreviousCurrentNext
Part 9: Consumer GroupsPart 10: Offset ManagementPart 11: Exactly-Once Semantics

Series Overview

  • Part 0: How to Use This Series
  • Parts 1-4: Fundamentals
  • Parts 5-7: Producers
  • Parts 8-11: Consumers (Internals, Groups, Offset Management, Exactly-Once)
  • Parts 12-14: Operations
  • Parts 15-17: Kafka Streams
  • Parts 18-20: Patterns & Practices
  • Part 21: Cheatsheet & Decision Guide