Devops
Offset Management
At a Glance
| Aspect | Details |
|---|---|
| Topic | Auto vs manual commit, commit strategies, seeking, consumer lag |
| Complexity | Intermediate |
| Prerequisites | Parts 8-9 (Consumer Internals, Consumer Groups) |
| Time | 90 minutes |
| Spring Kafka | AckMode options, manual acknowledgment |
What You'll Learn
After completing this article, you will be able to:
- Choose between auto and manual offset commit strategies
- Implement different commit patterns (per-message, per-batch, async)
- Use seeking to replay or skip messages
- Monitor consumer lag and understand its implications
- Configure Spring Kafka AckMode for your use case
Production Story: The Silent Data Loss
The Incident
Our analytics pipeline was processing clickstream data. Everything looked fine in monitoring - no errors, healthy throughput. But one day, a data analyst noticed something alarming: "We're missing 15% of our click events from yesterday. The producer sent 10 million, but we only have 8.5 million in the data warehouse."
The Investigation
JAVA(21 lines)CodeLoading syntax highlighter...
The timeline revealed the problem:
┌─────────────────────────────────────────────────────────────────────┐ │ THE AUTO-COMMIT DATA LOSS │ ├─────────────────────────────────────────────────────────────────────┤ │ │ │ Configuration: │ │ • enable.auto.commit = true │ │ • auto.commit.interval.ms = 5000 (5 seconds) │ │ • max.poll.records = 1000 │ │ • Data warehouse insert time = 3-8 seconds │ │ │ │ Timeline: │ │ ───────────────────────────────────────────────────────────────── │ │ │ │ T=0.0s poll() returns 1000 records (offsets 0-999) │ │ T=0.1s Start batch insert to data warehouse │ │ ... │ │ T=5.0s AUTO-COMMIT HAPPENS (offsets committed!) │ │ But insert still in progress... │ │ ... │ │ T=7.0s Data warehouse insert fails (timeout) │ │ Exception thrown │ │ │ │ BUT: │ │ • Offsets already committed at T=5s │ │ • Those 1000 records are "consumed" in Kafka's view │ │ • They're NOT in the data warehouse │ │ • Consumer moves to next batch │ │ • 1000 events LOST! │ │ │ │ This happened multiple times yesterday: │ │ • ~15 failures × 1000 records = 15,000 lost │ │ • Out of 10,000,000 total = 0.15% loss rate │ │ • But concentrated in specific time windows = 15% loss visible │ │ │ └─────────────────────────────────────────────────────────────────────┘
The Root Cause
Auto-commit happened BEFORE processing completed:
- poll() returns records
- Processing starts
- Auto-commit timer fires (5s) - commits offset
- Processing fails
- Records lost - offset already committed
The Fix
JAVA(45 lines)CodeLoading syntax highlighter...
After the fix: Zero data loss, with at-most small duplicates on failure recovery.
Mental Model: Offset Lifecycle
┌─────────────────────────────────────────────────────────────────────────┐ │ OFFSET POSITIONS │ ├─────────────────────────────────────────────────────────────────────────┤ │ │ │ Partition: orders-0 │ │ │ │ Log: │ │ ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐ │ │ │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │ 9 │ 10 │ 11 │ │ │ └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘ │ │ ▲ ▲ ▲ ▲ │ │ │ │ │ │ │ │ Committed ──────┘ │ │ │ │ │ Offset: 4 │ │ │ │ │ │ │ │ │ │ Current ──────────────────┘ │ │ │ │ Position: 7 │ │ │ │ (Next record poll returns) │ │ │ │ │ │ │ │ In Progress ────────────────────────┘ │ │ │ Offset: 9 │ │ │ (Being processed) │ │ │ │ │ │ Log End ───────────────────────────────────────────┘ │ │ Offset: 12 │ │ (Next write position) │ │ │ │ Consumer Lag = Log End Offset - Committed Offset = 12 - 4 = 8 │ │ (8 messages behind) │ │ │ └─────────────────────────────────────────────────────────────────────────┘ WHAT HAPPENS ON CONSUMER RESTART: Before Restart: Committed: 4, Current: 7, Processing: records 7,8,9 After Restart: Consumer starts from Committed Offset + 1 = 5 Records 5,6,7,8,9 will be reprocessed! (Unless they were committed before crash) ┌────────────────────────────────────────────────────────────────────┐ │ │ │ This is why "commit AFTER processing" matters! │ │ │ │ Auto-commit: May commit 9 before processing finishes │ │ → If crash at 8, records 5-8 lost │ │ │ │ Manual commit after batch: Commit 9 after all processed │ │ → If crash at 8, restart at 5, reprocess 5-9 │ │ → At-least-once delivery (possible duplicates) │ │ │ └────────────────────────────────────────────────────────────────────┘
Commit Timing Impact
COMMIT TIMING SCENARIOS: 1. COMMIT BEFORE PROCESSING (Dangerous - at-most-once) ──────────────────────────────────────────────────── poll() → commit() → process() If process() fails: DATA LOST (offset already committed) Guarantee: At-most-once (may lose data) 2. COMMIT AFTER PROCESSING (Safe - at-least-once) ──────────────────────────────────────────────────── poll() → process() → commit() If process() fails: Reprocess on restart (offset not committed) Guarantee: At-least-once (may have duplicates) 3. AUTO-COMMIT (Risky - unpredictable) ──────────────────────────────────────────────────── poll() → [auto-commit may fire anytime] → process() Commit timing unpredictable relative to processing Guarantee: Neither (random data loss or duplicates) 4. TRANSACTIONAL COMMIT (Complex - exactly-once) ──────────────────────────────────────────────────── poll() → process() → produce() → commit-with-transaction() Consumer offset committed atomically with produced messages Guarantee: Exactly-once (with transactional producer)
Deep Dive
1. Auto vs Manual Commit
JAVA(34 lines)CodeLoading syntax highlighter...
When to Use Each
AUTO-COMMIT: ✓ Simple, no code changes needed ✓ Good for non-critical data (metrics, logs) ✓ When some data loss is acceptable ✗ Can lose data on processing failure ✗ No control over commit timing MANUAL COMMIT: ✓ Full control over when offsets committed ✓ At-least-once delivery guarantee ✓ Required for exactly-once processing ✗ More complex code ✗ Must handle commit failures RECOMMENDATION: Development/Testing: Auto-commit is fine Production with data: Always manual commit
2. Manual Commit Strategies
JAVA(80 lines)CodeLoading syntax highlighter...
Commit Strategy Comparison
┌─────────────────────┬─────────────┬─────────────┬─────────────────────┐ │ Strategy │ Performance │ Data Safety │ Use Case │ ├─────────────────────┼─────────────┼─────────────┼─────────────────────┤ │ Per-message sync │ Slowest │ Safest │ Financial txns │ │ │ │ │ Critical data │ ├─────────────────────┼─────────────┼─────────────┼─────────────────────┤ │ Per-batch sync │ Medium │ Safe │ Most production │ │ │ │ │ use cases │ ├─────────────────────┼─────────────┼─────────────┼─────────────────────┤ │ Async commit │ Fast │ Less safe │ High throughput, │ │ │ │ │ some loss OK │ ├─────────────────────┼─────────────┼─────────────┼─────────────────────┤ │ Periodic in batch │ Fast │ Balanced │ Large batches, │ │ │ │ │ partial progress │ ├─────────────────────┼─────────────┼─────────────┼─────────────────────┤ │ Auto-commit │ Fastest │ Risky │ Non-critical data │ │ │ │ │ Metrics, logs │ └─────────────────────┴─────────────┴─────────────┴─────────────────────┘
3. Spring Kafka AckMode Options
JAVA(17 lines)CodeLoading syntax highlighter...
AckMode Options
JAVA(46 lines)CodeLoading syntax highlighter...
4. Seeking to Specific Offsets
JAVA(82 lines)CodeLoading syntax highlighter...
5. Consumer Lag Monitoring
JAVA(50 lines)CodeLoading syntax highlighter...
Understanding Consumer Lag
CONSUMER LAG VISUALIZATION: Partition: orders-0 Time ────────────────────────────────────────────────────────────────► Producer writes (end offset increases): End: 100 200 350 500 700 1000 1200 │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ ▼ ▼ Log: ─────────────────────────────────────────────────────────────► Consumer commits (committed offset increases): Committed: 50 100 200 300 500 700 800 │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ─────────────────────────────────────────────────────────► Lag = End - Committed: 50 100 150 200 200 300 400 │ │ │ │ │ │ │ │ │ │ │ │ │ └── LAG GROWING! │ │ │ │ │ │ │ │ │ └──────┴──────── LAG STABLE │ │ │ └─────┴──────┴── LAG GROWING INITIALLY WHAT LAG TELLS YOU: ───────────────────── • Lag = 0: Consumer caught up, processing in real-time • Lag stable: Consumer keeping pace with producer • Lag growing: Consumer falling behind (problem!) • Lag decreasing: Consumer catching up (recovery) CAUSES OF GROWING LAG: ────────────────────── 1. Consumer processing too slow 2. Not enough consumer instances 3. Consumer crashing/restarting 4. Producer rate increased 5. Downstream system slow (DB, API)
6. Offset Reset Policy
JAVA(27 lines)CodeLoading syntax highlighter...
When Offset Reset Triggers
OFFSET RESET SCENARIOS: 1. NEW CONSUMER GROUP ───────────────────── Consumer group "new-group" subscribes to topic for first time No committed offset exists → auto.offset.reset applies 2. OFFSET OUT OF RANGE ───────────────────── Consumer group was offline for days Committed offset: 1000 Current earliest offset: 5000 (old data deleted by retention) → Committed offset no longer valid → auto.offset.reset applies 3. PARTITION ADDED ───────────────────── Topic had 3 partitions, now has 6 Consumer group has no offset for partitions 3,4,5 → auto.offset.reset applies for new partitions RESET BEHAVIOR: ┌──────────────────────────────────────────────────────────────────┐ │ auto.offset.reset = earliest │ │ │ │ Partition: [msg-0][msg-1][msg-2]...[msg-999][msg-1000] │ │ ▲ │ │ │ │ │ Start here (beginning) │ │ │ │ Use case: Replay all data, batch processing │ └──────────────────────────────────────────────────────────────────┘ ┌──────────────────────────────────────────────────────────────────┐ │ auto.offset.reset = latest │ │ │ │ Partition: [msg-0][msg-1][msg-2]...[msg-999][msg-1000]│ │ │ ▲ │ │ │ │ │ Start here (end) │ │ │ │ Use case: Only new events matter, real-time processing │ └──────────────────────────────────────────────────────────────────┘ ┌──────────────────────────────────────────────────────────────────┐ │ auto.offset.reset = none │ │ │ │ Partition: [msg-0][msg-1][msg-2]...[msg-999][msg-1000] │ │ │ │ → NoOffsetForPartitionException thrown │ │ │ │ Use case: Fail-fast, manual offset management │ └──────────────────────────────────────────────────────────────────┘
Common Mistakes
Mistake 1: Auto-Commit with Long Processing
JAVA(19 lines)CodeLoading syntax highlighter...
Mistake 2: Committing Before Processing
JAVA(17 lines)CodeLoading syntax highlighter...
Mistake 3: Ignoring Commit Failures
JAVA(16 lines)CodeLoading syntax highlighter...
Mistake 4: Wrong Offset Reset for Use Case
JAVA(17 lines)CodeLoading syntax highlighter...
Mistake 5: Not Handling Rebalance Commits
JAVA(19 lines)CodeLoading syntax highlighter...
Debug This
Scenario: Messages Being Skipped
Symptoms:
- Some messages never processed
- Consumer lag shows 0 (caught up)
- No errors in logs
- Monitoring shows messages produced but not consumed
Investigation:
BASH(11 lines)CodeLoading syntax highlighter...
JAVA(22 lines)CodeLoading syntax highlighter...
Common Causes:
- Offset reset to latest: New group or out-of-range offset
- Seek called incorrectly: Skipping over records
- Auto-commit race: Offset committed before processing
- Consumer exception suppressed: Not propagating errors
Resolution:
JAVA(27 lines)CodeLoading syntax highlighter...
Exercises
Exercise 1: Implement Exactly-Once Processing
Create a consumer that:
- Reads from input topic
- Processes and writes to output topic
- Uses transactions for exactly-once semantics
- Handles consumer restart gracefully
Exercise 2: Build Offset Recovery System
Implement a system that:
- Stores offsets in external database
- Recovers from database on startup
- Falls back to Kafka offsets if database unavailable
- Handles database failures gracefully
Exercise 3: Lag Alert System
Create a monitoring service that:
- Tracks consumer lag per partition
- Alerts when lag exceeds threshold
- Tracks lag trend (growing/shrinking)
- Provides estimated time to catch up
Exercise 4: Time-Based Replay
Build a feature that:
- Accepts a timestamp parameter
- Seeks all partitions to that timestamp
- Processes messages from that point
- Stops when reaching current time
Exercise 5: Commit Failure Recovery
Implement a consumer that:
- Uses async commits for performance
- Tracks uncommitted offsets
- Retries failed commits
- Handles commit failure on shutdown
Interview Questions
Q1: Explain the difference between auto-commit and manual commit.
A: The key differences are control and safety:
Auto-commit:
- Offsets committed automatically based on timer
auto.commit.interval.mscontrols frequency- No guarantee offset committed after processing
- Simpler code but can lose or duplicate data
Manual commit:
- Application explicitly commits offsets
- Can commit sync or async
- Full control over when offset is committed
- More code but safer data handling
Risk example:
Auto-commit interval: 5s Processing time: 10s Timeline: 0s - poll() returns records 5s - auto-commit fires, offset committed 7s - processing crashes Result: Records marked as consumed but not processed = data loss
Best practice: Use manual commit in production for at-least-once delivery.
Q2: What happens when a consumer starts with no committed offset?
A: The behavior depends on
auto.offset.reset configuration:earliest (start from beginning):
- Consumer reads from oldest available message
- Good for: Data pipelines, batch processing, catching up
latest (start from end):
- Consumer reads only new messages
- Good for: Real-time processing, alerting
none (throw exception):
- Consumer fails with
NoOffsetForPartitionException - Good for: Explicit control, catching configuration errors
Scenarios that trigger reset:
- Brand new consumer group
- Committed offset < log start offset (data deleted)
- Committed offset > log end offset (rare, corruption)
- New partition added to topic
Q3: How do you handle exactly-once processing with Kafka consumers?
A: Exactly-once requires coordinating offset commits with processing output:
Option 1: Idempotent processing
JAVA(7 lines)CodeLoading syntax highlighter...
Option 2: Transactional consume-transform-produce
JAVA(9 lines)CodeLoading syntax highlighter...
Option 3: External transaction coordinator
JAVA(9 lines)CodeLoading syntax highlighter...
Q4: How would you implement time-based message replay?
A: Use
offsetsForTimes to seek to timestamp:JAVA(22 lines)CodeLoading syntax highlighter...
Considerations:
- Timestamp must be message timestamp (not offset)
offsetsForTimesreturns first message >= timestamp- Handle partitions with no messages at that time
- May need to filter messages in certain time range
Q5: What's the relationship between consumer lag and business impact?
A: Consumer lag directly impacts data freshness and system behavior:
Measuring lag:
- Lag = Log End Offset - Committed Offset
- Represents messages not yet processed
- Can be measured per partition or total
Business impact of high lag:
- Data staleness: Dashboards show old data
- Delayed alerts: Real-time notifications arrive late
- Inconsistency: Different consumers at different points
- Customer experience: Orders/events processed late
Interpreting lag trends:
Lag = 0: Perfect, real-time processing Lag stable: Consumer keeping pace Lag growing: Consumer falling behind - investigate! Lag decreasing: Consumer catching up Estimate time to catch up: Time = Lag / (ConsumerRate - ProducerRate)
Actions for high lag:
- Add more consumers (if partitions allow)
- Optimize processing code
- Increase resources (CPU, memory)
- Reduce processing per message
- Add partitions (requires coordination)
Summary
Key Takeaways
-
Auto-commit is risky for production - offset may commit before processing completes
-
Manual commit after processing provides at-least-once delivery guarantee
-
Commit strategy choice depends on throughput vs safety trade-offs
-
Spring Kafka AckMode provides various commit behaviors - MANUAL most common for production
-
Consumer lag is the key metric for consumer health - monitor and alert on it
-
auto.offset.reset determines behavior for new or invalid offsets - choose based on use case
-
Seeking allows replay or skip - powerful for recovery and debugging
-
Rebalance commits are critical - always commit pending offsets before partition revocation
Quick Reference
Commit Configuration
PROPERTIES(9 lines)CodeLoading syntax highlighter...
Spring Kafka AckMode
| AckMode | Commits When | Use Case |
|---|---|---|
| RECORD | After each listener call | Per-message guarantee |
| BATCH | After poll batch processed | Most production |
| MANUAL | On acknowledge() | Complex logic |
| MANUAL_IMMEDIATE | On acknowledge(), sync | Need confirmation |
Consumer Lag Commands
BASH(16 lines)CodeLoading syntax highlighter...
Series Navigation
| Previous | Current | Next |
|---|---|---|
| Part 9: Consumer Groups | Part 10: Offset Management | Part 11: Exactly-Once Semantics |
Series Overview
- Part 0: How to Use This Series
- Parts 1-4: Fundamentals
- Parts 5-7: Producers
- Parts 8-11: Consumers (Internals, Groups, Offset Management, Exactly-Once)
- Parts 12-14: Operations
- Parts 15-17: Kafka Streams
- Parts 18-20: Patterns & Practices
- Part 21: Cheatsheet & Decision Guide