Devops
Leaders, ISR & Fault Tolerance
π At a Glance
| Aspect | Details |
|---|---|
| Difficulty | π Advanced |
| Prerequisites | Part 2 (Partitions & Replication) |
| Key Concepts | Leader election, ISR, HW, LEO, min.insync.replicas |
| Time Investment | 32 minutes read + 45 minutes practice |
| Payoff | Understand exactly when data can be lost and how to prevent it |
π― What You'll Learn
After this article, you'll be able to:
- Explain leader election and what triggers it
- Understand ISR dynamics and replica.lag.time.max.ms
- Configure min.insync.replicas correctly
- Prevent data loss scenarios with proper acks settings
- Diagnose under-replicated partitions and ISR shrinkage
π₯ Production Story: The Unclean Election
The Setup: A financial services company ran a 5-broker Kafka cluster for trade events. Configuration: RF=3, min.insync.replicas=2, acks=all. They were confident no data could be lost.
The Incident: During a network partition, two brokers became isolated. The remaining three brokers elected new leaders.
The Symptoms:
Alert: 15,000 trade events missing! Time window: 14:32:15 - 14:33:45 UTC
The Investigation:
BASH(4 lines)CodeLoading syntax highlighter...
The Root Cause: They had
unclean.leader.election.enable=true (an old default). When the network partition occurred:BEFORE PARTITION: Broker 1 (Leader, ISR): offset 1,000,000 Broker 2 (ISR): offset 1,000,000 Broker 3 (ISR): offset 999,985 (slightly behind, still in ISR) NETWORK PARTITION (Brokers 1,2 isolated): Remaining: Broker 3,4,5 Broker 3 had offset 999,985 UNCLEAN ELECTION: Broker 3 elected leader (only available replica) New leader offset: 999,985 Messages 999,986 - 1,000,000: LOST FOREVER PARTITION HEALS: Brokers 1,2 rejoin They truncate to match new leader 15,000 messages gone
The Fix:
PROPERTIES(2 lines)CodeLoading syntax highlighter...
Lesson Learned: With unclean leader election, Kafka prioritizes availability over consistency. For financial data, this is unacceptable. Disable it.
π§ Mental Model: Leader, Followers, and ISR
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β LEADER, FOLLOWERS & ISR β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β Partition 0, RF=3 β β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β LEADER (Broker 1) β β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β Offset: ... 95 96 97 98 99 100 101 102 ββ β β β β [M] [M] [M] [M] [M] [M] [M] [M] ββ β β β β β ββ β β β β LEO (Log End Offset) ββ β β β β = 102 ββ β β β β ββ β β β β β ββ β β β β HW (High Watermark) = 100 ββ β β β β "Committed" - safe to consume ββ β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β β β Followers fetch from leader β β βΌ β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β FOLLOWER (Broker 2) - IN ISR β β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β Offset: ... 95 96 97 98 99 100 ββ β β β β [M] [M] [M] [M] [M] [M] ββ β β β β β LEO = 100 ββ β β β β β Within replica.lag.time.max.ms ββ β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β FOLLOWER (Broker 3) - REMOVED FROM ISR β β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β Offset: ... 95 96 97 98 ββ β β β β [M] [M] [M] [M] ββ β β β β β LEO = 98 ββ β β β β β Too far behind (lag > threshold) ββ β β β β β Removed from ISR ββ β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β ISR = {Broker 1, Broker 2} β β β β KEY CONCEPTS: β β β’ LEO: Last offset written (leader's view) β β β’ HW: Last offset replicated to ALL ISR members β β β’ Consumers can only read up to HW β β β’ Messages between HW and LEO: "uncommitted" β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
π¬ Deep Dive
1. Leader Election
Every partition has exactly one leader. Leaders handle all reads and writes.
When does leader election happen?
- Broker failure: Leader goes down, new leader elected from ISR
- Controlled shutdown: Leader migrates before broker stops
- Rebalance: Admin triggers preferred leader election
- Unclean election: No ISR available, elect from non-ISR (if enabled)
Election process:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β LEADER ELECTION FLOW β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β 1. LEADER FAILURE DETECTED β β Controller notices leader is unresponsive β β (via ZK session timeout or KRaft heartbeat) β β β β 2. SELECT NEW LEADER β β Controller picks first replica from ISR β β (Ordering: prefer existing ISR, then by replica ID) β β β β 3. UPDATE METADATA β β Controller updates cluster metadata β β New leader is authoritative β β β β 4. NOTIFY CLIENTS β β Producers/Consumers get new metadata β β Requests route to new leader β β β β Timeline: Typically < 1 second β β During election: Partition unavailable for writes β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Preferred leader election:
BASH(15 lines)CodeLoading syntax highlighter...
2. ISR (In-Sync Replicas) Deep Dive
ISR is the set of replicas that are "caught up" with the leader.
What determines "in sync"?
PROPERTIES(6 lines)CodeLoading syntax highlighter...
ISR dynamics:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β ISR SHRINKING AND EXPANDING β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β TIME 0: Normal operation β β ISR = {Leader, Follower1, Follower2} β β β β TIME 1: Follower2 becomes slow (network issue) β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β Leader: offset 1000 β β β β Follower1: offset 998 (within lag threshold) β β β β β Follower2: offset 850 (hasn't fetched in 35s) β β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β Controller removes Follower2 from ISR β β ISR = {Leader, Follower1} β β β β TIME 2: Follower2 recovers, starts catching up β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β Leader: offset 1200 β β β β Follower1: offset 1198 β β β β Follower2: offset 1195 (catching up, fetched recently)β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β Follower2 added back to ISR β β ISR = {Leader, Follower1, Follower2} β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Monitoring ISR:
BASH(11 lines)CodeLoading syntax highlighter...
3. High Watermark (HW) and Log End Offset (LEO)
These are crucial for understanding data visibility:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β HW vs LEO EXPLAINED β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β LEADER: β β [0][1][2][3][4][5][6][7][8][9][10][11][12] β β β β β β HW=9 LEO=12 β β β β FOLLOWER 1 (ISR): β β [0][1][2][3][4][5][6][7][8][9][10] β β β β β β HW=9 LEO=10 β β β β FOLLOWER 2 (ISR): β β [0][1][2][3][4][5][6][7][8][9] β β β β β β HW=LEO=9 β β β β HW = minimum LEO across all ISR replicas β β HW = 9 (because Follower2's LEO is 9) β β β β VISIBILITY: β β β’ Consumers can read: [0] to [9] (up to HW) β β β’ Messages [10-12]: written but not yet "committed" β β β’ If leader fails before [10-12] replicate: DATA LOST β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Why this matters for acks:
JAVA(11 lines)CodeLoading syntax highlighter...
4. min.insync.replicas
This is your safety net:
PROPERTIES(6 lines)CodeLoading syntax highlighter...
How it works:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β min.insync.replicas BEHAVIOR β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β Config: RF=3, min.insync.replicas=2, acks=all β β β β SCENARIO 1: All brokers healthy β β ISR = {B1, B2, B3} (size = 3) β β 3 >= 2 β Writes succeed β β β β SCENARIO 2: One follower slow/down β β ISR = {B1, B2} (size = 2) β β 2 >= 2 β Writes succeed β β β β SCENARIO 3: Two followers down β β ISR = {B1} (size = 1) β β 1 < 2 β Writes FAIL with NotEnoughReplicasException β β β β This protects you: β β β’ Can't write to single replica that might fail β β β’ Guarantees data on at least 2 machines β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Recommended settings:
| RF | min.insync.replicas | Fault Tolerance |
|---|---|---|
| 3 | 2 | Survives 1 broker failure |
| 5 | 3 | Survives 2 broker failures |
Setting in Spring Kafka:
JAVA(8 lines)CodeLoading syntax highlighter...
5. Unclean Leader Election
The most dangerous configuration in Kafka:
PROPERTIES(6 lines)CodeLoading syntax highlighter...
When unclean election happens:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β UNCLEAN LEADER ELECTION β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β INITIAL STATE: β β Broker 1 (Leader): offset 1000 β β Broker 2 (ISR): offset 1000 β β Broker 3 (not ISR): offset 950 (lagging) β β β β DISASTER: Brokers 1 and 2 fail simultaneously β β β β IF unclean.leader.election.enable=false: β β β’ Partition becomes unavailable β β β’ Writes fail until B1 or B2 recover β β β’ No data loss β β β β IF unclean.leader.election.enable=true: β β β’ Broker 3 elected as leader β β β’ New leader at offset 950 β β β’ Messages 951-1000: GONE FOREVER β β β’ When B1/B2 recover, they truncate to 950 β β β’ Partition available, but data lost β β β β RECOMMENDATION: Always set to false for important data β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
6. The Write Path: End-to-End
Understanding how writes flow helps diagnose issues:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β WRITE PATH (acks=all) β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β 1. PRODUCER SENDS β β Producer β Leader (Broker 1) β β Message appended to leader's log β β Leader LEO: 100 β 101 β β β β 2. FOLLOWERS FETCH β β Followers poll leader for new data β β Broker 2 fetches, LEO: 100 β 101 β β Broker 3 fetches, LEO: 100 β 101 β β β β 3. ACKNOWLEDGE REPLICATION β β Followers send fetch response with their LEO β β Leader updates HW when all ISR caught up β β HW: 100 β 101 β β β β 4. LEADER ACKNOWLEDGES PRODUCER β β Leader sends ack to producer β β Producer marks message as sent β β β β 5. FOLLOWERS UPDATE HW β β Next fetch request, followers learn new HW β β Followers update local HW: 100 β 101 β β β β TIMING: β β Steps 1-4: ~5-10ms (same datacenter) β β Bottleneck: Slowest ISR replica β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
7. Failure Scenarios and Recovery
Scenario 1: Leader fails, ISR available
BEFORE: Broker 1 (Leader, ISR): offset 1000 Broker 2 (ISR): offset 1000 Broker 3 (ISR): offset 998 BROKER 1 FAILS: Controller detects failure New leader: Broker 2 (first in ISR) Broker 3 catches up to 1000 RESULT: No data loss ~1 second unavailability Broker 3 now at offset 1000
Scenario 2: Leader fails with uncommitted data
BEFORE: Broker 1 (Leader): LEO=1005, HW=1000 Broker 2 (ISR): LEO=1000 Broker 3 (ISR): LEO=1000 Messages 1001-1005: In leader only, not yet replicated BROKER 1 FAILS: New leader: Broker 2 Leader LEO becomes 1000 HW remains 1000 BROKER 1 RECOVERS: Broker 1 truncates to HW=1000 Messages 1001-1005: LOST LESSON: With acks=1, this is possible With acks=all, messages 1001-1005 wouldn't be acked
Spring Kafka producer configuration for durability:
JAVA(23 lines)CodeLoading syntax highlighter...
β οΈ Common Mistakes
Mistake 1: Using acks=1 for Critical Data
JAVA(6 lines)CodeLoading syntax highlighter...
Mistake 2: min.insync.replicas = RF
PROPERTIES(5 lines)CodeLoading syntax highlighter...
Mistake 3: Ignoring Under-Replicated Partitions
BASH(6 lines)CodeLoading syntax highlighter...
Mistake 4: Setting replica.lag.time.max.ms Too Low
PROPERTIES(8 lines)CodeLoading syntax highlighter...
Mistake 5: Enabling Unclean Leader Election
PROPERTIES(5 lines)CodeLoading syntax highlighter...
π Debug This
You see this error in producer logs:
org.apache.kafka.common.errors.NotEnoughReplicasException: Messages are rejected since there are fewer in-sync replicas than required.
Your configuration:
- RF = 3
- min.insync.replicas = 2
- acks = all
All three brokers are running. What's wrong?
Click to reveal analysis
The error means: ISR size < min.insync.replicas for the partition you're writing to.
Investigation steps:
BASH(8 lines)CodeLoading syntax highlighter...
Possible causes:
-
Followers can't keep up: High produce rate, followers falling behindBASH(2 lines)CodeLoading syntax highlighter...
-
Network issues: Followers can't reach leaderBASH(2 lines)CodeLoading syntax highlighter...
-
Disk issues: Followers' disks are slowBASH(2 lines)CodeLoading syntax highlighter...
-
GC pauses: Long GC pauses cause followers to be removedBASH(2 lines)CodeLoading syntax highlighter...
Quick fix:
BASH(4 lines)CodeLoading syntax highlighter...
Real fix: Identify why followers are out of ISR and fix the root cause.
π» Exercises
Exercise 1: ISR Observation
BASH(7 lines)CodeLoading syntax highlighter...
Exercise 2: min.insync.replicas Testing
JAVA(5 lines)CodeLoading syntax highlighter...
Exercise 3: Unclean Election Simulation
BASH(6 lines)CodeLoading syntax highlighter...
Exercise 4: High Watermark Monitoring
BASH(4 lines)CodeLoading syntax highlighter...
Exercise 5: Preferred Leader Election
BASH(5 lines)CodeLoading syntax highlighter...
π€ Interview Questions
Q1: What is the ISR and why is it important?
Answer: ISR (In-Sync Replicas) is the set of replicas that are caught up with the leader within
replica.lag.time.max.ms.Importance:
-
Leader election: Only ISR members can become leader (with unclean election disabled). This ensures no data loss.
-
Write durability: With
acks=all, writes are only acknowledged when replicated to all ISR members. -
High watermark: HW advances only when all ISR replicas have the data. Consumers can only read up to HW.
ISR dynamics:
Replica falls behind > replica.lag.time.max.ms β Removed from ISR Replica catches up and fetches within threshold β Added back to ISR
Monitoring: Under-replicated partitions (ISR < RF) indicate potential durability risk.
Q2: Explain the relationship between acks, min.insync.replicas, and data durability.
Answer: These settings work together to determine durability guarantees:
acks=0: No durability guarantee Producer doesn't wait for any acknowledgment Message may be lost before reaching broker acks=1: Leader durability only Leader acknowledges before replication If leader fails immediately after ack: DATA LOSS acks=all + min.insync.replicas=1: At least leader must acknowledge Same as acks=1 in practice acks=all + min.insync.replicas=2: At least 2 replicas must have the data Survives 1 broker failure without data loss acks=all + min.insync.replicas=N: At least N replicas must have the data If ISR < N, writes fail (availability trade-off)
Recommendation for critical data:
PROPERTIES(3 lines)CodeLoading syntax highlighter...
This survives 1 broker failure without data loss or unavailability.
Q3: What is unclean leader election and when might you enable it?
Answer: Unclean leader election allows electing a leader from replicas that are NOT in the ISRβmeaning replicas that may be behind.
Risk: Data loss. The new leader may be missing messages that were acknowledged.
When to enable (rare):
- Availability is more important than consistency
- Log data that can be regenerated
- You have other durability mechanisms (e.g., source system)
When to disable (default, recommended):
- Financial transactions
- Critical business data
- Any data that cannot be recovered
The trade-off:
unclean.leader.election.enable=false: ISR exhausted β Partition unavailable No data loss, but writes fail unclean.leader.election.enable=true: ISR exhausted β Non-ISR replica becomes leader Writes succeed, but may lose data
Q4: How does Kafka's high watermark (HW) work?
Answer: High watermark is the offset up to which consumers can read. It ensures consumers only see "committed" data.
Mechanics:
- Producer writes message to leader at offset N
- Leader's LEO (Log End Offset) advances to N
- Followers fetch and replicate message
- When ALL ISR replicas have the message, HW advances to N
- Consumers can now read offset N
Why it matters:
LEO = 100, HW = 95 Messages 96-100: Written but not fully replicated If leader fails: Messages 96-100 may be lost Consumers can't read them yet (can only read up to HW=95) This is intentional: Consumers never see data that might be lost
Visibility guarantee: Consumers never see messages that could disappear on failure.
Q5: A partition has ISR={Leader}. What risks does this pose and how would you address it?
Answer: ISR size of 1 means no redundancyβcritical risk.
Risks:
- Data loss: If leader fails, data since last ISR sync is lost
- With min.insync.replicas=2: Writes fail (safer)
- With min.insync.replicas=1: Writes succeed but unprotected
Investigation:
BASH(8 lines)CodeLoading syntax highlighter...
Remediation:
- Immediate: Fix follower issues so they rejoin ISR
- If broker dead: Reassign partition to healthy brokers
- Monitoring: Alert on ISR < RF before it becomes critical
Prevention:
PROPERTIES(5 lines)CodeLoading syntax highlighter...
π Summary & Key Takeaways
Key Concepts
| Concept | Definition |
|---|---|
| Leader | Handles all reads/writes for a partition |
| ISR | Replicas caught up with leader |
| LEO | Log End Offset - last message in log |
| HW | High Watermark - last committed offset |
Durability Configuration
PROPERTIES(8 lines)CodeLoading syntax highlighter...
Failure Behavior
| Scenario | Clean Election | Unclean Election |
|---|---|---|
| Leader fails, ISR available | New leader from ISR, no data loss | Same |
| All ISR fails | Partition unavailable | Data loss possible |
π Quick Reference
BASH(13 lines)CodeLoading syntax highlighter...
π Review Schedule
- Day 1: Understand ISR and HW concepts
- Day 3: Practice ISR monitoring
- Day 7: Test failure scenarios
- Day 14: Review acks/min.insync.replicas interaction
- Day 30: Explain durability guarantees without notes