Leaders, ISR & Fault Tolerance

📋 At a Glance

Aspect	Details
Difficulty	🟠 Advanced
Prerequisites	Part 2 (Partitions & Replication)
Key Concepts	Leader election, ISR, HW, LEO, min.insync.replicas
Time Investment	32 minutes read + 45 minutes practice
Payoff	Understand exactly when data can be lost and how to prevent it

🎯 What You'll Learn

After this article, you'll be able to:

Explain leader election and what triggers it
Understand ISR dynamics and replica.lag.time.max.ms
Configure min.insync.replicas correctly
Prevent data loss scenarios with proper acks settings
Diagnose under-replicated partitions and ISR shrinkage

🔥 Production Story: The Unclean Election

The Setup: A financial services company ran a 5-broker Kafka cluster for trade events. Configuration: RF=3, min.insync.replicas=2, acks=all. They were confident no data could be lost.

The Incident: During a network partition, two brokers became isolated. The remaining three brokers elected new leaders.

The Symptoms:

Alert: 15,000 trade events missing!
Time window: 14:32:15 - 14:33:45 UTC

The Investigation:

[Code block not preserved: Bash, 4 lines]

The Root Cause: They had unclean.leader.election.enable=true (the default before Kafka 0.11). When the network partition occurred:

BEFORE PARTITION:
Broker 1 (Leader, ISR): offset 1,000,000
Broker 2 (ISR):         offset 1,000,000
Broker 3 (not in ISR):  offset 985,000  (lagging, already removed from ISR)

NETWORK PARTITION (Brokers 1,2 isolated):
Remaining: Broker 3,4,5
Broker 3 had offset 985,000

UNCLEAN ELECTION:
Broker 3 elected leader (only available replica, not in ISR)
New leader offset: 985,000
Messages 985,001 - 1,000,000: LOST FOREVER

PARTITION HEALS:
Brokers 1,2 rejoin
They truncate to match new leader
15,000 messages gone

The Fix:

[Code block not preserved: Properties, 2 lines]
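
A minimal sketch of what such a fix could look like, expressed as a Java map of topic-level overrides. The class and method names are hypothetical; the two keys are standard Kafka topic configs, and the value 2 simply mirrors the setup described in this story.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TopicDurabilityConfig {
    /** Topic-level overrides that favor consistency over availability. */
    static Map<String, String> safeOverrides() {
        Map<String, String> config = new LinkedHashMap<>();
        // Never elect a replica that is outside the ISR.
        config.put("unclean.leader.election.enable", "false");
        // Mirrors the min.insync.replicas=2 used in the story's setup.
        config.put("min.insync.replicas", "2");
        return config;
    }

    public static void main(String[] args) {
        // Prints each override as key=value, in insertion order.
        safeOverrides().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The same key/value pairs could equally be set broker-wide in server.properties; a map is used here only to keep the sketch in one language.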

Lesson Learned: With unclean leader election, Kafka prioritizes availability over consistency. For financial data, this is unacceptable. Disable it.

🧠 Mental Model: Leader, Followers, and ISR

┌─────────────────────────────────────────────────────────────────┐
│                    LEADER, FOLLOWERS & ISR                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Partition 0, RF=3                                             │
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  LEADER (Broker 1)                                      │   │
│   │  ┌─────────────────────────────────────────────────────┐│   │
│   │  │ Offset: ... 95   96   97   98   99   100  101  102  ││   │
│   │  │              [M] [M] [M] [M] [M] [M] [M] [M]        ││   │
│   │  │                                      ↑              ││   │
│   │  │                            LEO (Log End Offset)     ││   │
│   │  │                                 = 102               ││   │
│   │  │                                                     ││   │
│   │  │                               ↑                     ││   │
│   │  │                    HW (High Watermark) = 100        ││   │
│   │  │                    "Committed" - safe to consume    ││   │
│   │  └─────────────────────────────────────────────────────┘│   │
│   └─────────────────────────────────────────────────────────┘   │
│         │                                                       │
│         │ Followers fetch from leader                           │
│         ▼                                                       │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  FOLLOWER (Broker 2) - IN ISR                           │   │
│   │  ┌─────────────────────────────────────────────────────┐│   │
│   │  │ Offset: ... 95   96   97   98   99   100            ││   │
│   │  │              [M] [M] [M] [M] [M] [M]                ││   │
│   │  │                                   ↑ LEO = 100       ││   │
│   │  │            ✓ Within replica.lag.time.max.ms         ││   │
│   │  └─────────────────────────────────────────────────────┘│   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  FOLLOWER (Broker 3) - REMOVED FROM ISR                 │   │
│   │  ┌─────────────────────────────────────────────────────┐│   │
│   │  │ Offset: ... 95   96   97   98                       ││   │
│   │  │              [M] [M] [M] [M]                        ││   │
│   │  │                           ↑ LEO = 98                ││   │
│   │  │            ✗ Too far behind (lag > threshold)       ││   │
│   │  │            ✗ Removed from ISR                       ││   │
│   │  └─────────────────────────────────────────────────────┘│   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│   ISR = {Broker 1, Broker 2}                                    │
│                                                                 │
│   KEY CONCEPTS:                                                 │
│   • LEO: Last offset written (leader's view)                    │
│   • HW: Last offset replicated to ALL ISR members               │
│   • Consumers can only read up to HW                            │
│   • Messages between HW and LEO: "uncommitted"                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

🔬 Deep Dive

1. Leader Election

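Every partition has exactly one leader at a time. The leader handles all writes and, by default, all reads; since Kafka 2.4, consumers can optionally fetch from followers (KIP-392).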

When does leader election happen?

Broker failure: Leader goes down, new leader elected from ISR
Controlled shutdown: Leader migrates before broker stops
Rebalance: Admin triggers preferred leader election
Unclean election: No ISR available, elect from non-ISR (if enabled)

Election process:

┌─────────────────────────────────────────────────────────────────┐
│                    LEADER ELECTION FLOW                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. LEADER FAILURE DETECTED                                    │
│      Controller notices leader is unresponsive                  │
│      (via ZK session timeout or KRaft heartbeat)                │
│                                                                 │
│   2. SELECT NEW LEADER                                          │
│      Controller picks the first live replica in the assignment  │
│      list that is also in the ISR                               │
│                                                                 │
│   3. UPDATE METADATA                                            │
│      Controller updates cluster metadata                        │
│      New leader is authoritative                                │
│                                                                 │
│   4. NOTIFY CLIENTS                                             │
│      Producers/Consumers get new metadata                       │
│      Requests route to new leader                               │
│                                                                 │
│   Timeline: Typically < 1 second once failure is detected       │
│   During election: Partition unavailable for writes             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
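
The selection step in the flow above can be sketched in plain Java. This is a simplified model, not Kafka's controller code: the class `LeaderSelector`, the method `electLeader`, and the use of integer broker IDs are assumptions made for illustration. It picks the first replica in assignment order that is both alive and in the ISR, and falls back to any live replica only when unclean election is allowed.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class LeaderSelector {
    /**
     * Returns the new leader, or empty if the partition must stay offline.
     * assignment lists replica broker IDs in assignment order (preferred leader first).
     */
    static Optional<Integer> electLeader(List<Integer> assignment,
                                         Set<Integer> isr,
                                         Set<Integer> liveBrokers,
                                         boolean uncleanAllowed) {
        // Clean election: first live replica that is still in the ISR.
        for (int broker : assignment) {
            if (liveBrokers.contains(broker) && isr.contains(broker)) {
                return Optional.of(broker);
            }
        }
        // Unclean election: any live replica, even one that is behind.
        if (uncleanAllowed) {
            for (int broker : assignment) {
                if (liveBrokers.contains(broker)) {
                    return Optional.of(broker);
                }
            }
        }
        return Optional.empty();
    }
}
```

With assignment [1, 2, 3], ISR {1, 2}, and only brokers 2 and 3 alive, the sketch returns broker 2. If only broker 3 is alive, it returns empty unless `uncleanAllowed` is true.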

Preferred leader election:

[Code block not preserved: Bash, 15 lines]

2. ISR (In-Sync Replicas) Deep Dive

ISR is the set of replicas that are "caught up" with the leader.

What determines "in sync"?

[Code block not preserved: Properties, 6 lines]

ISR dynamics:

┌─────────────────────────────────────────────────────────────────┐
│                    ISR SHRINKING AND EXPANDING                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   TIME 0: Normal operation                                      │
│   ISR = {Leader, Follower1, Follower2}                          │
│                                                                 │
│   TIME 1: Follower2 becomes slow (network issue)                │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  Leader:     offset 1000                                │   │
│   │  Follower1:  offset 998   (within lag threshold) ✓      │   │
│   │  Follower2:  offset 850   (hasn't fetched in 35s) ✗     │   │
│   └─────────────────────────────────────────────────────────┘   │
│   Leader removes Follower2 from ISR                             │
│   ISR = {Leader, Follower1}                                     │
│                                                                 │
│   TIME 2: Follower2 recovers, starts catching up                │
│   ┌──────────────────────────────────────────────────────────┐  │
│   │  Leader:     offset 1200                                 │  │
│   │  Follower1:  offset 1198                                 │  │
│   │  Follower2:  offset 1195  (catching up, fetched recently)│  │
│   └──────────────────────────────────────────────────────────┘  │
│   Follower2 added back to ISR                                   │
│   ISR = {Leader, Follower1, Follower2}                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
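
The shrink and expand behavior above can be modeled as a time-based check. The sketch below is an assumption-laden simplification: `IsrTracker` and its methods are hypothetical names, and it treats a follower as in sync when the time since it was last fully caught up is within a window that stands in for replica.lag.time.max.ms.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class IsrTracker {
    private final long maxLagMs; // stands in for replica.lag.time.max.ms
    private final Map<Integer, Long> lastCaughtUpMs = new LinkedHashMap<>();

    IsrTracker(long maxLagMs) {
        this.maxLagMs = maxLagMs;
    }

    /** Records the time at which a follower was last caught up to the leader's log end. */
    void recordCaughtUp(int brokerId, long nowMs) {
        lastCaughtUpMs.put(brokerId, nowMs);
    }

    /** Returns the followers whose last caught-up time is within the lag window. */
    Set<Integer> inSyncFollowers(long nowMs) {
        Set<Integer> inSync = new LinkedHashSet<>();
        for (Map.Entry<Integer, Long> entry : lastCaughtUpMs.entrySet()) {
            if (nowMs - entry.getValue() <= maxLagMs) {
                inSync.add(entry.getKey());
            }
        }
        return inSync;
    }
}
```

With a 30-second window, a follower that last caught up 35 seconds ago drops out of the returned set, and it reappears as soon as a new caught-up time is recorded for it.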

Monitoring ISR:

[Code block not preserved: Bash, 11 lines]

3. High Watermark (HW) and Log End Offset (LEO)

These are crucial for understanding data visibility:

┌─────────────────────────────────────────────────────────────────┐
│                    HW vs LEO EXPLAINED                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   LEADER:                                                       │
│   [0][1][2][3][4][5][6][7][8][9][10][11][12]                    │
│                             ↑          ↑                        │
│                            HW=9       LEO=12                    │
│                                                                 │
│   FOLLOWER 1 (ISR):                                             │
│   [0][1][2][3][4][5][6][7][8][9][10]                            │
│                             ↑     ↑                             │
│                            HW=9  LEO=10                         │
│                                                                 │
│   FOLLOWER 2 (ISR):                                             │
│   [0][1][2][3][4][5][6][7][8][9]                                │
│                             ↑ ↑                                 │
│                           HW=LEO=9                              │
│                                                                 │
│   HW = minimum LEO across all ISR replicas                      │
│   HW = 9 (because Follower2's LEO is 9)                         │
│                                                                 │
│   VISIBILITY:                                                   │
│   • Consumers can read: [0] to [9] (up to HW)                   │
│   • Messages [10-12]: written but not yet "committed"           │
│   • If leader fails before [10-12] replicate: DATA LOST         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
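
The rule "HW = minimum LEO across all ISR replicas" is small enough to write down directly. The class and method names below are hypothetical, and the offsets follow this article's convention (LEO is the last offset written).

```java
import java.util.Collection;
import java.util.List;

final class HighWatermark {
    /** Returns the smallest log end offset among the ISR members (leader included). */
    static long compute(Collection<Long> isrLogEndOffsets) {
        if (isrLogEndOffsets.isEmpty()) {
            throw new IllegalArgumentException("ISR must contain at least the leader");
        }
        long hw = Long.MAX_VALUE;
        for (long leo : isrLogEndOffsets) {
            hw = Math.min(hw, leo);
        }
        return hw;
    }

    public static void main(String[] args) {
        // Leader LEO=12, Follower 1 LEO=10, Follower 2 LEO=9, as in the diagram.
        System.out.println(compute(List.of(12L, 10L, 9L)));
    }
}
```

For the diagram's values (12, 10, 9) the method returns 9, which is why consumers cannot yet read offsets 10 to 12.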

Why this matters for acks:

[Code block not preserved: Java, 11 lines]

4. min.insync.replicas

This is your safety net:

[Code block not preserved: Properties, 6 lines]

How it works:

┌─────────────────────────────────────────────────────────────────┐
│            min.insync.replicas BEHAVIOR                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Config: RF=3, min.insync.replicas=2, acks=all                 │
│                                                                 │
│   SCENARIO 1: All brokers healthy                               │
│   ISR = {B1, B2, B3}  (size = 3)                                │
│   3 >= 2 ✓  Writes succeed                                      │
│                                                                 │
│   SCENARIO 2: One follower slow/down                            │
│   ISR = {B1, B2}  (size = 2)                                    │
│   2 >= 2 ✓  Writes succeed                                      │
│                                                                 │
│   SCENARIO 3: Two followers down                                │
│   ISR = {B1}  (size = 1)                                        │
│   1 < 2 ✗  Writes FAIL with NotEnoughReplicasException          │
│                                                                 │
│   This protects you:                                            │
│   • Can't write to single replica that might fail               │
│   • Guarantees data on at least 2 machines                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
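
The three scenarios reduce to one comparison that only applies when the producer uses acks=all. A minimal sketch, with hypothetical class, enum, and method names:

```java
final class MinIsrCheck {
    enum Result { ACCEPTED, NOT_ENOUGH_REPLICAS }

    /** Models the broker-side decision for a produce request. */
    static Result check(String acks, int isrSize, int minInsyncReplicas) {
        // min.insync.replicas is only enforced for acks=all (also written as -1).
        boolean acksAll = acks.equals("all") || acks.equals("-1");
        if (acksAll && isrSize < minInsyncReplicas) {
            return Result.NOT_ENOUGH_REPLICAS;
        }
        return Result.ACCEPTED;
    }
}
```

Note the third branch of behavior this makes visible: with acks=1, a write to a partition whose ISR has shrunk to the leader alone is still accepted, so min.insync.replicas gives no protection unless producers also use acks=all.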

Recommended settings:

RF	min.insync.replicas	Fault Tolerance
3	2	Survives 1 broker failure
5	3	Survives 2 broker failures
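
The numbers in this table follow from simple arithmetic, sketched here with hypothetical names. It assumes acks=all and unclean leader election disabled.

```java
final class FaultToleranceMath {
    /** Broker failures after which acks=all writes are still accepted. */
    static int failuresKeepingWritesAvailable(int replicationFactor, int minInsyncReplicas) {
        return replicationFactor - minInsyncReplicas;
    }

    /** Broker failures an acknowledged write survives: it sits on at least minInsyncReplicas brokers. */
    static int failuresAckedDataSurvives(int minInsyncReplicas) {
        return minInsyncReplicas - 1;
    }

    public static void main(String[] args) {
        System.out.println(failuresKeepingWritesAvailable(3, 2)); // RF=3, min ISR=2
        System.out.println(failuresKeepingWritesAvailable(5, 3)); // RF=5, min ISR=3
    }
}
```

For RF=3 with min.insync.replicas=2 both values are 1, and for RF=5 with min.insync.replicas=3 both are 2, matching the table.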

Setting in Spring Kafka:

[Code block not preserved: Java, 8 lines]

5. Unclean Leader Election

The most dangerous configuration in Kafka:

[Code block not preserved: Properties, 6 lines]

When unclean election happens:

┌─────────────────────────────────────────────────────────────────┐
│                UNCLEAN LEADER ELECTION                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   INITIAL STATE:                                                │
│   Broker 1 (Leader):  offset 1000                               │
│   Broker 2 (ISR):     offset 1000                               │
│   Broker 3 (not ISR): offset 950  (lagging)                     │
│                                                                 │
│   DISASTER: Brokers 1 and 2 fail simultaneously                 │
│                                                                 │
│   IF unclean.leader.election.enable=false:                      │
│   • Partition becomes unavailable                               │
│   • Writes fail until B1 or B2 recover                          │
│   • No data loss                                                │
│                                                                 │
│   IF unclean.leader.election.enable=true:                       │
│   • Broker 3 elected as leader                                  │
│   • New leader at offset 950                                    │
│   • Messages 951-1000: GONE FOREVER                             │
│   • When B1/B2 recover, they truncate to 950                    │
│   • Partition available, but data lost                          │
│                                                                 │
│   RECOMMENDATION: Always set to false for important data        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
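
The truncation step is where the loss becomes permanent. The sketch below models a log as a list of messages and assumes the new leader's log is a strict prefix of the recovering replica's log; real Kafka reconciles logs using leader epochs, so treat this as an illustration only. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class UncleanTruncation {
    /**
     * Shrinks replicaLog in place to the new leader's length and
     * returns the discarded messages in their original order.
     */
    static List<String> truncateToLeader(List<String> replicaLog, int newLeaderLogSize) {
        List<String> discarded = new ArrayList<>();
        while (replicaLog.size() > newLeaderLogSize) {
            // Remove from the tail; insert at the front to preserve order.
            discarded.add(0, replicaLog.remove(replicaLog.size() - 1));
        }
        return discarded;
    }
}
```

Applied to the diagram, a recovering Broker 1 holding 1000 messages and a new leader holding 950 would discard the last 50, even though producers had already received acknowledgments for them.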

6. The Write Path: End-to-End

Understanding how writes flow helps diagnose issues:

┌─────────────────────────────────────────────────────────────────┐
│                    WRITE PATH (acks=all)                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. PRODUCER SENDS                                             │
│      Producer → Leader (Broker 1)                               │
│      Message appended to leader's log                           │
│      Leader LEO: 100 → 101                                      │
│                                                                 │
│   2. FOLLOWERS FETCH                                            │
│      Followers poll leader for new data                         │
│      Broker 2 fetches, LEO: 100 → 101                           │
│      Broker 3 fetches, LEO: 100 → 101                           │
│                                                                 │
│   3. ACKNOWLEDGE REPLICATION                                    │
│      Followers send fetch response with their LEO               │
│      Leader updates HW when all ISR caught up                   │
│      HW: 100 → 101                                              │
│                                                                 │
│   4. LEADER ACKNOWLEDGES PRODUCER                               │
│      Leader sends ack to producer                               │
│      Producer marks message as sent                             │
│                                                                 │
│   5. FOLLOWERS UPDATE HW                                        │
│      Next fetch request, followers learn new HW                 │
│      Followers update local HW: 100 → 101                       │
│                                                                 │
│   TIMING:                                                       │
│   Steps 1-4: ~5-10ms (same datacenter)                          │
│   Bottleneck: Slowest ISR replica                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
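
The write path can be replayed with a small state machine. This is a sketch under simplifying assumptions: every follower passed to the constructor is in the ISR, a fetch copies everything up to the leader's LEO in one step, and the leader learns the follower's position immediately rather than on the following fetch. Names are hypothetical, and offsets follow the article's convention.

```java
import java.util.HashMap;
import java.util.Map;

final class WritePathSim {
    private long leaderLeo;
    private long highWatermark;
    private final Map<Integer, Long> followerLeo = new HashMap<>(); // ISR followers only

    WritePathSim(long startOffset, int... followerIds) {
        leaderLeo = startOffset;
        highWatermark = startOffset;
        for (int id : followerIds) {
            followerLeo.put(id, startOffset);
        }
    }

    /** Step 1: the leader appends one message. */
    void append() {
        leaderLeo++;
    }

    /** Steps 2-3: the follower copies up to the leader's LEO, then the leader recomputes HW. */
    void followerFetch(int followerId) {
        followerLeo.put(followerId, leaderLeo);
        long min = leaderLeo;
        for (long leo : followerLeo.values()) {
            min = Math.min(min, leo);
        }
        highWatermark = min;
    }

    /** Step 4: with acks=all, the producer is acknowledged once HW covers the offset. */
    boolean isAcked(long offset) {
        return offset <= highWatermark;
    }

    long highWatermark() { return highWatermark; }

    long leaderLeo() { return leaderLeo; }
}
```

Starting at offset 100 with followers 2 and 3, one append leaves HW at 100 until both followers have fetched; only then does HW reach 101 and the producer's acknowledgment go out. This is the sense in which the slowest ISR replica is the bottleneck.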

7. Failure Scenarios and Recovery

Scenario 1: Leader fails, ISR available

BEFORE:
  Broker 1 (Leader, ISR): offset 1000
  Broker 2 (ISR):         offset 1000
  Broker 3 (ISR):         offset 998

BROKER 1 FAILS:
  Controller detects failure
  New leader: Broker 2 (first in ISR)
  Broker 3 catches up to 1000

RESULT:
  No data loss
  ~1 second unavailability
  Broker 3 now at offset 1000

Scenario 2: Leader fails with uncommitted data

BEFORE:
  Broker 1 (Leader): LEO=1005, HW=1000
  Broker 2 (ISR):    LEO=1000
  Broker 3 (ISR):    LEO=1000

  Messages 1001-1005: In leader only, not yet replicated

BROKER 1 FAILS:
  New leader: Broker 2
  Leader LEO becomes 1000
  HW remains 1000

BROKER 1 RECOVERS:
  Broker 1 rejoins as a follower and truncates its log to offset 1000
  Messages 1001-1005: LOST

LESSON:
  With acks=1, this is possible
  With acks=all, messages 1001-1005 wouldn't be acked
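
The difference between the two acks settings in this scenario can be stated as arithmetic: acks=1 acknowledges up to the leader's LEO, acks=all only up to the HW. A hedged sketch with hypothetical names, using the article's offset convention:

```java
final class AckedLoss {
    /** Highest offset the producer has been told is safe. */
    static long ackedUpTo(String acks, long leaderLeo, long highWatermark) {
        switch (acks) {
            case "1":
                return leaderLeo;        // acknowledged as soon as the leader has it
            case "all":
            case "-1":
                return highWatermark;    // acknowledged only once the ISR has it
            default:
                throw new IllegalArgumentException("unsupported acks: " + acks);
        }
    }

    /** Acknowledged messages that vanish when a follower at newLeaderLeo takes over. */
    static long ackedButLost(String acks, long leaderLeo, long highWatermark, long newLeaderLeo) {
        return Math.max(0, ackedUpTo(acks, leaderLeo, highWatermark) - newLeaderLeo);
    }
}
```

With LEO=1005, HW=1000, and a new leader at 1000, the sketch reports 5 acknowledged-but-lost messages for acks=1 and 0 for acks=all.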

Spring Kafka producer configuration for durability:

[Code block not preserved: Java, 23 lines]
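
As a rough stand-in, here is a sketch of durability-oriented producer settings built with plain java.util.Properties so that it runs without any Kafka or Spring dependency. The class name and the choice of String serializers are assumptions; the property keys are standard Kafka producer configs, and the same entries could be placed in the configuration map given to a Spring Kafka producer factory.

```java
import java.util.Properties;

final class DurableProducerConfig {
    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Wait for every in-sync replica before treating a send as successful.
        props.setProperty("acks", "all");
        // Let retries happen without introducing duplicates.
        props.setProperty("enable.idempotence", "true");
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        // Idempotence requires at most 5 in-flight requests per connection.
        props.setProperty("max.in.flight.requests.per.connection", "5");
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" is a placeholder address.
        build("localhost:9092").forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

These producer settings only deliver the intended guarantee when the topic also has min.insync.replicas set to 2 or more.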

⚠️ Common Mistakes

Mistake 1: Using acks=1 for Critical Data

[Code block not preserved: Java, 6 lines]

Mistake 2: min.insync.replicas = RF

[Code block not preserved: Properties, 5 lines]

Mistake 3: Ignoring Under-Replicated Partitions

[Code block not preserved: Bash, 6 lines]

Mistake 4: Setting replica.lag.time.max.ms Too Low

[Code block not preserved: Properties, 8 lines]

Mistake 5: Enabling Unclean Leader Election

[Code block not preserved: Properties, 5 lines]

🐛 Debug This

You see this error in producer logs:

org.apache.kafka.common.errors.NotEnoughReplicasException:
Messages are rejected since there are fewer in-sync replicas than required.

Your configuration:

RF = 3
min.insync.replicas = 2
acks = all

All three brokers are running. What's wrong?


The error means: ISR size < min.insync.replicas for the partition you're writing to.

Investigation steps:

[Code block not preserved: Bash, 8 lines]

Possible causes:

Followers can't keep up: High produce rate, followers falling behind

[Code block not preserved: Bash, 2 lines]

Network issues: Followers can't reach leader

[Code block not preserved: Bash, 2 lines]

Disk issues: Followers' disks are slow

[Code block not preserved: Bash, 2 lines]

GC pauses: Long GC pauses cause followers to be removed

[Code block not preserved: Bash, 2 lines]

Quick fix:

[Code block not preserved: Bash, 4 lines]

Real fix: Identify why followers are out of ISR and fix the root cause.
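
When triaging, it helps to separate "degraded but writable" from "rejecting writes". A minimal classification sketch, assuming you have already obtained the replication factor, ISR size, and leader status for a partition (for example from the output of a topic describe command); the class and enum names are hypothetical:

```java
final class PartitionHealth {
    enum Status { HEALTHY, UNDER_REPLICATED, UNDER_MIN_ISR, OFFLINE }

    static Status classify(int replicationFactor, int isrSize,
                           int minInsyncReplicas, boolean hasLeader) {
        if (!hasLeader) {
            return Status.OFFLINE;            // no reads or writes possible
        }
        if (isrSize < minInsyncReplicas) {
            return Status.UNDER_MIN_ISR;      // acks=all writes are rejected
        }
        if (isrSize < replicationFactor) {
            return Status.UNDER_REPLICATED;   // writable, but with reduced redundancy
        }
        return Status.HEALTHY;
    }
}
```

In the Debug This case (RF=3, min.insync.replicas=2, all brokers running), the error implies the affected partition classifies as UNDER_MIN_ISR: the brokers are up, but at least two replicas of that partition have dropped out of the ISR.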

💻 Exercises

Exercise 1: ISR Observation

[Code block not preserved: Bash, 7 lines]

Exercise 2: min.insync.replicas Testing

[Code block not preserved: Java, 5 lines]

Exercise 3: Unclean Election Simulation

[Code block not preserved: Bash, 6 lines]

Exercise 4: High Watermark Monitoring

[Code block not preserved: Bash, 4 lines]

Exercise 5: Preferred Leader Election

[Code block not preserved: Bash, 5 lines]

🎤 Interview Questions

Q1: What is the ISR and why is it important?

Answer: ISR (In-Sync Replicas) is the set of replicas that are caught up with the leader within replica.lag.time.max.ms.

Importance:

Leader election: Only ISR members can become leader (with unclean election disabled), so a new leader always holds every committed message.
Write durability: With acks=all, writes are only acknowledged when replicated to all ISR members.
High watermark: HW advances only when all ISR replicas have the data. Consumers can only read up to HW.

ISR dynamics:

Replica falls behind > replica.lag.time.max.ms → Removed from ISR
Replica catches up and fetches within threshold → Added back to ISR

Monitoring: Under-replicated partitions (ISR < RF) indicate potential durability risk.

Q2: Explain the relationship between acks, min.insync.replicas, and data durability.

Answer: These settings work together to determine durability guarantees:

acks=0: No durability guarantee
        Producer doesn't wait for any acknowledgment
        Message may be lost before reaching broker

acks=1: Leader durability only
        Leader acknowledges before replication
        If leader fails immediately after ack: DATA LOSS

acks=all + min.insync.replicas=1:
        Waits for every replica currently in the ISR
        If the ISR shrinks to the leader alone, behaves like acks=1

acks=all + min.insync.replicas=2:
        At least 2 replicas must have the data
        Survives 1 broker failure without data loss

acks=all + min.insync.replicas=N:
        At least N replicas must have the data
        If ISR < N, writes fail (availability trade-off)

Recommendation for critical data:

[Code block not preserved: Properties, 3 lines]

This survives 1 broker failure without data loss or unavailability.

Q3: What is unclean leader election and when might you enable it?

Answer: Unclean leader election allows electing a leader from replicas that are NOT in the ISR—meaning replicas that may be behind.

Risk: Data loss. The new leader may be missing messages that were acknowledged.

When to enable (rare):

Availability is more important than consistency
Log data that can be regenerated
You have other durability mechanisms (e.g., source system)

When to disable (default, recommended):

Financial transactions
Critical business data
Any data that cannot be recovered

The trade-off:

unclean.leader.election.enable=false:
  ISR exhausted → Partition unavailable
  No data loss, but writes fail

unclean.leader.election.enable=true:
  ISR exhausted → Non-ISR replica becomes leader
  Writes succeed, but may lose data

Q4: How does Kafka's high watermark (HW) work?

Answer: High watermark is the offset up to which consumers can read. It ensures consumers only see "committed" data.

Mechanics:

Producer writes message to leader at offset N
Leader's LEO (Log End Offset) advances to N
Followers fetch and replicate message
When ALL ISR replicas have the message, HW advances to N
Consumers can now read offset N

Why it matters:

LEO = 100, HW = 95

Messages 96-100: Written but not fully replicated
If leader fails: Messages 96-100 may be lost
Consumers can't read them yet (can only read up to HW=95)

This is intentional: Consumers never see data that might be lost

Visibility guarantee: Consumers never see messages that could disappear on failure.

Q5: A partition has ISR={Leader}. What risks does this pose and how would you address it?

Answer: ISR size of 1 means no redundancy—critical risk.

Risks:

Data loss: If the leader fails and its log cannot be recovered (or an unclean election follows), every message written since the followers left the ISR is lost
With min.insync.replicas=2: Writes fail (safer)
With min.insync.replicas=1: Writes succeed but unprotected

Investigation:

[Code block not preserved: Bash, 8 lines]

Remediation:

Immediate: Fix follower issues so they rejoin ISR
If broker dead: Reassign partition to healthy brokers
Monitoring: Alert on ISR < RF before it becomes critical

Prevention:

[Code block not preserved: Properties, 5 lines]

📝 Summary & Key Takeaways

Key Concepts

Concept	Definition
Leader	Handles all reads/writes for a partition
ISR	Replicas caught up with leader
LEO	Log End Offset - last message in log
HW	High Watermark - last committed offset

Durability Configuration

[Code block not preserved: Properties, 8 lines]

Failure Behavior

Scenario	Clean Election	Unclean Election
Leader fails, ISR available	New leader from ISR, no data loss	Same
All ISR replicas fail	Partition unavailable until an ISR replica returns	Non-ISR replica elected; data loss possible

📋 Quick Reference

[Code block not preserved: Bash, 13 lines]

📅 Review Schedule

Day 1: Understand ISR and HW concepts
Day 3: Practice ISR monitoring
Day 7: Test failure scenarios
Day 14: Review acks/min.insync.replicas interaction
Day 30: Explain durability guarantees without notes

Previous: Part 2 - Partitions & Replication
Next: Part 4 - Cluster Coordination (KRaft vs ZooKeeper)
Index: Kafka Compendium Series