Cluster Coordination (KRaft vs ZooKeeper)


At a Glance

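┌───────────────┬────────────────────────────────────────────────────────────────┐
│ Aspect        │ Details                                                        │
├───────────────┼────────────────────────────────────────────────────────────────┤
│ Topic         │ Cluster coordination, metadata management, controller election │
│ Complexity    │ Intermediate                                                   │
│ Prerequisites │ Parts 1-3 (Architecture, Partitions, Fault Tolerance)          │
│ Time          │ 90 minutes                                                     │
│ Kafka Version │ 3.6+ (KRaft production-ready)                                  │
└───────────────┴────────────────────────────────────────────────────────────────┘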

What You'll Learn

After completing this article, you will be able to:

  1. Explain why ZooKeeper is being removed from Kafka's architecture
  2. Describe KRaft's controller quorum and how it handles metadata
  3. Configure a KRaft-based Kafka cluster for production
  4. Plan migration from ZooKeeper to KRaft mode
  5. Troubleshoot controller election and metadata propagation issues

Production Story: The ZooKeeper Session Timeout Storm

The Incident

It was Black Friday, and our e-commerce platform was handling 5x normal traffic. At 2:47 PM, alerts started firing: "Consumer lag increasing across all topics." Within minutes, the entire Kafka cluster became unresponsive.

The Investigation

BASH(5 lines)

The cluster had 15 brokers, 200+ consumers, and 50+ producers - all maintaining ZooKeeper sessions. Under extreme load:

  1. GC pauses on ZooKeeper nodes exceeded session timeout
  2. Session expirations triggered mass reconnections
  3. Reconnection storm overwhelmed ZooKeeper
  4. Broker disconnections caused controller failover
  5. Cascading failures across the entire cluster
Timeline of Chaos:
14:47:00 - ZK node 1: Long GC pause (8 seconds)
14:47:08 - 500+ sessions expire simultaneously
14:47:09 - Reconnection storm begins
14:47:15 - ZK node 2 overwhelmed, stops responding
14:47:20 - Controller broker loses ZK session
14:47:21 - Controller election starts
14:47:45 - New controller elected, but ZK still struggling
14:48:00 - Brokers can't update metadata
14:48:30 - Producers start timing out
14:49:00 - Full cluster unavailability

The Root Cause

ZooKeeper's architecture wasn't designed for Kafka's scale:

┌─────────────────────────────────────────────────────────┐
│                   ZooKeeper Cluster                     │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐              │
│  │  ZK-1   │    │  ZK-2   │    │  ZK-3   │              │
│  │ (Leader)│◄──►│(Follower│◄──►│(Follower│              │
│  └────┬────┘    └────┬────┘    └────┬────┘              │
│       │              │              │                   │
└───────┼──────────────┼──────────────┼───────────────────┘
        │              │              │
        ▼              ▼              ▼
   ┌─────────────────────────────────────────────┐
   │        ALL connections go to ZK             │
   │                                             │
   │  15 Brokers × 1 connection = 15             │
   │  200 Consumers × 1 connection = 200         │
   │  50 Producers (old clients) = 50            │
   │  Controller = 1                             │
   │  ─────────────────────────────              │
   │  Total: 266+ persistent connections         │
   │  + All their watches and ephemeral nodes    │
   └─────────────────────────────────────────────┘
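
The connection arithmetic in the diagram, together with the expiry condition behind the timeline, can be sketched in a few lines of Java. The class and method names are hypothetical, and the 6-second session timeout used in `main` is an assumed value, not one taken from the incident.

```java
final class ZkSessionMath {
    /** Persistent ZooKeeper sessions, counted the same way the diagram counts them. */
    static int totalSessions(int brokers, int consumers, int producers, int controllers) {
        return brokers + consumers + producers + controllers;
    }

    /** A session is dropped when a pause lasts longer than the session timeout. */
    static boolean sessionExpires(long pauseMs, long sessionTimeoutMs) {
        return pauseMs > sessionTimeoutMs;
    }

    public static void main(String[] args) {
        // Numbers from the diagram: 15 brokers, 200 consumers, 50 producers, 1 controller.
        System.out.println(totalSessions(15, 200, 50, 1)); // 266
        // 8-second GC pause from the timeline against an ASSUMED 6-second timeout.
        System.out.println(sessionExpires(8_000, 6_000));  // true
    }
}
```

Every one of those sessions also carries watches and ephemeral nodes, so the work ZooKeeper does on a mass expiry grows faster than the raw session count suggests.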

The Fix (Short-term)

PROPERTIES(11 lines)

The Real Solution: KRaft Migration

We migrated to KRaft mode, eliminating ZooKeeper entirely. Result:

  • No more session storms - clients don't connect to controllers
  • Faster failover - a follower controller already holds the metadata log, so takeover does not wait on a ZK session timeout or a metadata reload
  • Simplified operations - one system instead of two
  • Better scalability - tested to millions of partitions

Mental Model: ZooKeeper vs KRaft Architecture

ZooKeeper Mode (Legacy)

┌─────────────────────────────────────────────────────────────┐
│                    ZOOKEEPER MODE                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────────────┐     ┌──────────────────────┐      │
│  │   ZooKeeper Cluster  │     │    Kafka Cluster     │      │
│  │  ┌────┐ ┌────┐ ┌────┐│     │ ┌────┐ ┌────┐ ┌────┐ │      │
│  │  │ZK-1│ │ZK-2│ │ZK-3││     │ │ B1 │ │ B2 │ │ B3 │ │      │
│  │  └──┬─┘ └──┬─┘ └──┬─┘│     │ │    │ │CTRL│ │    │ │      │
│  │     │      │      │  │     │ └──┬─┘ └──┬─┘ └──┬─┘ │      │
│  │     └──────┼──────┘  │     │    │      │      │   │      │
│  │            │         │     │    └──────┼──────┘   │      │
│  └────────────┼─────────┘     └───────────┼──────────┘      │
│               │                           │                 │
│               └───────────┬───────────────┘                 │
│                           │                                 │
│                    ZK Connection                            │
│              (All brokers connect to ZK)                    │
│                                                             │
│  Metadata stored in: ZooKeeper znodes                       │
│  Controller election: Via ZK ephemeral node                 │
│  Broker registration: ZK ephemeral nodes                    │
│  Config changes: Written to ZK, brokers watch               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

KRaft Mode (Modern)

┌─────────────────────────────────────────────────────────────┐
│                      KRAFT MODE                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Kafka Cluster (Self-Managed)           │    │
│  │                                                     │    │
│  │   Controllers (Quorum)         Brokers              │    │
│  │  ┌─────────────────────┐    ┌──────────────────┐    │    │
│  │  │ ┌────┐ ┌────┐ ┌────┐│    │ ┌────┐    ┌────┐ │    │    │
│  │  │ │ C1 │ │ C2 │ │ C3 ││    │ │ B1 │    │ B2 │ │    │    │
│  │  │ │ACT │ │FLWR│ │FLWR││    │ │    │    │    │ │    │    │
│  │  │ └──┬─┘ └──┬─┘ └──┬─┘│    │ └──┬─┘    └──┬─┘ │    │    │
│  │  │    │      │      │  │    │    │         │   │    │    │
│  │  │    └──────┼──────┘  │    │    └────┬────┘   │    │    │
│  │  │           │         │    │         │        │    │    │
│  │  └───────────┼─────────┘    └─────────┼────────┘    │    │
│  │              │                        │             │    │
│  │              └────────────────────────┘             │    │
│  │                    Metadata Fetch                   │    │
│  │        (Brokers fetch from active controller)       │    │
│  │                                                     │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  Metadata stored in: __cluster_metadata topic (Raft log)    │
│  Controller election: Raft consensus                        │
│  Broker registration: Metadata records                      │
│  Config changes: Replicated via Raft                        │
│                                                             │
│  NO ZOOKEEPER NEEDED!                                       │
└─────────────────────────────────────────────────────────────┘

Key Architectural Differences

┌────────────────────┬─────────────────────┬─────────────────────┐
│      Aspect        │     ZooKeeper       │       KRaft         │
├────────────────────┼─────────────────────┼─────────────────────┤
│ Metadata Storage   │ ZK znodes           │ __cluster_metadata  │
│                    │ (external system)   │ (internal topic)    │
├────────────────────┼─────────────────────┼─────────────────────┤
│ Controller         │ One active          │ Quorum (3-5 nodes)  │
│ Architecture       │ (others standby)    │ (Raft consensus)    │
├────────────────────┼─────────────────────┼─────────────────────┤
│ Failover Time      │ Seconds to minutes  │ Milliseconds        │
│                    │ (ZK session timeout)│ (Raft heartbeat)    │
├────────────────────┼─────────────────────┼─────────────────────┤
│ Scalability        │ ~200K partitions    │ Millions of         │
│                    │ (ZK is bottleneck)  │ partitions          │
├────────────────────┼─────────────────────┼─────────────────────┤
│ Client Connections │ Clients → ZK        │ Clients → Brokers   │
│                    │ (for old clients)   │ (no ZK contact)     │
├────────────────────┼─────────────────────┼─────────────────────┤
│ Operational        │ Two systems         │ One system          │
│ Complexity         │ (ZK + Kafka)        │ (Kafka only)        │
└────────────────────┴─────────────────────┴─────────────────────┘

Deep Dive

1. What ZooKeeper Did for Kafka

Before understanding KRaft, let's appreciate what ZooKeeper handled:

ZooKeeper's Responsibilities in Kafka:

1. CONTROLLER ELECTION
   /controller → {"brokerid": 2, "timestamp": ...}
   (Ephemeral node - disappears when broker dies)

2. BROKER REGISTRATION
   /brokers/ids/1 → {"host": "broker1", "port": 9092, ...}
   /brokers/ids/2 → {"host": "broker2", "port": 9092, ...}
   (Ephemeral nodes for liveness detection)

3. TOPIC CONFIGURATION
   /brokers/topics/orders → {"partitions": {"0": [1,2,3], ...}}
   /config/topics/orders → {"retention.ms": "604800000"}

4. PARTITION LEADERSHIP
   /brokers/topics/orders/partitions/0/state →
   {"leader": 1, "isr": [1,2,3], "controller_epoch": 5}

5. ACLs AND QUOTAS
   /kafka-acl/Topic/orders → [acl entries]
   /config/users/alice → {"producer_byte_rate": "1000000"}

6. CONSUMER GROUP OFFSETS (Legacy)
   /consumers/my-group/offsets/orders/0 → "12345"
   (Modern Kafka uses __consumer_offsets topic instead)
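
The ephemeral-node election in item 1 can be illustrated with a toy in-memory model. This is only a sketch of the idea (the first session to create the path wins, and the node vanishes when that session expires); `EphemeralElection` and its methods are hypothetical and do not use the ZooKeeper client API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class EphemeralElection {
    // path -> owning session id; a stand-in for ZooKeeper's ephemeral znodes
    private final Map<String, Integer> ephemeralNodes = new HashMap<>();

    /** Creating an ephemeral node succeeds only if the path does not exist yet. */
    boolean tryCreate(String path, int sessionId) {
        return ephemeralNodes.putIfAbsent(path, sessionId) == null;
    }

    /** When a session expires, every ephemeral node it owns disappears. */
    void expireSession(int sessionId) {
        ephemeralNodes.values().removeIf(owner -> owner == sessionId);
    }

    Optional<Integer> owner(String path) {
        return Optional.ofNullable(ephemeralNodes.get(path));
    }

    public static void main(String[] args) {
        EphemeralElection zk = new EphemeralElection();
        System.out.println(zk.tryCreate("/controller", 2)); // broker 2 wins
        System.out.println(zk.tryCreate("/controller", 1)); // broker 1 loses
        zk.expireSession(2);                                // broker 2's session dies
        System.out.println(zk.tryCreate("/controller", 1)); // broker 1 takes over
    }
}
```

The same mechanism explains the incident above: when sessions expire in bulk, the `/controller` and `/brokers/ids/*` nodes vanish together and every survivor races to recreate them.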

Problems with ZooKeeper Dependency

JAVA(29 lines)

2. KRaft Architecture Deep Dive

KRaft (Kafka Raft) replaces ZooKeeper with a built-in consensus protocol:

┌───────────────────────────────────────────────────────────────┐
│                    KRAFT CONTROLLER QUORUM                    │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐        │
│  │ Controller 1│    │ Controller 2│    │ Controller 3│        │
│  │   (ACTIVE)  │    │  (FOLLOWER) │    │  (FOLLOWER) │        │
│  │             │    │             │    │             │        │
│  │  Raft Log:  │    │  Raft Log:  │    │  Raft Log:  │        │
│  │  ┌────────┐ │    │  ┌────────┐ │    │  ┌────────┐ │        │
│  │  │Record 1│ │    │  │Record 1│ │    │  │Record 1│ │        │
│  │  │Record 2│ │    │  │Record 2│ │    │  │Record 2│ │        │
│  │  │Record 3│ │    │  │Record 3│ │    │  │Record 3│ │        │
│  │  │   ...  │ │    │  │   ...  │ │    │  │   ...  │ │        │
│  │  └────────┘ │    │  └────────┘ │    │  └────────┘ │        │
│  │             │    │             │    │             │        │
│  │  In-Memory  │    │  In-Memory  │    │  In-Memory  │        │
│  │  Metadata   │    │  Metadata   │    │  Metadata   │        │
│  │  Cache      │    │  Cache      │    │  Cache      │        │
│  └─────┬───────┘    └──────┬──────┘    └──────┬──────┘        │
│        │                   │                  │               │
│        │         Raft Replication             │               │
│        └───────────────────┼──────────────────┘               │
│                            │                                  │
│                            ▼                                  │
│              __cluster_metadata topic                         │
│              (The Raft log, single partition)                 │
│                                                               │
└───────────────────────────────────────────────────────────────┘
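
The relationship between the Raft log and the in-memory metadata cache in the diagram can be sketched as a replay loop. The record shape below is a deliberate simplification with hypothetical type names ("TOPIC", "CONFIG", "REMOVE_TOPIC"); real KRaft records carry more fields and many more types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MetadataLogSketch {
    /** Hypothetical, heavily simplified metadata record. */
    static final class Rec {
        final String type;  // "TOPIC", "CONFIG" or "REMOVE_TOPIC"
        final String key;
        final String value;

        Rec(String type, String key, String value) {
            this.type = type;
            this.key = key;
            this.value = value;
        }
    }

    private final List<Rec> log = new ArrayList<>();

    /** Appends a record and returns its offset in the log. */
    int append(Rec rec) {
        log.add(rec);
        return log.size() - 1;
    }

    /** Rebuilds the in-memory cache by replaying every record in order. */
    Map<String, String> replay() {
        Map<String, String> cache = new TreeMap<>();
        for (Rec rec : log) {
            switch (rec.type) {
                case "TOPIC":        cache.put("topic/" + rec.key, rec.value); break;
                case "CONFIG":       cache.put("config/" + rec.key, rec.value); break;
                case "REMOVE_TOPIC": cache.remove("topic/" + rec.key); break;
                default: throw new IllegalArgumentException("Unknown record type: " + rec.type);
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        MetadataLogSketch log = new MetadataLogSketch();
        log.append(new Rec("TOPIC", "orders", "partitions=3"));
        log.append(new Rec("CONFIG", "orders", "retention.ms=604800000"));
        System.out.println(log.replay());
    }
}
```

Because every controller replays the same records in the same order, each one arrives at the same cache contents, which is what lets a follower take over without reloading state from elsewhere.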

Metadata Records in KRaft

JAVA(20 lines)

3. Controller Quorum Mechanics

RAFT CONSENSUS IN KRAFT:

┌─────────────────────────────────────────────────────────────┐
│                    LEADER ELECTION                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Initial state: No leader                                │
│     ┌────┐  ┌────┐  ┌────┐                                  │
│     │ C1 │  │ C2 │  │ C3 │  All followers                   │
│     └────┘  └────┘  └────┘                                  │
│                                                             │
│  2. Election timeout triggers (randomized)                  │
│     ┌────┐  ┌────┐  ┌────┐                                  │
│     │ C1 │──┼──┼──►│ C2 │  C1 times out first               │
│     │CAND│  │  │   │    │  Requests votes                   │
│     └────┘  │  │   └────┘                                   │
│             │  ▼                                            │
│             │ ┌────┐                                        │
│             └►│ C3 │                                        │
│               └────┘                                        │
│                                                             │
│  3. Votes granted (majority needed)                         │
│     ┌────┐  ┌────┐  ┌────┐                                  │
│     │ C1 │◄─┤VOTE├──│ C2 │  C1 gets 2 votes                 │
│     │    │  └────┘  │    │  (self + C2)                     │
│     │    │◄─┤VOTE├──│    │                                  │
│     └────┘  └────┘  └────┘                                  │
│       ▲               │                                     │
│       └───────────────┘                                     │
│                                                             │
│  4. Leader established                                      │
│     ┌────┐  ┌────┐  ┌────┐                                  │
│     │ C1 │  │ C2 │  │ C3 │                                  │
│     │LEAD│──►FLWR│  │FLWR│  C1 is leader                    │
│     └────┘  └────┘  └────┘  Sends heartbeats                │
│                                                             │
└─────────────────────────────────────────────────────────────┘
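
Step 2 of the election relies on randomized timeouts so that one node usually becomes a candidate before the others. A minimal sketch, assuming the classic Raft scheme of drawing each timeout uniformly from [base, 2 × base); Kafka's own timer logic may differ, and the names here are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

final class ElectionTimeouts {
    /** Draws one timeout per voter, uniformly from [baseMs, 2 * baseMs). */
    static long[] draw(int voters, long baseMs, Random rng) {
        long[] timeouts = new long[voters];
        for (int i = 0; i < voters; i++) {
            timeouts[i] = baseMs + (long) (rng.nextDouble() * baseMs);
        }
        return timeouts;
    }

    /** Index of the voter whose timer fires first and who therefore becomes candidate. */
    static int firstCandidate(long[] timeoutsMs) {
        int first = 0;
        for (int i = 1; i < timeoutsMs.length; i++) {
            if (timeoutsMs[i] < timeoutsMs[first]) {
                first = i;
            }
        }
        return first;
    }

    public static void main(String[] args) {
        long[] timeouts = draw(3, 1_000, new Random());
        System.out.println(Arrays.toString(timeouts));
        System.out.println("First candidate: C" + (firstCandidate(timeouts) + 1));
    }
}
```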

LOG REPLICATION:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  Leader (C1)              Followers (C2, C3)                │
│  ┌──────────────┐         ┌──────────────┐                  │
│  │ Log:         │         │ Log:         │                  │
│  │ [1] TopicA   │ ──────► │ [1] TopicA   │                  │
│  │ [2] Partition│ Append  │ [2] Partition│                  │
│  │ [3] Config   │ Entries │ [3] Config   │                  │
│  │ [4] Leader   │ ──────► │ [4] Leader   │                  │
│  └──────────────┘         └──────────────┘                  │
│                                                             │
│  Commit: Entry committed when majority acknowledges         │
│  [1] ✓ (3/3)  [2] ✓ (3/3)  [3] ✓ (2/3)  [4] ○ (1/3)         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
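
The commit rule in the diagram (an entry is committed once a majority of voters has stored it) reduces to picking the right element of the sorted replication offsets. The sketch below is a simplification with a hypothetical class name: full Raft additionally requires that the entry belongs to the leader's current term before it counts as committed.

```java
import java.util.Arrays;

final class RaftCommit {
    /**
     * Highest log index stored on a majority of voters.
     * lastStored[i] is the last index voter i has written (leader included).
     */
    static long committedIndex(long[] lastStored) {
        long[] sorted = lastStored.clone();
        Arrays.sort(sorted);
        int majority = sorted.length / 2 + 1;
        // The majority-th largest value is held by at least `majority` voters.
        return sorted[sorted.length - majority];
    }

    public static void main(String[] args) {
        // Matches the diagram: leader has entries 1-4, one follower 1-3, one follower 1-2.
        System.out.println(committedIndex(new long[] {4, 3, 2})); // 3
    }
}
```

In the diagram's terms, entry [3] is committed with 2/3 acknowledgements, while entry [4] stays uncommitted until a second voter stores it.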

4. KRaft Configuration

Controller-Only Nodes

PROPERTIES(21 lines)

Broker-Only Nodes

PROPERTIES(20 lines)

Combined Mode (Development)

PROPERTIES(16 lines)

5. Spring Kafka with KRaft

JAVA(43 lines)

6. Admin Operations in KRaft Mode

JAVA(86 lines)

7. Migration Path: ZooKeeper to KRaft

MIGRATION PHASES:

┌─────────────────────────────────────────────────────────────┐
│ Phase 1: PREPARATION                                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  • Upgrade to Kafka 3.6+ (migration production-ready)       │
│  • Ensure inter.broker.protocol.version = 3.6+              │
│  • Audit custom tooling for ZK dependencies                 │
│  • Plan controller node placement                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│ Phase 2: DEPLOY CONTROLLERS                                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐                  │
│  │   ZK1   │    │   ZK2   │    │   ZK3   │  (Still active)  │
│  └─────────┘    └─────────┘    └─────────┘                  │
│                                                             │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐                  │
│  │   C1    │    │   C2    │    │   C3    │  (New KRaft      │
│  │(standby)│    │(standby)│    │(standby)│   controllers)   │
│  └─────────┘    └─────────┘    └─────────┘                  │
│                                                             │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐                  │
│  │ Broker1 │    │ Broker2 │    │ Broker3 │  (Using ZK)      │
│  └─────────┘    └─────────┘    └─────────┘                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│ Phase 3: MIGRATION MODE                                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Set zookeeper.metadata.migration.enable=true on the        │
│  controllers and brokers, then roll the brokers             │
│                                                             │
│  ┌─────────┐         ┌─────────────────────┐                │
│  │   ZK    │ ──────► │  __cluster_metadata │                │
│  │ znodes  │  Copy   │       (KRaft)       │                │
│  └─────────┘         └─────────────────────┘                │
│                                                             │
│  Metadata migrated, both systems active temporarily         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│ Phase 4: DUAL-WRITE                                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Active KRaft controller writes to both its log and ZK      │
│                                                             │
│             ┌─────────┐                                     │
│             │Active C1│                                     │
│             └────┬────┘                                     │
│                  │                                          │
│         ┌───────┴───────┐                                   │
│         ▼               ▼                                   │
│    ┌─────────┐    ┌─────────┐                               │
│    │   ZK    │    │Raft log │                               │
│    │         │    │         │                               │
│    └─────────┘    └─────────┘                               │
│                                                             │
│  Validate: Both have consistent state                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│ Phase 5: KRAFT ONLY                                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Restart brokers in KRaft mode, then remove the migration   │
│  settings from the controllers and restart them             │
│                                                             │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐                  │
│  │   ZK1   │    │   ZK2   │    │   ZK3   │  (Shutdown)      │
│  │  STOP   │    │  STOP   │    │  STOP   │                  │
│  └─────────┘    └─────────┘    └─────────┘                  │
│                                                             │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐                  │
│  │   C1    │    │   C2    │    │   C3    │  (Active)        │
│  │ ACTIVE  │    │ FOLLWR  │    │ FOLLWR  │                  │
│  └─────────┘    └─────────┘    └─────────┘                  │
│                                                             │
│  ZooKeeper decommissioned!                                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
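
For runbooks and tooling, the five phases above can be modeled as an ordered state machine. This is a sketch only: the enum is hypothetical, and the rollback rule encodes an assumption (reverting to ZooKeeper is possible only while ZooKeeper is still being written to, that is, before the final phase) that should be checked against the migration documentation for your Kafka version.

```java
enum MigrationPhase {
    PREPARATION, DEPLOY_CONTROLLERS, MIGRATION_MODE, DUAL_WRITE, KRAFT_ONLY;

    /** Next phase in the diagram; the final phase maps to itself. */
    MigrationPhase next() {
        return this == KRAFT_ONLY ? this : values()[ordinal() + 1];
    }

    /** ASSUMPTION: rollback to ZooKeeper is only possible before KRAFT_ONLY. */
    boolean canRollBackToZk() {
        return this != KRAFT_ONLY;
    }

    public static void main(String[] args) {
        for (MigrationPhase phase : values()) {
            System.out.println(phase + " -> " + phase.next()
                    + " (rollback possible: " + phase.canRollBackToZk() + ")");
        }
    }
}
```

Treating the last transition as one-way makes it natural to put the heaviest validation gate in front of it.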

Migration Commands

BASH(31 lines)

8. Monitoring KRaft Controllers

JAVA(42 lines)
YAML(30 lines)

Common Mistakes

Mistake 1: Running Insufficient Controllers

PROPERTIES(12 lines)

Mistake 2: Same node.id Across Nodes

PROPERTIES(15 lines)

Mistake 3: Mixing ZK and KRaft Configurations

PROPERTIES(14 lines)

Mistake 4: Not Formatting Storage Before First Start

BASH(11 lines)

Mistake 5: Different Cluster IDs Across Nodes

BASH(16 lines)

Debug This

Scenario: Controller Not Becoming Active

Symptoms:
  • All controllers show "FOLLOWER" state
  • No active controller in cluster
  • Brokers cannot register
  • Admin operations timeout
Investigation:
BASH(21 lines)
JAVA(35 lines)
Resolution Steps:
  1. Verify network connectivity between all controllers
  2. Ensure all controllers have the same cluster.id
  3. Check that controller.quorum.voters is identical on all nodes
  4. Verify node.id matches the ID in controller.quorum.voters
  5. Check for port conflicts on controller listener port
  6. Review controller logs for specific error messages
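
Resolution steps 3 and 4 are mechanical enough to check in code before a node is started. The following sketch parses the `id@host:port` entries of `controller.quorum.voters` and reports mismatches; `KraftConfigCheck` is a hypothetical helper, not part of Kafka, and the host names and port in the usage example are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

final class KraftConfigCheck {
    /** Parses "1@host:9093,2@host:9093" into the set of voter ids. */
    static Set<Integer> voterIds(String voters) {
        Set<Integer> ids = new LinkedHashSet<>();
        for (String entry : voters.split(",")) {
            String trimmed = entry.trim();
            int at = trimmed.indexOf('@');
            if (at <= 0) {
                throw new IllegalArgumentException("Expected id@host:port, got: " + trimmed);
            }
            ids.add(Integer.parseInt(trimmed.substring(0, at)));
        }
        return ids;
    }

    /** Returns human-readable findings; an empty list means nothing was spotted. */
    static List<String> findings(Properties config) {
        List<String> findings = new ArrayList<>();
        String roles = config.getProperty("process.roles", "");
        String nodeId = config.getProperty("node.id");
        String voters = config.getProperty("controller.quorum.voters");
        if (nodeId == null) {
            findings.add("node.id is missing");
        }
        if (voters == null) {
            findings.add("controller.quorum.voters is missing");
            return findings;
        }
        Set<Integer> ids = voterIds(voters);
        // Step 4: a controller's node.id must appear in the voter list.
        if (roles.contains("controller") && nodeId != null
                && !ids.contains(Integer.parseInt(nodeId.trim()))) {
            findings.add("node.id " + nodeId.trim() + " is not listed in controller.quorum.voters");
        }
        return findings;
    }

    /** Step 3: true when every node reports exactly the same voter string. */
    static boolean votersIdentical(List<String> votersPerNode) {
        return votersPerNode.stream().map(String::trim).distinct().count() <= 1;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("process.roles", "controller");
        config.setProperty("node.id", "7"); // deliberately wrong
        config.setProperty("controller.quorum.voters", "1@c1:9093,2@c2:9093,3@c3:9093");
        System.out.println(findings(config));
    }
}
```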

Exercises

Exercise 1: Local KRaft Cluster

Set up a 3-controller, 3-broker KRaft cluster using Docker Compose:

YAML(56 lines)
Task: Start the cluster, verify all controllers are in the quorum, and create a topic.

Exercise 2: Controller Failover Test

With the cluster from Exercise 1:

  1. Identify the active controller
  2. Stop the active controller container
  3. Observe failover in logs
  4. Verify new leader is elected
  5. Restart the stopped controller
  6. Verify it rejoins as follower

Exercise 3: Quorum Monitoring

Write a Spring Boot application that:

  1. Connects to the KRaft cluster
  2. Periodically checks quorum status
  3. Alerts when:
    • No active controller
    • A voter is lagging
    • Less than 3 voters available
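
As a possible starting point for the alerting rules, the three conditions can be written as a pure function that is easy to unit-test before any Kafka wiring exists. Everything here is hypothetical: the `QuorumAlerts` class, the lag threshold, and the choice to measure lag against the highest log end offset among voters. In a real application the inputs would come from the cluster, for example from the Admin client's metadata quorum description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

final class QuorumAlerts {
    /**
     * @param leaderId        id of the active controller, empty if none is known
     * @param voterEndOffsets voter id -> log end offset reported for that voter
     * @param maxLag          hypothetical threshold, in records, before a voter counts as lagging
     */
    static List<String> check(OptionalInt leaderId, Map<Integer, Long> voterEndOffsets, long maxLag) {
        List<String> alerts = new ArrayList<>();
        if (!leaderId.isPresent()) {
            alerts.add("No active controller");
        }
        if (voterEndOffsets.size() < 3) {
            alerts.add("Fewer than 3 voters available: " + voterEndOffsets.size());
        }
        long head = 0;
        for (long offset : voterEndOffsets.values()) {
            head = Math.max(head, offset);
        }
        for (Map.Entry<Integer, Long> voter : voterEndOffsets.entrySet()) {
            long lag = head - voter.getValue();
            if (lag > maxLag) {
                alerts.add("Voter " + voter.getKey() + " is lagging by " + lag + " records");
            }
        }
        return alerts;
    }

    public static void main(String[] args) {
        Map<Integer, Long> offsets = new LinkedHashMap<>();
        offsets.put(1, 100L);
        offsets.put(2, 100L);
        offsets.put(3, 40L);
        System.out.println(check(OptionalInt.of(1), offsets, 50));
    }
}
```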

Exercise 4: Metadata Inspection

Using the kafka-metadata-shell.sh and kafka-dump-log.sh tools:
  1. Dump the current metadata log
  2. Identify different record types
  3. Find the record for a specific topic
  4. Analyze metadata for partition assignments
BASH(4 lines)

Exercise 5: Migration Planning

Given a ZooKeeper-based cluster with:

  • 5 brokers
  • 3 ZooKeeper nodes
  • 500 topics, 10,000 partitions

Create a detailed migration plan including:

  1. Hardware requirements for KRaft controllers
  2. Migration timeline with rollback points
  3. Validation steps at each phase
  4. Monitoring during migration

Interview Questions

Q1: Why is Kafka moving from ZooKeeper to KRaft?

A: Kafka is moving to KRaft for several compelling reasons:
Operational Simplicity:
  • One distributed system instead of two
  • Single security model, monitoring stack, deployment process
  • Fewer moving parts = fewer failure modes
Scalability:
  • ZooKeeper becomes a bottleneck around 200K partitions (all metadata in memory)
  • KRaft can handle millions of partitions
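  • Metadata changes propagate faster: brokers fetch incremental records from the metadata log instead of relying on per-broker RPCs pushed by the controller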
Faster Recovery:
  • ZK-based controller failover takes seconds (session timeout)
  • KRaft failover takes milliseconds (Raft heartbeat)
  • Brokers recover faster because they catch up by fetching the metadata log from their last known offset
Consistency:
  • ZK mode had inconsistency windows during metadata propagation
  • KRaft provides stronger consistency guarantees
  • Single source of truth in __cluster_metadata topic
Modern Architecture:
  • Built-in consensus protocol designed for Kafka's needs
  • Event-sourced metadata (can replay log to recover)
  • Better support for metadata snapshots and compaction

Q2: How does controller election work in KRaft?

A: KRaft uses Raft consensus for controller election:
Election Trigger:
  • Leader heartbeat timeout (followers don't hear from leader)
  • Initial cluster startup (no leader exists)
Election Process:
  1. Follower increments its term and transitions to candidate
  2. Candidate votes for itself and requests votes from other voters
  3. Each voter grants its vote to the first candidate in the new term (first-come-first-served), provided the candidate's log is at least as up-to-date as its own
  4. Candidate becomes leader when it receives majority of votes
  5. New leader starts sending heartbeats to maintain leadership
Key Properties:
  • Randomized election timeout: Reduces the chance of split votes (candidates start elections at different times)
  • Term numbers: Prevent stale leaders from causing confusion
  • Majority requirement: Ensures only one leader per term
  • Persistent vote: Voters remember who they voted for (survives restarts)
Failover Characteristics:
  • Typical election time: 100-500ms
  • Requires majority of voters (2/3, 3/5, etc.)
  • No split-brain because only one candidate can get majority
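
The voting rules above fit in a small state machine. This is a simplified sketch, not Kafka's implementation: `VoterState` is hypothetical, it keeps `currentTerm` and `votedFor` in memory even though a real voter must persist them, and it compares logs by end offset only, whereas Raft compares the last entry's term first.

```java
final class VoterState {
    private int currentTerm = 0;
    private Integer votedFor = null;   // a real voter persists this across restarts
    private final long lastLogOffset;

    VoterState(long lastLogOffset) {
        this.lastLogOffset = lastLogOffset;
    }

    /** Returns true when the vote is granted to the candidate. */
    boolean handleVoteRequest(int candidateId, int candidateTerm, long candidateLastOffset) {
        if (candidateTerm < currentTerm) {
            return false;                       // stale candidate
        }
        if (candidateTerm > currentTerm) {
            currentTerm = candidateTerm;        // new term: any earlier vote no longer counts
            votedFor = null;
        }
        boolean logUpToDate = candidateLastOffset >= lastLogOffset;
        if (logUpToDate && (votedFor == null || votedFor == candidateId)) {
            votedFor = candidateId;             // at most one vote per term
            return true;
        }
        return false;
    }

    /** Votes needed to win: 2 of 3, 3 of 5, and so on. */
    static int majority(int voters) {
        return voters / 2 + 1;
    }

    /** Controller failures the quorum can survive while still electing a leader. */
    static int toleratedFailures(int voters) {
        return (voters - 1) / 2;
    }

    public static void main(String[] args) {
        VoterState c2 = new VoterState(10);
        System.out.println(c2.handleVoteRequest(1, 1, 10)); // true: first request in term 1
        System.out.println(c2.handleVoteRequest(3, 1, 12)); // false: already voted in term 1
        System.out.println(toleratedFailures(4));           // 1: same as with 3 voters
    }
}
```

`toleratedFailures(3)` and `toleratedFailures(4)` both return 1, which is why a fourth controller adds cost without adding fault tolerance.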

Q3: What happens to clients during a controller failover in KRaft?

A: The impact on clients is minimal in KRaft mode:
Producers:
  • Continue producing normally (producers talk to brokers, not controllers)
  • May see brief retry if producing to partition that needs leader update
  • Typically transparent (retries happen automatically)
Consumers:
  • Continue consuming normally (consumers talk to brokers, not controllers)
  • May see brief pause if fetching from partition needing leader update
  • Offset commits unaffected (goes to __consumer_offsets on brokers)
Admin Operations:
  • Topic creation/deletion temporarily blocked during failover
  • Config changes temporarily blocked
  • Resume automatically once new controller is active
Why Minimal Impact:
  • Clients only interact with brokers, never directly with controllers
  • Brokers cache metadata locally (serve clients from cache)
  • Controller failover is fast (milliseconds)
  • Brokers automatically refresh metadata from new controller

Q4: What's the __cluster_metadata topic and how is it different from regular topics?

A: __cluster_metadata is a special internal topic that stores all cluster metadata in KRaft mode:
Structure:
  • Single partition (partition 0)
  • Replicated across all controller nodes (not regular brokers)
  • Uses Raft consensus for replication (not standard Kafka replication)
  • Not accessible via normal producer/consumer APIs
Contents:
  • Broker registrations and fencing
  • Topic and partition metadata
  • Configuration changes
  • ACLs and quotas
  • Producer ID allocations
  • Feature flags
How It Differs from Regular Topics:
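┌─────────────┬──────────────────┬───────────────────────────┐
│ Aspect      │ Regular Topics   │ __cluster_metadata        │
├─────────────┼──────────────────┼───────────────────────────┤
│ Replication │ ISR-based        │ Raft consensus            │
│ Producers   │ Any client       │ Only active controller    │
│ Consumers   │ Any client       │ Controllers and brokers   │
│ Storage     │ Broker data dirs │ Controller metadata dirs  │
│ Compaction  │ Optional         │ Via snapshots (always on) │
│ Access      │ Public API       │ Internal only             │
└─────────────┴──────────────────┴───────────────────────────┘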
Event Sourcing:
  • All changes are appended as records
  • State can be reconstructed by replaying log
  • Periodic snapshots for faster recovery
  • Similar to event sourcing pattern in applications
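
The snapshot point can be made concrete: loading a snapshot and replaying only the records after it must give the same state as replaying the whole log. A minimal sketch with hypothetical key/value records, where a null value means "delete":

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SnapshotReplay {
    /** Applies records[from..] on top of a copy of base. A null value deletes the key. */
    static Map<String, String> apply(Map<String, String> base, List<String[]> records, int from) {
        Map<String, String> state = new TreeMap<>(base);
        for (int i = from; i < records.size(); i++) {
            String key = records.get(i)[0];
            String value = records.get(i)[1];
            if (value == null) {
                state.remove(key);
            } else {
                state.put(key, value);
            }
        }
        return state;
    }

    public static void main(String[] args) {
        List<String[]> log = new ArrayList<>();
        log.add(new String[] {"topic/orders", "partitions=3"});
        log.add(new String[] {"topic/tmp", "partitions=1"});
        log.add(new String[] {"topic/tmp", null});
        log.add(new String[] {"config/orders", "retention.ms=604800000"});

        Map<String, String> empty = new TreeMap<>();
        Map<String, String> full = apply(empty, log, 0);                   // replay everything
        Map<String, String> snapshot = apply(empty, log.subList(0, 2), 0); // state at offset 2
        Map<String, String> recovered = apply(snapshot, log, 2);           // snapshot + tail
        System.out.println(full.equals(recovered));
    }
}
```

The saving is in the tail: recovery cost depends on the number of records written since the last snapshot, not on the full history of the cluster.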

Q5: How do you choose between combined mode and separate controller/broker roles?

A: The choice depends on cluster size and operational requirements:
Combined Mode (process.roles=broker,controller):

Best for:

  • Development and testing environments
  • Small clusters (3-5 nodes)
  • Resource-constrained deployments
  • Simpler operations

Drawbacks:

  • Controller and broker compete for resources
  • GC pauses on broker affect controller
  • Harder to scale controllers independently
Separate Roles:

Best for:

  • Production environments
  • Large clusters (10+ brokers)
  • High-throughput workloads
  • When controller stability is critical

Benefits:

  • Dedicated resources for controllers
  • Controllers isolated from broker load
  • Can scale brokers without touching controllers
  • Predictable controller performance
Sizing Guidelines:
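┌──────────────┬──────────────────────────┐
│ Cluster Size │ Recommendation           │
├──────────────┼──────────────────────────┤
│ 1-3 nodes    │ Combined mode (dev only) │
│ 3-5 nodes    │ Combined or separate     │
│ 5-10 nodes   │ Separate recommended     │
│ 10+ nodes    │ Separate required        │
└──────────────┴──────────────────────────┘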
Controller Resource Requirements:
  • CPU: Low (metadata operations are lightweight)
  • Memory: 4-8GB (metadata in memory)
  • Disk: SSD recommended (Raft log performance)
  • Network: Low bandwidth, but low latency important

Summary

Key Takeaways

  1. ZooKeeper was Kafka's original coordination service but became a bottleneck and operational burden at scale
  2. KRaft replaces ZooKeeper with a built-in Raft-based consensus protocol, eliminating external dependencies
  3. Controller quorum uses Raft consensus with 3-5 controller nodes for fault tolerance (an odd number is recommended; an even number adds no extra fault tolerance)
  4. __cluster_metadata topic stores all cluster state as an event-sourced log, enabling fast recovery
  5. Migration is production-ready in Kafka 3.6+ with a well-defined path from ZooKeeper
  6. Failover is faster in KRaft (milliseconds vs seconds) because it uses Raft heartbeats instead of ZK sessions
  7. Clients are unaffected by controller failover because they only interact with brokers
  8. Spring Kafka requires no changes for KRaft - just point to broker bootstrap servers

Quick Reference

Essential KRaft Configuration

PROPERTIES(11 lines)

Key Commands

BASH(14 lines)

ZK vs KRaft Quick Comparison

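┌────────────────────────┬────────────────────┬────────────┐
│                        │ ZooKeeper          │ KRaft      │
├────────────────────────┼────────────────────┼────────────┤
│ External dependency    │ Yes (3-5 ZK nodes) │ No         │
│ Max partitions         │ ~200K              │ Millions   │
│ Controller failover    │ 5-30 seconds       │ <1 second  │
│ Metadata consistency   │ Eventual           │ Strong     │
│ Operational complexity │ High               │ Low        │
│ Production ready       │ Yes                │ Yes (3.3+) │
└────────────────────────┴────────────────────┴────────────┘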

Series Navigation

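┌─────────────────────────┬──────────────────────────────┬────────────────────────────┐
│ Previous                │ Current                      │ Next                       │
├─────────────────────────┼──────────────────────────────┼────────────────────────────┤
│ Part 3: Fault Tolerance │ Part 4: Cluster Coordination │ Part 5: Producer Internals │
└─────────────────────────┴──────────────────────────────┴────────────────────────────┘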

Series Overview

  • Part 0: How to Use This Series
  • Parts 1-4: Fundamentals (Architecture, Partitions, Fault Tolerance, Cluster Coordination)
  • Parts 5-7: Producers
  • Parts 8-11: Consumers
  • Parts 12-14: Operations
  • Parts 15-17: Kafka Streams
  • Parts 18-20: Patterns & Practices
  • Part 21: Cheatsheet & Decision Guide