Consumer Groups & Rebalancing

At a Glance

Aspect	Details
Topic	Partition assignment, rebalancing strategies, cooperative vs eager
Complexity	Intermediate-Advanced
Prerequisites	Part 8 (Consumer Internals)
Time	90 minutes
Spring Kafka	Container concurrency, rebalance listeners

What You'll Learn

After completing this article, you will be able to:

Understand how consumer groups coordinate partition assignment
Choose between partition assignment strategies (range, round-robin, sticky, cooperative)
Configure consumers to minimize rebalancing impact
Implement rebalance listeners for stateful consumers
Use static membership to eliminate unnecessary rebalances

Production Story: The Rebalancing Storm

The Incident

Our order processing cluster had 100 consumers processing from a topic with 200 partitions. One Monday morning, after a routine deployment, the entire system became unresponsive. Orders weren't being processed, and our monitoring showed consumers in a constant "Rebalancing" state.

The Investigation

BASH(11 lines)
Code
Loading syntax highlighter...

The logs told the story:

09:00:00 [Consumer-1] Joined group, received assignment: [0,1,2,3,4,5]
09:00:01 [Consumer-2] Joined group, received assignment: [6,7,8,9,10,11]
... (100 consumers joining sequentially)
09:01:30 [Consumer-1] Heartbeat failed, consumer appears to have died
09:01:30 [Consumer-2] Rebalance triggered
09:01:31 [Consumer-1] Actually still alive! Rejoining...
09:01:32 [Consumer-3] Rebalance triggered by Consumer-1 rejoining
... (cascade of rebalances)

The problem visualization:

┌─────────────────────────────────────────────────────────────────────┐
│                    THE REBALANCING STORM                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Initial state: 100 consumers, 200 partitions, EAGER protocol       │
│                                                                     │
│  09:00:00 - Deployment begins (rolling restart)                     │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │ Consumer 1 restarts                                         │    │
│  │    ↓                                                        │    │
│  │ EAGER rebalance: ALL 100 consumers stop processing          │    │
│  │    ↓                                                        │    │
│  │ All 200 partitions revoked (stop-the-world!)                │    │
│  │    ↓                                                        │    │
│  │ Partition reassignment takes 30 seconds                     │    │ 
│  │    ↓                                                        │    │ 
│  │ Consumer 1 gets new assignment                              │    │ 
│  │    ↓                                                        │    │
│  │ Consumer 2 restarts (next in rolling update)                │    │
│  │    ↓                                                        │    │
│  │ ANOTHER rebalance: ALL consumers stop again!                │    │
│  │    ↓                                                        │    │
│  │ ... repeat for all 100 consumers ...                        │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  Result: 100 rebalances × 30 seconds each = 50 minutes of chaos     │
│          Zero messages processed during this time                   │
│                                                                     │
│  Contributing factors:                                              │
│  • Large group size (100 consumers)                                 │
│  • EAGER rebalance protocol (stop-the-world)                        │
│  • Sequential deployment (each restart triggers rebalance)          │
│  • No static membership (consumers get new IDs on restart)          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

The Fix

JAVA(23 lines)
Code
Loading syntax highlighter...

After the fix:

Deployment time reduced from 50 minutes to 5 minutes
Zero stop-the-world rebalances
Continuous message processing during deployment

Mental Model: Consumer Group Coordination

┌─────────────────────────────────────────────────────────────────────────┐
│                    CONSUMER GROUP ARCHITECTURE                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Consumer Group: "order-processors"                                     │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                      GROUP COORDINATOR                            │  │
│  │                   (One broker is coordinator)                     │  │
│  │                                                                   │  │
│  │  Responsibilities:                                                │  │
│  │  • Track group membership                                         │  │
│  │  • Detect consumer failures (heartbeat timeout)                   │  │
│  │  • Manage __consumer_offsets partition                            │  │
│  │  • Trigger rebalances                                             │  │
│  │                                                                   │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                               │                                         │
│                               │ JoinGroup / SyncGroup                   │
│                               ▼                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                       GROUP LEADER                                │  │
│  │                (First consumer to join becomes leader)            │  │
│  │                                                                   │  │
│  │  Responsibilities:                                                │  │
│  │  • Receive member list from coordinator                           │  │
│  │  • Run partition assignment algorithm                             │  │
│  │  • Send assignment back to coordinator                            │  │
│  │                                                                   │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                               │                                         │
│              ┌────────────────┼────────────────┐                        │
│              ▼                ▼                ▼                        │
│  ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐         │
│  │   Consumer 1     │ │   Consumer 2     │ │   Consumer 3     │         │
│  │   (LEADER)       │ │   (MEMBER)       │ │   (MEMBER)       │         │
│  │                  │ │                  │ │                  │         │
│  │  Partitions:     │ │  Partitions:     │ │  Partitions:     │         │
│  │  [0, 1, 2]       │ │  [3, 4, 5]       │ │  [6, 7, 8]       │         │
│  └──────────────────┘ └──────────────────┘ └──────────────────┘         │
│                                                                         │
│  Topic: orders (9 partitions)                                           │
│  ┌───┬───┬───┬───┬───┬───┬───┬───┬───┐                                  │
│  │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │                                  │
│  └─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┘                                  │
│    │   │   │   │   │   │   │   │   │                                    │
│    └───┴───┼───┴───┴───┼───┴───┴───┘                                    │
│            │           │                                                │
│    C1 owns │   C2 owns │   C3 owns                                      │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Rebalance Protocol Flow

EAGER REBALANCE PROTOCOL (Stop-the-World):

Consumer 4 joins the group
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 1: REVOKE ALL                                                 │
│                                                                      │
│  Coordinator → All Consumers: "Prepare for rebalance"                │
│                                                                      │
│  C1: Stop processing, revoke partitions [0,1,2], commit offsets      │
│  C2: Stop processing, revoke partitions [3,4,5], commit offsets      │
│  C3: Stop processing, revoke partitions [6,7,8], commit offsets      │
│                                                                      │
│  ALL PROCESSING STOPPED                                              │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 2: JOIN GROUP                                                 │
│                                                                      │
│  All consumers (including new C4) send JoinGroup request             │
│  Coordinator waits for all members                                   │
│  Leader (C1) receives member list                                    │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 3: SYNC GROUP                                                 │
│                                                                      │
│  Leader computes new assignment:                                     │
│    C1 → [0,1]                                                        │
│    C2 → [2,3]                                                        │
│    C3 → [4,5]                                                        │
│    C4 → [6,7,8]                                                      │
│                                                                      │
│  Coordinator distributes assignments                                 │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 4: RESUME                                                     │
│                                                                      │
│  All consumers receive assignments                                   │
│  Resume processing on NEW partitions                                 │
│                                                                      │
│  PROCESSING RESUMES                                                  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘


COOPERATIVE REBALANCE (Incremental):

Consumer 4 joins the group
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 1: PREPARE                                                    │
│                                                                      │
│  Coordinator → Leader: "C4 joining"                                  │
│  Leader computes which partitions need to move                       │
│                                                                      │
│  EXISTING ASSIGNMENTS CONTINUE PROCESSING                            │
│                                                                      │
│  To move: [6,7,8] from C3 to C4                                      │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 2: FIRST REBALANCE - REVOKE ONLY                              │
│                                                                      │
│  C1: Keeps [0,1,2] - no interruption                                 │
│  C2: Keeps [3,4,5] - no interruption                                 │
│  C3: Revokes only [6,7,8], keeps nothing (will get new ones)         │
│  C4: Gets nothing yet                                                │
│                                                                      │
│  C1, C2 STILL PROCESSING                                             │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 3: SECOND REBALANCE - ASSIGN REVOKED                          │
│                                                                      │
│  C1: Keeps [0,1] - gives up [2]                                      │
│  C2: Keeps [3,4] - gives up [5]                                      │
│  C3: Gets [2,5]                                                      │
│  C4: Gets [6,7,8]                                                    │
│                                                                      │
│  MINIMAL INTERRUPTION - Only moving partitions stop                  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Deep Dive

1. Partition Assignment Strategies

JAVA(72 lines)
Code
Loading syntax highlighter...

Assignment Strategy Comparison

ASSIGNMENT STRATEGY COMPARISON:

┌──────────────────┬─────────────────┬─────────────────┬─────────────────┐
│                  │  Distribution   │   Rebalance     │   Use Case      │
├──────────────────┼─────────────────┼─────────────────┼─────────────────┤
│ Range            │ Per-topic       │ Eager           │ Co-partitioned  │
│                  │ (uneven for     │ (stop-the-      │ topics          │
│                  │ odd partitions) │ world)          │                 │
├──────────────────┼─────────────────┼─────────────────┼─────────────────┤
│ RoundRobin       │ Even across     │ Eager           │ Single topic    │
│                  │ all topics      │                 │ or unrelated    │
│                  │                 │                 │ topics          │
├──────────────────┼─────────────────┼─────────────────┼─────────────────┤
│ Sticky           │ Even, preserves │ Eager           │ Stateful        │
│                  │ assignments     │                 │ consumers       │
├──────────────────┼─────────────────┼─────────────────┼─────────────────┤
│ Cooperative      │ Even, preserves │ Incremental     │ RECOMMENDED     │
│ Sticky           │ assignments     │ (no stop-the-   │ for most        │
│                  │                 │ world)          │ production      │
└──────────────────┴─────────────────┴─────────────────┴─────────────────┘

2. Static Membership

JAVA(25 lines)
Code
Loading syntax highlighter...

Static vs Dynamic Membership

DYNAMIC MEMBERSHIP (Default):

Consumer restarts:
┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  T=0: Consumer-1 running with partitions [0,1,2]                   │
│       member.id = "consumer-uuid-abc123"                           │
│                                                                    │
│  T=1: Consumer-1 restarts (deployment)                             │
│       Session expires → Rebalance triggered                        │
│                                                                    │
│  T=2: Consumer-1 rejoins                                           │
│       NEW member.id = "consumer-uuid-xyz789"                       │
│       Coordinator doesn't recognize it → Full rebalance            │
│                                                                    │
│  Result: Two rebalances (leave + join)                             │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘


STATIC MEMBERSHIP:

Consumer restarts:
┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  T=0: Consumer-1 running with partitions [0,1,2]                   │
│       group.instance.id = "order-processor-host1"                  │
│                                                                    │
│  T=1: Consumer-1 restarts (deployment)                             │
│       Coordinator knows instance.id, waits for session.timeout.ms  │
│                                                                    │
│  T=2: Consumer-1 rejoins (within session timeout)                  │
│       Same group.instance.id = "order-processor-host1"             │
│       Coordinator: "Welcome back! Here are your old partitions"    │
│                                                                    │
│  Result: No rebalance! Same partitions restored                    │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘


When static membership triggers rebalance:
• Consumer doesn't rejoin within session.timeout.ms
• Consumer explicitly leaves (not restart)
• New consumer joins with new group.instance.id

3. Rebalance Listeners

JAVA(80 lines)
Code
Loading syntax highlighter...

4. Spring Kafka Concurrency

JAVA(44 lines)
Code
Loading syntax highlighter...

5. Reducing Rebalance Impact

JAVA(35 lines)
Code
Loading syntax highlighter...

6. Monitoring Rebalances

JAVA(46 lines)
Code
Loading syntax highlighter...

YAML(20 lines)
Code
Loading syntax highlighter...

Common Mistakes

Mistake 1: Not Using Cooperative Rebalancing

JAVA(7 lines)
Code
Loading syntax highlighter...

Mistake 2: Short Session Timeout with Static Membership

JAVA(9 lines)
Code
Loading syntax highlighter...

Mistake 3: Non-Unique Static Member IDs

JAVA(12 lines)
Code
Loading syntax highlighter...

Mistake 4: Ignoring Rebalance Listeners

JAVA(24 lines)
Code
Loading syntax highlighter...

Mistake 5: Too Many Consumers

JAVA(7 lines)
Code
Loading syntax highlighter...

Debug This

Scenario: Constant Rebalancing

Symptoms:

Consumer group never stays stable
Continuous "Preparing rebalance" state
Messages being processed multiple times

Investigation:

BASH(16 lines)
Code
Loading syntax highlighter...

JAVA(16 lines)
Code
Loading syntax highlighter...

Common Causes:

Consumer crashing: Check for OOM, exceptions
Poll timeout: Processing too slow
Mixed strategies: Consumers with different assignors
Network issues: Heartbeat failures
Frequent deployments: Without static membership

Resolution:

JAVA(13 lines)
Code
Loading syntax highlighter...

Exercises

Exercise 1: Compare Assignment Strategies

Create a test that:

Sets up topic with 7 partitions
Uses 3 consumers
Compares Range vs RoundRobin vs Sticky assignment
Documents which partitions each consumer gets

Exercise 2: Measure Rebalance Duration

Build a benchmark that:

Measures time from first revoke to last assign
Compares eager vs cooperative rebalancing
Tests with 10, 50, 100 consumers
Graphs the results

Exercise 3: Implement State Checkpoint

Create a stateful consumer that:

Maintains window aggregations per partition
Checkpoints state on rebalance
Restores state on partition assignment
Handles partition loss gracefully

Exercise 4: Rolling Deployment Simulator

Build a test that:

Simulates 10-consumer group
Performs rolling restart (one at a time)
Measures message processing continuity
Compares with and without static membership

Exercise 5: Rebalance Dashboard

Create a monitoring dashboard showing:

Current partition assignment
Rebalance history (last 24 hours)
Duration of each rebalance
Consumer join/leave events

Interview Questions

Q1: Explain the difference between eager and cooperative rebalancing.

A: The key difference is in how partitions are revoked:

Eager rebalancing (traditional):

All partitions revoked from all consumers at once
"Stop-the-world" event
No processing during rebalance
Simple but disruptive

Cooperative rebalancing (incremental):

Only affected partitions are revoked
Non-affected consumers continue processing
Two-phase: first revoke, then assign
More complex but minimal disruption

Example: When adding one consumer to a group of 3:

Eager: All 3 consumers stop → reassign all partitions → all resume Cooperative: Identify which partitions move → only those stop → assign to new consumer

Cooperative is recommended for production because it minimizes processing interruption.

Q2: What is static membership and when should you use it?

A: Static membership assigns a persistent identity to consumers via group.instance.id:

How it works:

Consumer registers with a stable ID instead of dynamic UUID
Coordinator tracks instance ID across sessions
If consumer restarts with same ID within session timeout, it rejoins without rebalance

When to use:

Kubernetes deployments: Rolling updates cause frequent restarts
Stateful consumers: Need to maintain partition affinity
Large groups: Rebalances are expensive with many consumers
Long session timeout acceptable: Must handle actual failures being detected slowly

Configuration:

JAVA(2 lines)
Code
Loading syntax highlighter...

Trade-off: Slower failure detection (must wait for session timeout) vs. fewer spurious rebalances

Q3: How does the group coordinator work?

A: The group coordinator is a broker responsible for managing a specific consumer group:

Selection:

Group ID hashed to determine __consumer_offsets partition
Leader of that partition becomes coordinator
Formula: hash(group_id) % 50 (50 partitions by default)

Responsibilities:

Accept heartbeats from consumers
Detect consumer failures (heartbeat timeout)
Handle JoinGroup and SyncGroup requests
Trigger rebalances when membership changes
Store committed offsets

Rebalance orchestration:

Coordinator receives trigger (join/leave/timeout)
Notifies all consumers to rejoin
Receives JoinGroup from all members
Sends member list to leader
Receives assignment from leader
Distributes assignment via SyncGroup

Q4: How would you handle a scenario where rebalances are causing duplicate message processing?

A: Duplicates during rebalance happen when:

Consumer processes message
Rebalance revokes partition before commit
New owner reprocesses from last committed offset

Solutions:

Commit more frequently:

JAVA(2 lines)
Code
Loading syntax highlighter...

Commit on rebalance:

JAVA(4 lines)
Code
Loading syntax highlighter...

Idempotent processing:

JAVA(6 lines)
Code
Loading syntax highlighter...

Transactional consumers (for exactly-once):

JAVA(2 lines)
Code
Loading syntax highlighter...

Q5: What factors affect how long a rebalance takes?

A: Several factors contribute to rebalance duration:

Group size:

More consumers = more JoinGroup/SyncGroup messages
O(n) network round trips

Partition count:

More partitions = larger assignment calculation
More data in SyncGroup response

Assignment strategy complexity:

Range: O(partitions)
Sticky: O(partitions × consumers) for optimal assignment

Network latency:

JoinGroup must reach all consumers
SyncGroup response sent to all
Retries if messages lost

Consumer processing:

Slow onPartitionsRevoked callback
State checkpoint/flush delays

Optimization strategies:

Use cooperative rebalancing (reduces scope)
Use static membership (avoids restart rebalances)
Keep rebalance callbacks fast
Right-size the consumer group
Monitor and tune session.timeout.ms

Summary

Key Takeaways

Consumer groups coordinate via a group coordinator broker that manages membership and triggers rebalances
Rebalances are triggered by member joins, leaves, failures, or subscription changes
Cooperative rebalancing (incremental) is far better than eager (stop-the-world) for production
Static membership prevents unnecessary rebalances during rolling deployments
Assignment strategies determine how partitions are distributed - CooperativeStickyAssignor is recommended
Rebalance listeners are essential for stateful consumers to checkpoint state
Concurrency should match partition count - extra consumers are idle
Monitor rebalance frequency - frequent rebalancing indicates configuration or infrastructure issues

Quick Reference

Recommended Configuration

PROPERTIES(10 lines)
Code
Loading syntax highlighter...

Rebalance Triggers

Trigger	Description
New consumer joins	Group membership increased
Consumer leaves	Graceful shutdown
Consumer fails	Heartbeat timeout
Subscription change	Topics added/removed
Partition count change	Topic partitions modified

Assignment Strategies

Strategy	Rebalance	Best For
Range	Eager	Co-partitioned topics
RoundRobin	Eager	Even distribution
Sticky	Eager	Minimize movement
CooperativeStickyAssignor	Cooperative	Production default

Previous	Current	Next
Part 8: Consumer Internals	Part 9: Consumer Groups	Part 10: Offset Management

Series Overview

Part 0: How to Use This Series
Parts 1-4: Fundamentals
Parts 5-7: Producers
Parts 8-11: Consumers (Internals, Groups, Offset Management, Exactly-Once)
Parts 12-14: Operations
Parts 15-17: Kafka Streams
Parts 18-20: Patterns & Practices
Part 21: Cheatsheet & Decision Guide