Devops

Consumer Groups & Rebalancing


At a Glance

AspectDetails
TopicPartition assignment, rebalancing strategies, cooperative vs eager
ComplexityIntermediate-Advanced
PrerequisitesPart 8 (Consumer Internals)
Time90 minutes
Spring KafkaContainer concurrency, rebalance listeners

What You'll Learn

After completing this article, you will be able to:

  1. Understand how consumer groups coordinate partition assignment
  2. Choose between partition assignment strategies (range, round-robin, sticky, cooperative)
  3. Configure consumers to minimize rebalancing impact
  4. Implement rebalance listeners for stateful consumers
  5. Use static membership to eliminate unnecessary rebalances

Production Story: The Rebalancing Storm

The Incident

Our order processing cluster had 100 consumers processing from a topic with 200 partitions. One Monday morning, after a routine deployment, the entire system became unresponsive. Orders weren't being processed, and our monitoring showed consumers in a constant "Rebalancing" state.

The Investigation

BASH(11 lines)
Code
Loading syntax highlighter...

The logs told the story:

09:00:00 [Consumer-1] Joined group, received assignment: [0,1,2,3,4,5]
09:00:01 [Consumer-2] Joined group, received assignment: [6,7,8,9,10,11]
... (100 consumers joining sequentially)
09:01:30 [Consumer-1] Heartbeat failed, consumer appears to have died
09:01:30 [Consumer-2] Rebalance triggered
09:01:31 [Consumer-1] Actually still alive! Rejoining...
09:01:32 [Consumer-3] Rebalance triggered by Consumer-1 rejoining
... (cascade of rebalances)

The problem visualization:

┌─────────────────────────────────────────────────────────────────────┐
│                    THE REBALANCING STORM                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Initial state: 100 consumers, 200 partitions, EAGER protocol       │
│                                                                     │
│  09:00:00 - Deployment begins (rolling restart)                     │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │ Consumer 1 restarts                                         │    │
│  │    ↓                                                        │    │
│  │ EAGER rebalance: ALL 100 consumers stop processing          │    │
│  │    ↓                                                        │    │
│  │ All 200 partitions revoked (stop-the-world!)                │    │
│  │    ↓                                                        │    │
│  │ Partition reassignment takes 30 seconds                     │    │ 
│  │    ↓                                                        │    │ 
│  │ Consumer 1 gets new assignment                              │    │ 
│  │    ↓                                                        │    │
│  │ Consumer 2 restarts (next in rolling update)                │    │
│  │    ↓                                                        │    │
│  │ ANOTHER rebalance: ALL consumers stop again!                │    │
│  │    ↓                                                        │    │
│  │ ... repeat for all 100 consumers ...                        │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  Result: 100 rebalances × 30 seconds each = 50 minutes of chaos     │
│          Zero messages processed during this time                   │
│                                                                     │
│  Contributing factors:                                              │
│  • Large group size (100 consumers)                                 │
│  • EAGER rebalance protocol (stop-the-world)                        │
│  • Sequential deployment (each restart triggers rebalance)          │
│  • No static membership (consumers get new IDs on restart)          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

The Fix

JAVA(23 lines)
Code
Loading syntax highlighter...

After the fix:

  • Deployment time reduced from 50 minutes to 5 minutes
  • Zero stop-the-world rebalances
  • Continuous message processing during deployment

Mental Model: Consumer Group Coordination

┌─────────────────────────────────────────────────────────────────────────┐
│                    CONSUMER GROUP ARCHITECTURE                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Consumer Group: "order-processors"                                     │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                      GROUP COORDINATOR                            │  │
│  │                   (One broker is coordinator)                     │  │
│  │                                                                   │  │
│  │  Responsibilities:                                                │  │
│  │  • Track group membership                                         │  │
│  │  • Detect consumer failures (heartbeat timeout)                   │  │
│  │  • Manage __consumer_offsets partition                            │  │
│  │  • Trigger rebalances                                             │  │
│  │                                                                   │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                               │                                         │
│                               │ JoinGroup / SyncGroup                   │
│                               ▼                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                       GROUP LEADER                                │  │
│  │                (First consumer to join becomes leader)            │  │
│  │                                                                   │  │
│  │  Responsibilities:                                                │  │
│  │  • Receive member list from coordinator                           │  │
│  │  • Run partition assignment algorithm                             │  │
│  │  • Send assignment back to coordinator                            │  │
│  │                                                                   │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                               │                                         │
│              ┌────────────────┼────────────────┐                        │
│              ▼                ▼                ▼                        │
│  ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐         │
│  │   Consumer 1     │ │   Consumer 2     │ │   Consumer 3     │         │
│  │   (LEADER)       │ │   (MEMBER)       │ │   (MEMBER)       │         │
│  │                  │ │                  │ │                  │         │
│  │  Partitions:     │ │  Partitions:     │ │  Partitions:     │         │
│  │  [0, 1, 2]       │ │  [3, 4, 5]       │ │  [6, 7, 8]       │         │
│  └──────────────────┘ └──────────────────┘ └──────────────────┘         │
│                                                                         │
│  Topic: orders (9 partitions)                                           │
│  ┌───┬───┬───┬───┬───┬───┬───┬───┬───┐                                  │
│  │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │                                  │
│  └─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┘                                  │
│    │   │   │   │   │   │   │   │   │                                    │
│    └───┴───┼───┴───┴───┼───┴───┴───┘                                    │
│            │           │                                                │
│    C1 owns │   C2 owns │   C3 owns                                      │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Rebalance Protocol Flow

EAGER REBALANCE PROTOCOL (Stop-the-World):

Consumer 4 joins the group
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 1: REVOKE ALL                                                 │
│                                                                      │
│  Coordinator → All Consumers: "Prepare for rebalance"                │
│                                                                      │
│  C1: Stop processing, revoke partitions [0,1,2], commit offsets      │
│  C2: Stop processing, revoke partitions [3,4,5], commit offsets      │
│  C3: Stop processing, revoke partitions [6,7,8], commit offsets      │
│                                                                      │
│  ALL PROCESSING STOPPED                                              │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 2: JOIN GROUP                                                 │
│                                                                      │
│  All consumers (including new C4) send JoinGroup request             │
│  Coordinator waits for all members                                   │
│  Leader (C1) receives member list                                    │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 3: SYNC GROUP                                                 │
│                                                                      │
│  Leader computes new assignment:                                     │
│    C1 → [0,1]                                                        │
│    C2 → [2,3]                                                        │
│    C3 → [4,5]                                                        │
│    C4 → [6,7,8]                                                      │
│                                                                      │
│  Coordinator distributes assignments                                 │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 4: RESUME                                                     │
│                                                                      │
│  All consumers receive assignments                                   │
│  Resume processing on NEW partitions                                 │
│                                                                      │
│  PROCESSING RESUMES                                                  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘


COOPERATIVE REBALANCE (Incremental):

Consumer 4 joins the group
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 1: PREPARE                                                    │
│                                                                      │
│  Coordinator → Leader: "C4 joining"                                  │
│  Leader computes which partitions need to move                       │
│                                                                      │
│  EXISTING ASSIGNMENTS CONTINUE PROCESSING                            │
│                                                                      │
│  To move: [6,7,8] from C3 to C4                                      │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 2: FIRST REBALANCE - REVOKE ONLY                              │
│                                                                      │
│  C1: Keeps [0,1,2] - no interruption                                 │
│  C2: Keeps [3,4,5] - no interruption                                 │
│  C3: Revokes only [6,7,8], keeps nothing (will get new ones)         │
│  C4: Gets nothing yet                                                │
│                                                                      │
│  C1, C2 STILL PROCESSING                                             │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 3: SECOND REBALANCE - ASSIGN REVOKED                          │
│                                                                      │
│  C1: Keeps [0,1] - gives up [2]                                      │
│  C2: Keeps [3,4] - gives up [5]                                      │
│  C3: Gets [2,5]                                                      │
│  C4: Gets [6,7,8]                                                    │
│                                                                      │
│  MINIMAL INTERRUPTION - Only moving partitions stop                  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Deep Dive

1. Partition Assignment Strategies

JAVA(72 lines)
Code
Loading syntax highlighter...

Assignment Strategy Comparison

ASSIGNMENT STRATEGY COMPARISON:

┌──────────────────┬─────────────────┬─────────────────┬─────────────────┐
│                  │  Distribution   │   Rebalance     │   Use Case      │
├──────────────────┼─────────────────┼─────────────────┼─────────────────┤
│ Range            │ Per-topic       │ Eager           │ Co-partitioned  │
│                  │ (uneven for     │ (stop-the-      │ topics          │
│                  │ odd partitions) │ world)          │                 │
├──────────────────┼─────────────────┼─────────────────┼─────────────────┤
│ RoundRobin       │ Even across     │ Eager           │ Single topic    │
│                  │ all topics      │                 │ or unrelated    │
│                  │                 │                 │ topics          │
├──────────────────┼─────────────────┼─────────────────┼─────────────────┤
│ Sticky           │ Even, preserves │ Eager           │ Stateful        │
│                  │ assignments     │                 │ consumers       │
├──────────────────┼─────────────────┼─────────────────┼─────────────────┤
│ Cooperative      │ Even, preserves │ Incremental     │ RECOMMENDED     │
│ Sticky           │ assignments     │ (no stop-the-   │ for most        │
│                  │                 │ world)          │ production      │
└──────────────────┴─────────────────┴─────────────────┴─────────────────┘

2. Static Membership

JAVA(25 lines)
Code
Loading syntax highlighter...

Static vs Dynamic Membership

DYNAMIC MEMBERSHIP (Default):

Consumer restarts:
┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  T=0: Consumer-1 running with partitions [0,1,2]                   │
│       member.id = "consumer-uuid-abc123"                           │
│                                                                    │
│  T=1: Consumer-1 restarts (deployment)                             │
│       Session expires → Rebalance triggered                        │
│                                                                    │
│  T=2: Consumer-1 rejoins                                           │
│       NEW member.id = "consumer-uuid-xyz789"                       │
│       Coordinator doesn't recognize it → Full rebalance            │
│                                                                    │
│  Result: Two rebalances (leave + join)                             │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘


STATIC MEMBERSHIP:

Consumer restarts:
┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  T=0: Consumer-1 running with partitions [0,1,2]                   │
│       group.instance.id = "order-processor-host1"                  │
│                                                                    │
│  T=1: Consumer-1 restarts (deployment)                             │
│       Coordinator knows instance.id, waits for session.timeout.ms  │
│                                                                    │
│  T=2: Consumer-1 rejoins (within session timeout)                  │
│       Same group.instance.id = "order-processor-host1"             │
│       Coordinator: "Welcome back! Here are your old partitions"    │
│                                                                    │
│  Result: No rebalance! Same partitions restored                    │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘


When static membership triggers rebalance:
• Consumer doesn't rejoin within session.timeout.ms
• Consumer explicitly leaves (not restart)
• New consumer joins with new group.instance.id

3. Rebalance Listeners

JAVA(80 lines)
Code
Loading syntax highlighter...

4. Spring Kafka Concurrency

JAVA(44 lines)
Code
Loading syntax highlighter...

5. Reducing Rebalance Impact

JAVA(35 lines)
Code
Loading syntax highlighter...

6. Monitoring Rebalances

JAVA(46 lines)
Code
Loading syntax highlighter...
YAML(20 lines)
Code
Loading syntax highlighter...

Common Mistakes

Mistake 1: Not Using Cooperative Rebalancing

JAVA(7 lines)
Code
Loading syntax highlighter...

Mistake 2: Short Session Timeout with Static Membership

JAVA(9 lines)
Code
Loading syntax highlighter...

Mistake 3: Non-Unique Static Member IDs

JAVA(12 lines)
Code
Loading syntax highlighter...

Mistake 4: Ignoring Rebalance Listeners

JAVA(24 lines)
Code
Loading syntax highlighter...

Mistake 5: Too Many Consumers

JAVA(7 lines)
Code
Loading syntax highlighter...

Debug This

Scenario: Constant Rebalancing

Symptoms:
  • Consumer group never stays stable
  • Continuous "Preparing rebalance" state
  • Messages being processed multiple times
Investigation:
BASH(16 lines)
Code
Loading syntax highlighter...
JAVA(16 lines)
Code
Loading syntax highlighter...
Common Causes:
  1. Consumer crashing: Check for OOM, exceptions
  2. Poll timeout: Processing too slow
  3. Mixed strategies: Consumers with different assignors
  4. Network issues: Heartbeat failures
  5. Frequent deployments: Without static membership
Resolution:
JAVA(13 lines)
Code
Loading syntax highlighter...

Exercises

Exercise 1: Compare Assignment Strategies

Create a test that:

  1. Sets up topic with 7 partitions
  2. Uses 3 consumers
  3. Compares Range vs RoundRobin vs Sticky assignment
  4. Documents which partitions each consumer gets

Exercise 2: Measure Rebalance Duration

Build a benchmark that:

  1. Measures time from first revoke to last assign
  2. Compares eager vs cooperative rebalancing
  3. Tests with 10, 50, 100 consumers
  4. Graphs the results

Exercise 3: Implement State Checkpoint

Create a stateful consumer that:

  1. Maintains window aggregations per partition
  2. Checkpoints state on rebalance
  3. Restores state on partition assignment
  4. Handles partition loss gracefully

Exercise 4: Rolling Deployment Simulator

Build a test that:

  1. Simulates 10-consumer group
  2. Performs rolling restart (one at a time)
  3. Measures message processing continuity
  4. Compares with and without static membership

Exercise 5: Rebalance Dashboard

Create a monitoring dashboard showing:

  1. Current partition assignment
  2. Rebalance history (last 24 hours)
  3. Duration of each rebalance
  4. Consumer join/leave events

Interview Questions

Q1: Explain the difference between eager and cooperative rebalancing.

A: The key difference is in how partitions are revoked:
Eager rebalancing (traditional):
  • All partitions revoked from all consumers at once
  • "Stop-the-world" event
  • No processing during rebalance
  • Simple but disruptive
Cooperative rebalancing (incremental):
  • Only affected partitions are revoked
  • Non-affected consumers continue processing
  • Two-phase: first revoke, then assign
  • More complex but minimal disruption
Example: When adding one consumer to a group of 3:

Eager: All 3 consumers stop → reassign all partitions → all resume Cooperative: Identify which partitions move → only those stop → assign to new consumer

Cooperative is recommended for production because it minimizes processing interruption.

Q2: What is static membership and when should you use it?

A: Static membership assigns a persistent identity to consumers via group.instance.id:
How it works:
  • Consumer registers with a stable ID instead of dynamic UUID
  • Coordinator tracks instance ID across sessions
  • If consumer restarts with same ID within session timeout, it rejoins without rebalance
When to use:
  1. Kubernetes deployments: Rolling updates cause frequent restarts
  2. Stateful consumers: Need to maintain partition affinity
  3. Large groups: Rebalances are expensive with many consumers
  4. Long session timeout acceptable: Must handle actual failures being detected slowly
Configuration:
JAVA(2 lines)
Code
Loading syntax highlighter...
Trade-off: Slower failure detection (must wait for session timeout) vs. fewer spurious rebalances

Q3: How does the group coordinator work?

A: The group coordinator is a broker responsible for managing a specific consumer group:
Selection:
  • Group ID hashed to determine __consumer_offsets partition
  • Leader of that partition becomes coordinator
  • Formula: hash(group_id) % 50 (50 partitions by default)
Responsibilities:
  1. Accept heartbeats from consumers
  2. Detect consumer failures (heartbeat timeout)
  3. Handle JoinGroup and SyncGroup requests
  4. Trigger rebalances when membership changes
  5. Store committed offsets
Rebalance orchestration:
  1. Coordinator receives trigger (join/leave/timeout)
  2. Notifies all consumers to rejoin
  3. Receives JoinGroup from all members
  4. Sends member list to leader
  5. Receives assignment from leader
  6. Distributes assignment via SyncGroup

Q4: How would you handle a scenario where rebalances are causing duplicate message processing?

A: Duplicates during rebalance happen when:
  1. Consumer processes message
  2. Rebalance revokes partition before commit
  3. New owner reprocesses from last committed offset
Solutions:
  1. Commit more frequently:
JAVA(2 lines)
Code
Loading syntax highlighter...
  1. Commit on rebalance:
JAVA(4 lines)
Code
Loading syntax highlighter...
  1. Idempotent processing:
JAVA(6 lines)
Code
Loading syntax highlighter...
  1. Transactional consumers (for exactly-once):
JAVA(2 lines)
Code
Loading syntax highlighter...

Q5: What factors affect how long a rebalance takes?

A: Several factors contribute to rebalance duration:
Group size:
  • More consumers = more JoinGroup/SyncGroup messages
  • O(n) network round trips
Partition count:
  • More partitions = larger assignment calculation
  • More data in SyncGroup response
Assignment strategy complexity:
  • Range: O(partitions)
  • Sticky: O(partitions × consumers) for optimal assignment
Network latency:
  • JoinGroup must reach all consumers
  • SyncGroup response sent to all
  • Retries if messages lost
Consumer processing:
  • Slow onPartitionsRevoked callback
  • State checkpoint/flush delays
Optimization strategies:
  1. Use cooperative rebalancing (reduces scope)
  2. Use static membership (avoids restart rebalances)
  3. Keep rebalance callbacks fast
  4. Right-size the consumer group
  5. Monitor and tune session.timeout.ms

Summary

Key Takeaways

  1. Consumer groups coordinate via a group coordinator broker that manages membership and triggers rebalances
  2. Rebalances are triggered by member joins, leaves, failures, or subscription changes
  3. Cooperative rebalancing (incremental) is far better than eager (stop-the-world) for production
  4. Static membership prevents unnecessary rebalances during rolling deployments
  5. Assignment strategies determine how partitions are distributed - CooperativeStickyAssignor is recommended
  6. Rebalance listeners are essential for stateful consumers to checkpoint state
  7. Concurrency should match partition count - extra consumers are idle
  8. Monitor rebalance frequency - frequent rebalancing indicates configuration or infrastructure issues

Quick Reference

PROPERTIES(10 lines)
Code
Loading syntax highlighter...

Rebalance Triggers

TriggerDescription
New consumer joinsGroup membership increased
Consumer leavesGraceful shutdown
Consumer failsHeartbeat timeout
Subscription changeTopics added/removed
Partition count changeTopic partitions modified

Assignment Strategies

StrategyRebalanceBest For
RangeEagerCo-partitioned topics
RoundRobinEagerEven distribution
StickyEagerMinimize movement
CooperativeStickyAssignorCooperativeProduction default

Series Navigation

PreviousCurrentNext
Part 8: Consumer InternalsPart 9: Consumer GroupsPart 10: Offset Management

Series Overview

  • Part 0: How to Use This Series
  • Parts 1-4: Fundamentals
  • Parts 5-7: Producers
  • Parts 8-11: Consumers (Internals, Groups, Offset Management, Exactly-Once)
  • Parts 12-14: Operations
  • Parts 15-17: Kafka Streams
  • Parts 18-20: Patterns & Practices
  • Part 21: Cheatsheet & Decision Guide