Consumer Groups & Rebalancing
At a Glance
| Aspect | Details |
|---|---|
| Topic | Partition assignment, rebalancing strategies, cooperative vs eager |
| Complexity | Intermediate-Advanced |
| Prerequisites | Part 8 (Consumer Internals) |
| Time | 90 minutes |
| Spring Kafka | Container concurrency, rebalance listeners |
What You'll Learn
After completing this article, you will be able to:
- Understand how consumer groups coordinate partition assignment
- Choose between partition assignment strategies (range, round-robin, sticky, cooperative)
- Configure consumers to minimize rebalancing impact
- Implement rebalance listeners for stateful consumers
- Use static membership to eliminate unnecessary rebalances
Production Story: The Rebalancing Storm
The Incident
Our order processing cluster had 100 consumers reading from a topic with 200 partitions. One Monday morning, after a routine deployment, the entire system became unresponsive. Orders weren't being processed, and our monitoring showed consumers stuck in a constant "Rebalancing" state.
The Investigation
[Code listing omitted: Bash, 11 lines]
The logs told the story:
09:00:00 [Consumer-1] Joined group, received assignment: [0,1,2,3,4,5]
09:00:01 [Consumer-2] Joined group, received assignment: [6,7,8,9,10,11]
... (100 consumers joining sequentially)
09:01:30 [Consumer-1] Heartbeat failed, consumer appears to have died
09:01:30 [Consumer-2] Rebalance triggered
09:01:31 [Consumer-1] Actually still alive! Rejoining...
09:01:32 [Consumer-3] Rebalance triggered by Consumer-1 rejoining
... (cascade of rebalances)
The problem visualization:
THE REBALANCING STORM

Initial state: 100 consumers, 200 partitions, EAGER protocol

09:00:00 - Deployment begins (rolling restart)

    Consumer 1 restarts
        ↓
    EAGER rebalance: ALL 100 consumers stop processing
        ↓
    All 200 partitions revoked (stop-the-world!)
        ↓
    Partition reassignment takes 30 seconds
        ↓
    Consumer 1 gets new assignment
        ↓
    Consumer 2 restarts (next in rolling update)
        ↓
    ANOTHER rebalance: ALL consumers stop again!
        ↓
    ... repeat for all 100 consumers ...

Result: 100 rebalances × 30 seconds each = 50 minutes of chaos
        Zero messages processed during this time

Contributing factors:
  • Large group size (100 consumers)
  • EAGER rebalance protocol (stop-the-world)
  • Sequential deployment (each restart triggers rebalance)
  • No static membership (consumers get new IDs on restart)
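The 50-minute figure is plain multiplication. As a quick sanity check, here is a minimal sketch that assumes, as the incident summary does, exactly one eager rebalance per restarted consumer; the class and method names are made up for illustration.

```java
import java.time.Duration;

final class RollingRestartDowntime {
    /** Total stop-the-world time if each restart triggers exactly one eager rebalance. */
    static Duration eagerDowntime(int consumers, Duration perRebalance) {
        return perRebalance.multipliedBy(consumers);
    }

    public static void main(String[] args) {
        // Numbers taken from the incident: 100 consumers, 30 seconds per rebalance.
        Duration total = eagerDowntime(100, Duration.ofSeconds(30));
        System.out.println(total.toMinutes() + " minutes"); // prints "50 minutes"
    }
}
```

Because the cost grows linearly with group size, the same deployment on a 10-consumer group would have stalled for about 5 minutes under the same assumptions.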
The Fix
[Code listing omitted: Java, 23 lines]
After the fix:
- Deployment time reduced from 50 minutes to 5 minutes
- Zero stop-the-world rebalances
- Continuous message processing during deployment
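As a rough illustration of a configuration that targets two of the contributing factors named in the incident (the eager protocol and the lack of static membership), the sketch below builds consumer properties with the standard library only. The property keys and the assignor class name are standard Kafka consumer settings; the helper name, the group ID reuse, and the 60-second session timeout are assumptions for illustration, not the values used in the actual fix.

```java
import java.util.Properties;

final class StormFixConfig {
    /** Builds consumer properties; instanceId must be unique per consumer (hypothetical helper). */
    static Properties consumerProps(String instanceId) {
        Properties p = new Properties();
        p.setProperty("group.id", "order-processors");
        // Incremental rebalancing instead of stop-the-world.
        p.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        // Static membership: a stable identity that survives restarts.
        p.setProperty("group.instance.id", instanceId);
        // Assumed value: must be longer than a typical restart.
        p.setProperty("session.timeout.ms", "60000");
        return p;
    }

    public static void main(String[] args) {
        consumerProps("order-processor-host1").forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Kafka's upgrade notes describe moving a live group from an eager assignor to the cooperative one as a two-step rolling bounce, so it is worth checking them before changing the assignor in place.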
Mental Model: Consumer Group Coordination
CONSUMER GROUP ARCHITECTURE

Consumer Group: "order-processors"

GROUP COORDINATOR (one broker is the coordinator)
    Responsibilities:
    • Track group membership
    • Detect consumer failures (heartbeat timeout)
    • Manage __consumer_offsets partition
    • Trigger rebalances
        │
        │ JoinGroup / SyncGroup
        ▼
GROUP LEADER (first consumer to join becomes leader)
    Responsibilities:
    • Receive member list from coordinator
    • Run partition assignment algorithm
    • Send assignment back to coordinator
        │
        ▼
    Consumer 1 (LEADER)    Partitions: [0, 1, 2]
    Consumer 2 (MEMBER)    Partitions: [3, 4, 5]
    Consumer 3 (MEMBER)    Partitions: [6, 7, 8]

Topic: orders (9 partitions)
    Partitions 0, 1, 2  →  C1 owns
    Partitions 3, 4, 5  →  C2 owns
    Partitions 6, 7, 8  →  C3 owns
Rebalance Protocol Flow
EAGER REBALANCE PROTOCOL (Stop-the-World):

Consumer 4 joins the group
    │
    ▼
PHASE 1: REVOKE ALL
    Coordinator → All Consumers: "Prepare for rebalance"
    C1: Stop processing, revoke partitions [0,1,2], commit offsets
    C2: Stop processing, revoke partitions [3,4,5], commit offsets
    C3: Stop processing, revoke partitions [6,7,8], commit offsets
    ALL PROCESSING STOPPED
    │
    ▼
PHASE 2: JOIN GROUP
    All consumers (including new C4) send JoinGroup request
    Coordinator waits for all members
    Leader (C1) receives member list
    │
    ▼
PHASE 3: SYNC GROUP
    Leader computes new assignment:
        C1 → [0,1]
        C2 → [2,3]
        C3 → [4,5]
        C4 → [6,7,8]
    Coordinator distributes assignments
    │
    ▼
PHASE 4: RESUME
    All consumers receive assignments
    Resume processing on NEW partitions
    PROCESSING RESUMES

COOPERATIVE REBALANCE (Incremental):

Consumer 4 joins the group
    │
    ▼
PHASE 1: PREPARE
    Coordinator → Leader: "C4 joining"
    Leader computes which partitions need to move
    EXISTING ASSIGNMENTS CONTINUE PROCESSING
    To move: [5] from C2 and [8] from C3, both to C4
    │
    ▼
PHASE 2: FIRST REBALANCE - REVOKE ONLY
    C1: Keeps [0,1,2] - no interruption
    C2: Keeps [3,4], revokes only [5]
    C3: Keeps [6,7], revokes only [8]
    C4: Gets nothing yet
    C1, C2, C3 STILL PROCESSING THEIR RETAINED PARTITIONS
    │
    ▼
PHASE 3: SECOND REBALANCE - ASSIGN REVOKED
    C1: Keeps [0,1,2]
    C2: Keeps [3,4]
    C3: Keeps [6,7]
    C4: Gets [5,8]
    MINIMAL INTERRUPTION - Only moving partitions stop
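To make the difference in scope concrete, the sketch below counts how many partitions stop under each protocol when C4 joins a 3-consumer, 9-partition group. It is a toy model, not the Kafka assignor: the "after" assignment is an assumed sticky outcome in which C4 takes one partition each from C2 and C3, and all class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class RevocationScope {
    /** Assignment before C4 joins. */
    static Map<String, List<Integer>> before() {
        return Map.of("C1", List.of(0, 1, 2), "C2", List.of(3, 4, 5), "C3", List.of(6, 7, 8));
    }

    /** Assumed sticky target assignment after C4 joins. */
    static Map<String, List<Integer>> after() {
        return Map.of("C1", List.of(0, 1, 2), "C2", List.of(3, 4),
                      "C3", List.of(6, 7), "C4", List.of(5, 8));
    }

    /** Eager protocol: every owned partition is revoked. */
    static int eagerRevoked(Map<String, List<Integer>> before) {
        return before.values().stream().mapToInt(List::size).sum();
    }

    /** Cooperative protocol: only partitions whose owner changes are revoked. */
    static Set<Integer> cooperativeRevoked(Map<String, List<Integer>> before,
                                           Map<String, List<Integer>> after) {
        Map<Integer, String> newOwner = new HashMap<>();
        after.forEach((consumer, partitions) -> partitions.forEach(p -> newOwner.put(p, consumer)));
        Set<Integer> revoked = new TreeSet<>();
        before.forEach((consumer, partitions) -> partitions.forEach(p -> {
            if (!consumer.equals(newOwner.get(p))) {
                revoked.add(p);
            }
        }));
        return revoked;
    }

    public static void main(String[] args) {
        System.out.println("eager revokes " + eagerRevoked(before()) + " partitions");
        System.out.println("cooperative revokes " + cooperativeRevoked(before(), after()));
    }
}
```

Under these assumptions the eager protocol pauses all 9 partitions, while the cooperative protocol pauses only partitions 5 and 8.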
Deep Dive
1. Partition Assignment Strategies
[Code listing omitted: Java, 72 lines]
Assignment Strategy Comparison
| Strategy | Distribution | Rebalance | Use Case |
|---|---|---|---|
| Range | Per-topic (uneven when partitions don't divide evenly among consumers) | Eager (stop-the-world) | Co-partitioned topics |
| RoundRobin | Even across all topics | Eager | Single topic or unrelated topics |
| Sticky | Even, preserves assignments | Eager | Stateful consumers |
| Cooperative Sticky | Even, preserves assignments | Incremental (no stop-the-world) | RECOMMENDED for most production |
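The distribution column is easier to see with numbers. The sketch below reimplements the core idea of range and round-robin assignment for a single topic; it is a simplified illustration with hypothetical names, not the code of Kafka's RangeAssignor or RoundRobinAssignor, and it identifies consumers by index rather than by member ID.

```java
import java.util.ArrayList;
import java.util.List;

final class AssignmentSketch {
    /** Range: contiguous blocks; the first (partitions % consumers) consumers get one extra. */
    static List<List<Integer>> range(int partitions, int consumers) {
        List<List<Integer>> out = new ArrayList<>();
        int base = partitions / consumers;
        int extra = partitions % consumers;
        int next = 0;
        for (int c = 0; c < consumers; c++) {
            int count = base + (c < extra ? 1 : 0);
            List<Integer> mine = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                mine.add(next++);
            }
            out.add(mine);
        }
        return out;
    }

    /** Round-robin: partitions dealt out one at a time, like cards. */
    static List<List<Integer>> roundRobin(int partitions, int consumers) {
        List<List<Integer>> out = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            out.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            out.get(p % consumers).add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("range:      " + range(7, 3));      // [[0, 1, 2], [3, 4], [5, 6]]
        System.out.println("roundRobin: " + roundRobin(7, 3)); // [[0, 3, 6], [1, 4], [2, 5]]
    }
}
```

With one topic the two strategies are equally balanced. The skew of range assignment shows up with several topics, because the same leading consumers receive the extra partition of every topic.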
2. Static Membership
[Code listing omitted: Java, 25 lines]
Static vs Dynamic Membership
DYNAMIC MEMBERSHIP (Default):

Consumer restarts:
    T=0: Consumer-1 running with partitions [0,1,2]
         member.id = "consumer-uuid-abc123"

    T=1: Consumer-1 restarts (deployment)
         Session expires → Rebalance triggered

    T=2: Consumer-1 rejoins
         NEW member.id = "consumer-uuid-xyz789"
         Coordinator doesn't recognize it → Full rebalance

    Result: Two rebalances (leave + join)

STATIC MEMBERSHIP:

Consumer restarts:
    T=0: Consumer-1 running with partitions [0,1,2]
         group.instance.id = "order-processor-host1"

    T=1: Consumer-1 restarts (deployment)
         Coordinator knows instance.id, waits for session.timeout.ms

    T=2: Consumer-1 rejoins (within session timeout)
         Same group.instance.id = "order-processor-host1"
         Coordinator: "Welcome back! Here are your old partitions"

    Result: No rebalance! Same partitions restored

When static membership triggers rebalance:
    • Consumer doesn't rejoin within session.timeout.ms
    • Consumer explicitly leaves (not restart)
    • New consumer joins with new group.instance.id
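The two timelines can be condensed into one decision rule. The sketch below is a deliberately simplified model of the coordinator's behaviour as described above, not broker code; the class and method names are hypothetical, and it assumes a restart that exceeds the session timeout costs the same two rebalances as a dynamic restart.

```java
import java.time.Duration;

final class MembershipModel {
    /** Number of rebalances caused by one consumer restart, under the simplified model above. */
    static int rebalancesOnRestart(boolean staticMember, Duration downtime, Duration sessionTimeout) {
        if (!staticMember) {
            return 2; // leave + rejoin under a new member.id
        }
        // Static member: the coordinator keeps the slot until the session times out.
        return downtime.compareTo(sessionTimeout) <= 0 ? 0 : 2;
    }

    public static void main(String[] args) {
        Duration restart = Duration.ofSeconds(20);
        Duration session = Duration.ofSeconds(60);
        System.out.println("dynamic: " + rebalancesOnRestart(false, restart, session)); // 2
        System.out.println("static:  " + rebalancesOnRestart(true, restart, session));  // 0
    }
}
```

The model also shows the trade-off: the longer the session timeout, the more restarts are absorbed, but a consumer that really has died holds its partitions for that long before anyone takes over.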
3. Rebalance Listeners
[Code listing omitted: Java, 80 lines]
4. Spring Kafka Concurrency
[Code listing omitted: Java, 44 lines]
5. Reducing Rebalance Impact
[Code listing omitted: Java, 35 lines]
6. Monitoring Rebalances
[Code listing omitted: Java, 46 lines]
[Code listing omitted: YAML, 20 lines]
Common Mistakes
Mistake 1: Not Using Cooperative Rebalancing
[Code listing omitted: Java, 7 lines]
Mistake 2: Short Session Timeout with Static Membership
[Code listing omitted: Java, 9 lines]
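One way to catch this mistake early is a startup check on the consumer properties. The sketch below is a hypothetical helper, not part of any Kafka API: it warns when static membership is enabled but the session timeout is shorter than the time a restart is expected to take, and when the heartbeat interval exceeds the commonly recommended one third of the session timeout. The fallback values of 45000 ms and 3000 ms are assumed to match recent client defaults; older clients used a shorter session timeout.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class SessionTimeoutCheck {
    /** Returns human-readable warnings; an empty list means no issue was detected. */
    static List<String> warnings(Properties props, Duration expectedRestart) {
        List<String> out = new ArrayList<>();
        // Fallbacks are assumed client defaults, used only when the key is absent.
        long sessionMs = Long.parseLong(props.getProperty("session.timeout.ms", "45000"));
        long heartbeatMs = Long.parseLong(props.getProperty("heartbeat.interval.ms", "3000"));
        boolean staticMember = props.getProperty("group.instance.id") != null;

        if (staticMember && sessionMs < expectedRestart.toMillis()) {
            out.add("session.timeout.ms (" + sessionMs
                    + ") is shorter than the expected restart time; restarts will still rebalance");
        }
        if (heartbeatMs * 3 > sessionMs) {
            out.add("heartbeat.interval.ms should be at most one third of session.timeout.ms");
        }
        return out;
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("group.instance.id", "order-processor-host1");
        p.setProperty("session.timeout.ms", "10000");
        warnings(p, Duration.ofSeconds(30)).forEach(System.out::println);
    }
}
```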
Mistake 3: Non-Unique Static Member IDs
[Code listing omitted: Java, 12 lines]
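Every consumer instance in a group needs its own group.instance.id, including each thread when one process runs several consumers. The sketch below shows one possible naming scheme, host name plus consumer index; the prefix, the scheme, and the helper name are assumptions for illustration, and any scheme works as long as the IDs are unique and stable across restarts.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

final class StaticInstanceIds {
    /** One stable ID per consumer thread on a host, e.g. "order-processor-host1-0". */
    static List<String> instanceIds(String host, int concurrency) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < concurrency; i++) {
            ids.add("order-processor-" + host + "-" + i);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> ids = instanceIds("host1", 3);
        System.out.println(ids);
        // A duplicate would shrink the set; equal sizes mean every ID is unique.
        System.out.println("unique: " + (new HashSet<>(ids).size() == ids.size()));
    }
}
```

Random suffixes such as UUIDs defeat the purpose: the ID would be unique but would change on every restart, which turns the consumer back into a dynamic member in practice.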
Mistake 4: Ignoring Rebalance Listeners
[Code listing omitted: Java, 24 lines]
Mistake 5: Too Many Consumers
[Code listing omitted: Java, 7 lines]
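The arithmetic behind this mistake is simple: within a group, a partition is owned by at most one consumer, so any consumers beyond the partition count have nothing to read. A minimal sketch, assuming the group subscribes to a single topic (names are hypothetical):

```java
final class IdleConsumers {
    /** Consumers left without a partition when the group reads one topic. */
    static int idle(int consumers, int partitions) {
        return Math.max(0, consumers - partitions);
    }

    public static void main(String[] args) {
        System.out.println(idle(12, 9)); // prints 3
        System.out.println(idle(3, 9));  // prints 0
    }
}
```

Idle consumers still take part in every rebalance, so they add coordination cost without adding throughput.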
Debug This
Scenario: Constant Rebalancing
- Consumer group never stays stable
- Continuous "Preparing rebalance" state
- Messages being processed multiple times
[Code listing omitted: Bash, 16 lines]
[Code listing omitted: Java, 16 lines]
Likely causes:
- Consumer crashing: Check for OOM errors and uncaught exceptions
- Poll timeout: Processing a batch takes longer than max.poll.interval.ms
- Mixed strategies: Consumers in the same group configured with different assignors
- Network issues: Heartbeat failures
- Frequent deployments: Restarts without static membership
[Code listing omitted: Java, 13 lines]
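The poll-timeout cause can be checked on paper before touching any configuration. The sketch below estimates whether one full batch fits inside max.poll.interval.ms; the helper is hypothetical, and the numbers in main use the client defaults of 500 records per poll and a 5-minute poll interval together with an assumed per-record processing time.

```java
import java.time.Duration;

final class PollBudget {
    /** True if processing a full batch would take longer than max.poll.interval.ms. */
    static boolean exceedsPollInterval(int maxPollRecords, Duration perRecord, Duration maxPollInterval) {
        return perRecord.multipliedBy(maxPollRecords).compareTo(maxPollInterval) > 0;
    }

    public static void main(String[] args) {
        Duration maxPollInterval = Duration.ofMinutes(5);
        // 500 records at 1 s each = 500 s, which exceeds 300 s: the consumer is kicked out.
        System.out.println(exceedsPollInterval(500, Duration.ofSeconds(1), maxPollInterval));  // true
        // 500 records at 100 ms each = 50 s: comfortably inside the budget.
        System.out.println(exceedsPollInterval(500, Duration.ofMillis(100), maxPollInterval)); // false
    }
}
```

When the check fails, the usual levers are a smaller max.poll.records, a larger max.poll.interval.ms, or faster per-record processing.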
Exercises
Exercise 1: Compare Assignment Strategies
Create a test that:
- Sets up topic with 7 partitions
- Uses 3 consumers
- Compares Range vs RoundRobin vs Sticky assignment
- Documents which partitions each consumer gets
Exercise 2: Measure Rebalance Duration
Build a benchmark that:
- Measures time from first revoke to last assign
- Compares eager vs cooperative rebalancing
- Tests with 10, 50, 100 consumers
- Graphs the results
Exercise 3: Implement State Checkpoint
Create a stateful consumer that:
- Maintains window aggregations per partition
- Checkpoints state on rebalance
- Restores state on partition assignment
- Handles partition loss gracefully
Exercise 4: Rolling Deployment Simulator
Build a test that:
- Simulates 10-consumer group
- Performs rolling restart (one at a time)
- Measures message processing continuity
- Compares with and without static membership
Exercise 5: Rebalance Dashboard
Create a monitoring dashboard showing:
- Current partition assignment
- Rebalance history (last 24 hours)
- Duration of each rebalance
- Consumer join/leave events
Interview Questions
Q1: Explain the difference between eager and cooperative rebalancing.
Eager rebalancing:
- All partitions revoked from all consumers at once
- "Stop-the-world" event
- No processing during rebalance
- Simple but disruptive
Cooperative rebalancing:
- Only affected partitions are revoked
- Consumers keep processing the partitions they retain
- Two-phase: first revoke, then assign
- More complex but minimal disruption
Example (a new consumer joins a group of 3):
- Eager: All 3 consumers stop → reassign all partitions → all resume
- Cooperative: Identify which partitions move → only those stop → assign to new consumer
Cooperative is recommended for production because it minimizes processing interruption.
Q2: What is static membership and when should you use it?
Static membership gives each consumer a persistent identity via group.instance.id:
- Consumer registers with a stable ID instead of a dynamic UUID
- Coordinator tracks the instance ID across sessions
- If the consumer restarts with the same ID within the session timeout, it rejoins without a rebalance
When to use it:
- Kubernetes deployments: Rolling updates cause frequent restarts
- Stateful consumers: Need to maintain partition affinity
- Large groups: Rebalances are expensive with many consumers
- A long session timeout is acceptable: real failures will be detected more slowly
[Code listing omitted: Java, 2 lines]
Q3: How does the group coordinator work?
Coordinator selection:
- The group ID is hashed to determine a __consumer_offsets partition
- The broker that leads that partition becomes the coordinator
- Formula: hash(group_id) % 50 (50 partitions by default)
Responsibilities:
- Accept heartbeats from consumers
- Detect consumer failures (heartbeat timeout)
- Handle JoinGroup and SyncGroup requests
- Trigger rebalances when membership changes
- Store committed offsets
Rebalance flow:
- Coordinator receives trigger (join/leave/timeout)
- Notifies all consumers to rejoin
- Receives JoinGroup from all members
- Sends member list to leader
- Receives assignment from leader
- Distributes assignment via SyncGroup
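The coordinator lookup described above can be reproduced in a few lines. This sketch assumes the hash is Java's String.hashCode made non-negative, which is my understanding of how the broker does it; treat that detail, and the helper name, as assumptions rather than a specification. The partition count of 50 is the default mentioned above.

```java
final class CoordinatorLookup {
    /** Maps a group ID to a __consumer_offsets partition (assumed hashing scheme). */
    static int coordinatorPartition(String groupId, int offsetsTopicPartitions) {
        int hash = groupId.hashCode();
        // Math.abs(Integer.MIN_VALUE) is still negative, so handle it explicitly.
        int nonNegative = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return nonNegative % offsetsTopicPartitions;
    }

    public static void main(String[] args) {
        int partition = coordinatorPartition("order-processors", 50);
        System.out.println("__consumer_offsets partition: " + partition);
    }
}
```

One practical consequence: every group whose ID lands on the same partition shares a coordinator broker, so a handful of very busy groups can concentrate coordination load on one broker.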
Q4: How would you handle a scenario where rebalances are causing duplicate message processing?
Cause:
- Consumer processes a message
- Rebalance revokes the partition before the offset is committed
- New owner reprocesses from the last committed offset
Solutions:
- Commit more frequently:
[Code listing omitted: Java, 2 lines]
- Commit on rebalance:
[Code listing omitted: Java, 4 lines]
- Idempotent processing:
[Code listing omitted: Java, 6 lines]
- Transactional consumers (for exactly-once):
[Code listing omitted: Java, 2 lines]
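Of these options, idempotent processing is the one that works regardless of when the rebalance happens. The sketch below shows the idea with an in-memory set keyed by topic, partition, and offset. It is a minimal illustration with hypothetical names, not a production pattern: a real consumer would keep the processed keys in a durable store shared by all group members, since the redelivery usually arrives at a different consumer.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class IdempotentProcessor {
    // In-memory only; assumed to stand in for a durable, shared deduplication store.
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    /** Runs the action once per (topic, partition, offset); returns false for a duplicate. */
    boolean processOnce(String topic, int partition, long offset, Runnable action) {
        String key = topic + "-" + partition + "@" + offset;
        if (!processed.add(key)) {
            return false; // already handled, skip the redelivered record
        }
        try {
            action.run();
            return true;
        } catch (RuntimeException e) {
            processed.remove(key); // allow a retry if processing failed
            throw e;
        }
    }

    public static void main(String[] args) {
        IdempotentProcessor processor = new IdempotentProcessor();
        Runnable handler = () -> System.out.println("processing order");
        System.out.println(processor.processOnce("orders", 0, 42L, handler)); // true
        System.out.println(processor.processOnce("orders", 0, 42L, handler)); // false, skipped
    }
}
```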
Q5: What factors affect how long a rebalance takes?
Group size:
- More consumers = more JoinGroup/SyncGroup messages
- O(n) network round trips
Partition count:
- More partitions = larger assignment calculation
- More data in SyncGroup response
Assignment strategy:
- Range: O(partitions)
- Sticky: O(partitions × consumers) for optimal assignment
Network latency:
- JoinGroup must reach all consumers
- SyncGroup response sent to all
- Retries if messages lost
Consumer callbacks:
- Slow onPartitionsRevoked callback
- State checkpoint/flush delays
Optimizations:
- Use cooperative rebalancing (reduces scope)
- Use static membership (avoids restart rebalances)
- Keep rebalance callbacks fast
- Right-size the consumer group
- Monitor and tune session.timeout.ms
Summary
Key Takeaways
- Consumer groups coordinate via a group coordinator broker that manages membership and triggers rebalances
- Rebalances are triggered by member joins, leaves, failures, or subscription changes
- Cooperative rebalancing (incremental) revokes only the partitions that move, so it is far less disruptive in production than eager (stop-the-world) rebalancing
- Static membership prevents unnecessary rebalances during rolling deployments
- Assignment strategies determine how partitions are distributed; CooperativeStickyAssignor is recommended
- Rebalance listeners are essential for stateful consumers to checkpoint state
- Concurrency should not exceed the partition count; consumers beyond it sit idle
- Monitor rebalance frequency; frequent rebalancing indicates configuration or infrastructure issues
Quick Reference
Recommended Configuration
[Code listing omitted: Properties, 10 lines]
Rebalance Triggers
| Trigger | Description |
|---|---|
| New consumer joins | Group membership increased |
| Consumer leaves | Graceful shutdown |
| Consumer fails | Heartbeat timeout |
| Subscription change | Topics added/removed |
| Partition count change | Topic partitions modified |
Assignment Strategies
| Strategy | Rebalance | Best For |
|---|---|---|
| Range | Eager | Co-partitioned topics |
| RoundRobin | Eager | Even distribution |
| Sticky | Eager | Minimize movement |
| CooperativeStickyAssignor | Cooperative | Production default |
Series Navigation
| Previous | Current | Next |
|---|---|---|
| Part 8: Consumer Internals | Part 9: Consumer Groups | Part 10: Offset Management |
Series Overview
- Part 0: How to Use This Series
- Parts 1-4: Fundamentals
- Parts 5-7: Producers
- Parts 8-11: Consumers (Internals, Groups, Offset Management, Exactly-Once)
- Parts 12-14: Operations
- Parts 15-17: Kafka Streams
- Parts 18-20: Patterns & Practices
- Part 21: Cheatsheet & Decision Guide