Error Handling & Dead Letter Queues
At a Glance
| Aspect | Details |
|---|---|
| Goal | Handle all failure scenarios without losing data |
| Error Types | Transient (retry), Permanent (DLQ), Poison (skip/quarantine) |
| Retry Strategies | Immediate, fixed delay, exponential backoff |
| Dead Letter Queue | Parking lot for unprocessable messages |
| Spring Integration | DefaultErrorHandler, RetryTemplate, DLT |
| Kafka Streams | DeserializationExceptionHandler, ProductionExceptionHandler |
| Prerequisites | Parts 1-18 (core Kafka and patterns) |
What You'll Learn
- Categorizing errors: transient vs permanent vs poison
- Implementing retry with exponential backoff
- Dead letter queues for unprocessable messages
- Handling deserialization failures
- Error handling in Kafka Streams
- Monitoring and alerting on errors
- Recovery and replay strategies
Production Story: The Midnight Cascade
"Consumer lag is growing exponentially," the alert read. A payment processing consumer was stuck. The error logs showed the same message being processed repeatedly:
ERROR Processing payment for order ORD-12345 ERROR Processing payment for order ORD-12345 ERROR Processing payment for order ORD-12345 ... (thousands more)
After the incident, the team implemented proper error handling:
JAVA(15 lines)CodeLoading syntax highlighter...
Now poison messages are retried 3 times, then moved to a dead letter queue for investigation—without blocking other messages.
Mental Model: Error Classification
┌──────────────────────────────────────────────────────────────────────────┐ │ ERROR CLASSIFICATION │ └──────────────────────────────────────────────────────────────────────────┘ ┌─────────────────┐ │ ERROR │ │ OCCURS │ └────────┬────────┘ │ ┌───────────────────┼───────────────────┐ ▼ ▼ ▼ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ TRANSIENT │ │ PERMANENT │ │ POISON │ │ │ │ │ │ │ │ • Network │ │ • Validation │ │ • Malformed │ │ • Timeout │ │ • Business │ │ • Corrupt │ │ • Rate limit │ │ rule fail │ │ • Incompatible│ │ • DB overload │ │ • Auth denied │ │ schema │ └───────┬───────┘ └───────┬───────┘ └───────┬───────┘ │ │ │ ▼ ▼ ▼ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ RETRY │ │ DLQ │ │ DLQ │ │ │ │ │ │ + ALERT │ │ Exponential │ │ Investigate │ │ │ │ backoff │ │ later │ │ May need │ │ Max attempts │ │ │ │ manual fix │ └───────────────┘ └───────────────┘ └───────────────┘
Deep Dive: Spring Kafka Error Handling
Basic Error Handler Configuration
JAVA(44 lines)CodeLoading syntax highlighter...
Exponential Backoff
JAVA(28 lines)CodeLoading syntax highlighter...
Using @RetryableTopic (Recommended)
JAVA(51 lines)CodeLoading syntax highlighter...
Retry Topic Flow
┌──────────────────────────────────────────────────────────────────────────┐ │ @RETRYABLETOPIC FLOW │ └──────────────────────────────────────────────────────────────────────────┘ payments (main topic) │ ┌──────┴──────┐ │ Process │ │ message │ └──────┬──────┘ │ Success? ──Yes──▶ Done ✓ │ No │ ▼ payments-retry-0 (after 1s delay) │ ┌──────┴──────┐ │ Retry #1 │ └──────┬──────┘ │ Success? ──Yes──▶ Done ✓ │ No │ ▼ payments-retry-1 (after 2s delay) │ ┌──────┴──────┐ │ Retry #2 │ └──────┬──────┘ │ Success? ──Yes──▶ Done ✓ │ No │ ▼ payments-retry-2 (after 4s delay) │ ┌──────┴──────┐ │ Retry #3 │ └──────┬──────┘ │ Success? ──Yes──▶ Done ✓ │ No │ ▼ payments-dlt (Dead Letter Topic) │ ┌──────┴──────┐ │ @DltHandler │ │ Log, Alert │ │ Store │ └─────────────┘
Deep Dive: Custom Error Classification
Smart Error Classifier
JAVA(64 lines)CodeLoading syntax highlighter...
Custom Error Handler
JAVA(65 lines)CodeLoading syntax highlighter...
Deep Dive: Handling Deserialization Errors
The Poison Pill Problem
┌──────────────────────────────────────────────────────────────────────────┐ │ DESERIALIZATION ERROR FLOW │ └──────────────────────────────────────────────────────────────────────────┘ Producer sends malformed message (e.g., wrong schema) │ ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ Kafka Broker │ │ (doesn't validate - just stores bytes) │ └─────────────────────────────────────────────────────────────────────┘ │ ▼ Consumer tries to deserialize │ ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ DeserializationException! │ │ │ │ Your @KafkaListener NEVER GETS CALLED │ │ Normal error handling doesn't help! │ └─────────────────────────────────────────────────────────────────────┘ │ ▼ Default behavior: Infinite retry loop (consumer stuck forever)
ErrorHandlingDeserializer
JAVA(31 lines)CodeLoading syntax highlighter...
Handling Deserialization Exceptions
JAVA(63 lines)CodeLoading syntax highlighter...
Deep Dive: Kafka Streams Error Handling
Deserialization Exception Handler
JAVA(75 lines)CodeLoading syntax highlighter...
Production Exception Handler
JAVA(29 lines)CodeLoading syntax highlighter...
Stream Processing Error Handling
JAVA(84 lines)CodeLoading syntax highlighter...
Deep Dive: DLQ Management
DLQ Consumer and Reprocessing
JAVA(101 lines)CodeLoading syntax highlighter...
DLQ REST API
JAVA(81 lines)CodeLoading syntax highlighter...
Deep Dive: Monitoring Errors
Error Metrics
JAVA(74 lines)CodeLoading syntax highlighter...
Alerting Rules
YAML(46 lines)CodeLoading syntax highlighter...
Common Mistakes
1. No Error Classification
JAVA(20 lines)CodeLoading syntax highlighter...
2. Unbounded Retries
JAVA(7 lines)CodeLoading syntax highlighter...
3. Not Handling Deserialization Errors
JAVA(8 lines)CodeLoading syntax highlighter...
4. Losing Error Context in DLQ
JAVA(13 lines)CodeLoading syntax highlighter...
Debug This: The Zombie Consumer
A consumer keeps reprocessing the same message:
10:00:00 Processing order ORD-12345 10:00:01 ERROR: Connection timeout to inventory service 10:00:02 Processing order ORD-12345 10:00:03 ERROR: Connection timeout to inventory service ... (continues for hours)
Inventory service was down for maintenance. What went wrong?
Answer
- No circuit breaker - Consumer keeps trying a known-dead service
- No retry limit - Infinite retries on transient error
- Blocking other messages - One bad message blocks the partition
-
Add retry limits:JAVA(5 lines)CodeLoading syntax highlighter...
-
Add circuit breaker:JAVA(9 lines)CodeLoading syntax highlighter...
-
Consider non-blocking retry topics:
- Failed messages go to retry topic
- Other messages continue processing
- Retry topic processed separately
-
Add timeout awareness:JAVA(4 lines)CodeLoading syntax highlighter...
The fundamental problem was treating a systemic issue (service down) like an individual message failure.
Exercises
Exercise 1: Implement Smart Retry
Create a retry strategy that:
- Uses exponential backoff
- Classifies errors (transient vs permanent)
- Routes to different DLQ topics based on error type
- Records metrics for each
Exercise 2: Build DLQ Dashboard
Create a management UI with:
- List of failed messages with filtering
- Message details view with payload and error
- Reprocess and resolve actions
- Statistics and charts
Exercise 3: Handle Schema Evolution Errors
Implement handling for:
- Old consumer receiving new schema messages
- Graceful degradation (process what you can)
- Alert on unknown fields
- Quarantine incompatible messages
Exercise 4: Kafka Streams Error Recovery
Build a Streams app with:
- Custom deserialization handler
- Branch-based error routing
- State store for failed records
- Recovery processor
Exercise 5: End-to-End Error Monitoring
Set up:
- Prometheus metrics for all error types
- Grafana dashboard
- AlertManager rules
- PagerDuty integration for critical errors
Interview Questions
Q1: "How do you handle a poison message that's blocking your consumer?"
- Consumer stuck on same offset
- Error logs showing repeated failures
- Growing lag for single partition
BASH(7 lines)CodeLoading syntax highlighter...
JAVA(2 lines)CodeLoading syntax highlighter...
- Use ErrorHandlingDeserializer for all consumers
- Set max retry attempts (not unlimited)
- Configure DLQ for all listeners
- Schema validation where possible
The key is having error handling configured BEFORE you encounter poison messages."
Q2: "What's the difference between retry topics and the DLQ?"
main-topic → retry-topic-1 → retry-topic-2 → DLQ - Temporary holding for transient failures - Processed automatically after delay - Each retry topic has increasing delay - Message expected to succeed eventually
- Final destination for unprocessable messages - Requires manual investigation - Messages won't be auto-reprocessed - Stores error context (why it failed)
Retry topics for:
- Network timeouts
- Database connection issues
- Rate limiting
- Temporary service unavailability
DLQ for:
- Validation failures
- Business rule violations
- Data format issues
- Exceeded all retries
JAVA(17 lines)CodeLoading syntax highlighter...
- Error rate > threshold (e.g., > 1% of messages)
- DLQ growth rate
- Consumer lag growing + errors
- Deserialization errors (schema issue)
- Retries exhausted rate
- Error rate by topic
- DLQ depth and growth
- Retry distribution (how many retries before success/failure)
- Error type breakdown
- Structured logs with correlation IDs
- Error context (key, partition, offset)
- Stack traces for investigation
- Don't log message payload (PII)
- Store failed messages in database
- UI for investigation and reprocessing
- Automatic retry for specific error types
- Aging alerts (old unresolved messages)
The goal is to know about problems before users do, with enough context to diagnose quickly."
Q4: "How do you design a system that can recover from extended downstream outages?"
JAVA(3 lines)CodeLoading syntax highlighter...
JAVA(2 lines)CodeLoading syntax highlighter...
main-topic → (fail) → retry-topic-30s → (fail) → retry-topic-5m → DLQ
Other messages continue while failed ones wait.
- Separate consumer groups for different criticality
- Different retry policies per topic
- Critical messages have more retries
- Store for later processing (database)
- Partial processing (do what you can)
- Async notification when service recovers
JAVA(6 lines)CodeLoading syntax highlighter...
The key is accepting that downstream services WILL fail and designing for it upfront."
Q5: "How do you handle errors in exactly-once processing?"
1. Read from input topic 2. Process (may fail here) 3. Write to output topic + commit offset (atomic)
JAVA(4 lines)CodeLoading syntax highlighter...
JAVA(15 lines)CodeLoading syntax highlighter...
JAVA(6 lines)CodeLoading syntax highlighter...
The key insight: exactly-once guarantees atomicity of read-process-write, but you still need strategies for what to do when processing fails."
Summary & Key Takeaways
Error Handling Summary
┌──────────────────────────────────────────────────────────────────────────┐ │ ERROR HANDLING SUMMARY │ ├────────────────────────────────┬─────────────────────────────────────────┤ │ Error Type │ Handling Strategy │ ├────────────────────────────────┼─────────────────────────────────────────┤ │ Transient (timeout, etc.) │ Retry with exponential backoff │ │ Permanent (validation, etc.) │ Send to DLQ immediately │ │ Poison (deserialization) │ Quarantine, alert │ │ Unknown │ DLQ + alert, investigate │ ├────────────────────────────────┼─────────────────────────────────────────┤ │ Component │ Configuration │ ├────────────────────────────────┼─────────────────────────────────────────┤ │ Spring Kafka │ @RetryableTopic, DefaultErrorHandler │ │ Kafka Streams │ DeserializationExceptionHandler │ │ Deserialization │ ErrorHandlingDeserializer │ │ Monitoring │ Metrics + alerts + DLQ dashboard │ └────────────────────────────────┴─────────────────────────────────────────┘
Essential Takeaways
-
Classify errors - Transient errors should retry, permanent errors should DLQ.
-
Always have retry limits - Infinite retries = stuck consumer.
-
Handle deserialization separately - Use ErrorHandlingDeserializer or you'll have poison pills.
-
Preserve error context - Include original topic, error message, stack trace in DLQ records.
-
Monitor and alert - Know about DLQ growth before it becomes a problem.
-
Have a DLQ strategy - UI for investigation, reprocessing capability, aging alerts.
Quick Reference
┌───────────────────────────────────────────────────────────────────────────┐ │ ERROR HANDLING QUICK REFERENCE │ ├────────────────────────────────┬──────────────────────────────────────────┤ │ Configuration │ Code/Property │ ├────────────────────────────────┼──────────────────────────────────────────┤ │ Spring retry topic │ @RetryableTopic(attempts=4, backoff=...) │ │ DLQ handler │ @DltHandler │ │ Error deserializer │ ErrorHandlingDeserializer │ │ Error classifier │ DefaultErrorHandler.addNotRetryable() │ │ Exponential backoff │ ExponentialBackOff(initial, multiplier) │ ├────────────────────────────────┼──────────────────────────────────────────┤ │ Kafka Streams │ │ ├────────────────────────────────┼──────────────────────────────────────────┤ │ Deserialize handler │ DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER│ │ Production handler │ DEFAULT_PRODUCTION_EXCEPTION_HANDLER │ │ Options │ LogAndContinue, LogAndFail, Custom │ └────────────────────────────────┴──────────────────────────────────────────┘