Devops

Error Handling & Dead Letter Queues

At a Glance

AspectDetails
GoalHandle all failure scenarios without losing data
Error TypesTransient (retry), Permanent (DLQ), Poison (skip/quarantine)
Retry StrategiesImmediate, fixed delay, exponential backoff
Dead Letter QueueParking lot for unprocessable messages
Spring IntegrationDefaultErrorHandler, RetryTemplate, DLT
Kafka StreamsDeserializationExceptionHandler, ProductionExceptionHandler
PrerequisitesParts 1-18 (core Kafka and patterns)

What You'll Learn

  • Categorizing errors: transient vs permanent vs poison
  • Implementing retry with exponential backoff
  • Dead letter queues for unprocessable messages
  • Handling deserialization failures
  • Error handling in Kafka Streams
  • Monitoring and alerting on errors
  • Recovery and replay strategies

Production Story: The Midnight Cascade

Saturday, 2:00 AM. Pager goes off.

"Consumer lag is growing exponentially," the alert read. A payment processing consumer was stuck. The error logs showed the same message being processed repeatedly:

ERROR Processing payment for order ORD-12345
ERROR Processing payment for order ORD-12345
ERROR Processing payment for order ORD-12345
... (thousands more)
A single malformed payment message—missing a required field—was throwing an exception. The default error handling was retrying infinitely. Meanwhile, 50,000 legitimate payments were piling up behind it.
The quick fix was to manually seek past the bad message. But the underlying problem remained: no strategy for handling poison messages.

After the incident, the team implemented proper error handling:

JAVA(15 lines)
Code
Loading syntax highlighter...

Now poison messages are retried 3 times, then moved to a dead letter queue for investigation—without blocking other messages.

Mental Model: Error Classification

┌──────────────────────────────────────────────────────────────────────────┐
│                      ERROR CLASSIFICATION                                │
└──────────────────────────────────────────────────────────────────────────┘

                         ┌─────────────────┐
                         │    ERROR        │
                         │    OCCURS       │
                         └────────┬────────┘
                                  │
              ┌───────────────────┼───────────────────┐
              ▼                   ▼                   ▼
      ┌───────────────┐   ┌───────────────┐   ┌───────────────┐
      │   TRANSIENT   │   │   PERMANENT   │   │    POISON     │
      │               │   │               │   │               │
      │ • Network     │   │ • Validation  │   │ • Malformed   │
      │ • Timeout     │   │ • Business    │   │ • Corrupt     │
      │ • Rate limit  │   │   rule fail   │   │ • Incompatible│
      │ • DB overload │   │ • Auth denied │   │   schema      │
      └───────┬───────┘   └───────┬───────┘   └───────┬───────┘
              │                   │                   │
              ▼                   ▼                   ▼
      ┌───────────────┐   ┌───────────────┐   ┌───────────────┐
      │    RETRY      │   │    DLQ        │   │    DLQ        │
      │               │   │               │   │    + ALERT    │
      │ Exponential   │   │ Investigate   │   │               │
      │ backoff       │   │ later         │   │ May need      │
      │ Max attempts  │   │               │   │ manual fix    │
      └───────────────┘   └───────────────┘   └───────────────┘

Deep Dive: Spring Kafka Error Handling

Basic Error Handler Configuration

JAVA(44 lines)
Code
Loading syntax highlighter...

Exponential Backoff

JAVA(28 lines)
Code
Loading syntax highlighter...
JAVA(51 lines)
Code
Loading syntax highlighter...

Retry Topic Flow

┌──────────────────────────────────────────────────────────────────────────┐
│                    @RETRYABLETOPIC FLOW                                  │
└──────────────────────────────────────────────────────────────────────────┘

                    payments (main topic)
                           │
                    ┌──────┴──────┐
                    │  Process    │
                    │  message    │
                    └──────┬──────┘
                           │
                     Success? ──Yes──▶ Done ✓
                           │
                          No
                           │
                           ▼
              payments-retry-0 (after 1s delay)
                           │
                    ┌──────┴──────┐
                    │  Retry #1   │
                    └──────┬──────┘
                           │
                     Success? ──Yes──▶ Done ✓
                           │
                          No
                           │
                           ▼
              payments-retry-1 (after 2s delay)
                           │
                    ┌──────┴──────┐
                    │  Retry #2   │
                    └──────┬──────┘
                           │
                     Success? ──Yes──▶ Done ✓
                           │
                          No
                           │
                           ▼
              payments-retry-2 (after 4s delay)
                           │
                    ┌──────┴──────┐
                    │  Retry #3   │
                    └──────┬──────┘
                           │
                     Success? ──Yes──▶ Done ✓
                           │
                          No
                           │
                           ▼
                   payments-dlt (Dead Letter Topic)
                           │
                    ┌──────┴──────┐
                    │ @DltHandler │
                    │ Log, Alert  │
                    │ Store       │
                    └─────────────┘

Deep Dive: Custom Error Classification

Smart Error Classifier

JAVA(64 lines)
Code
Loading syntax highlighter...

Custom Error Handler

JAVA(65 lines)
Code
Loading syntax highlighter...

Deep Dive: Handling Deserialization Errors

The Poison Pill Problem

┌──────────────────────────────────────────────────────────────────────────┐
│                    DESERIALIZATION ERROR FLOW                            │
└──────────────────────────────────────────────────────────────────────────┘

  Producer sends malformed message (e.g., wrong schema)
                    │
                    ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │                      Kafka Broker                                   │
  │         (doesn't validate - just stores bytes)                      │
  └─────────────────────────────────────────────────────────────────────┘
                    │
                    ▼
  Consumer tries to deserialize
                    │
                    ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │              DeserializationException!                              │
  │                                                                     │
  │  Your @KafkaListener NEVER GETS CALLED                              │
  │  Normal error handling doesn't help!                                │
  └─────────────────────────────────────────────────────────────────────┘
                    │
                    ▼
  Default behavior: Infinite retry loop (consumer stuck forever)

ErrorHandlingDeserializer

JAVA(31 lines)
Code
Loading syntax highlighter...

Handling Deserialization Exceptions

JAVA(63 lines)
Code
Loading syntax highlighter...

Deep Dive: Kafka Streams Error Handling

Deserialization Exception Handler

JAVA(75 lines)
Code
Loading syntax highlighter...

Production Exception Handler

JAVA(29 lines)
Code
Loading syntax highlighter...

Stream Processing Error Handling

JAVA(84 lines)
Code
Loading syntax highlighter...

Deep Dive: DLQ Management

DLQ Consumer and Reprocessing

JAVA(101 lines)
Code
Loading syntax highlighter...

DLQ REST API

JAVA(81 lines)
Code
Loading syntax highlighter...

Deep Dive: Monitoring Errors

Error Metrics

JAVA(74 lines)
Code
Loading syntax highlighter...

Alerting Rules

YAML(46 lines)
Code
Loading syntax highlighter...

Common Mistakes

1. No Error Classification

JAVA(20 lines)
Code
Loading syntax highlighter...

2. Unbounded Retries

JAVA(7 lines)
Code
Loading syntax highlighter...

3. Not Handling Deserialization Errors

JAVA(8 lines)
Code
Loading syntax highlighter...

4. Losing Error Context in DLQ

JAVA(13 lines)
Code
Loading syntax highlighter...

Debug This: The Zombie Consumer

A consumer keeps reprocessing the same message:

10:00:00 Processing order ORD-12345
10:00:01 ERROR: Connection timeout to inventory service
10:00:02 Processing order ORD-12345
10:00:03 ERROR: Connection timeout to inventory service
... (continues for hours)

Inventory service was down for maintenance. What went wrong?

Answer
The issues:
  1. No circuit breaker - Consumer keeps trying a known-dead service
  2. No retry limit - Infinite retries on transient error
  3. Blocking other messages - One bad message blocks the partition
Solutions:
  1. Add retry limits:
    JAVA(5 lines)
    Code
    Loading syntax highlighter...
  2. Add circuit breaker:
    JAVA(9 lines)
    Code
    Loading syntax highlighter...
  3. Consider non-blocking retry topics:
    • Failed messages go to retry topic
    • Other messages continue processing
    • Retry topic processed separately
  4. Add timeout awareness:
    JAVA(4 lines)
    Code
    Loading syntax highlighter...

The fundamental problem was treating a systemic issue (service down) like an individual message failure.

Exercises

Exercise 1: Implement Smart Retry

Create a retry strategy that:

  • Uses exponential backoff
  • Classifies errors (transient vs permanent)
  • Routes to different DLQ topics based on error type
  • Records metrics for each

Exercise 2: Build DLQ Dashboard

Create a management UI with:

  • List of failed messages with filtering
  • Message details view with payload and error
  • Reprocess and resolve actions
  • Statistics and charts

Exercise 3: Handle Schema Evolution Errors

Implement handling for:

  • Old consumer receiving new schema messages
  • Graceful degradation (process what you can)
  • Alert on unknown fields
  • Quarantine incompatible messages

Exercise 4: Kafka Streams Error Recovery

Build a Streams app with:

  • Custom deserialization handler
  • Branch-based error routing
  • State store for failed records
  • Recovery processor

Exercise 5: End-to-End Error Monitoring

Set up:

  • Prometheus metrics for all error types
  • Grafana dashboard
  • AlertManager rules
  • PagerDuty integration for critical errors

Interview Questions

Q1: "How do you handle a poison message that's blocking your consumer?"

What they're looking for: Practical problem-solving
Strong answer: "Poison messages need immediate identification and isolation:
Detection:
  • Consumer stuck on same offset
  • Error logs showing repeated failures
  • Growing lag for single partition
Immediate fix:
BASH(7 lines)
Code
Loading syntax highlighter...
Proper handling:
JAVA(2 lines)
Code
Loading syntax highlighter...
Prevention:
  • Use ErrorHandlingDeserializer for all consumers
  • Set max retry attempts (not unlimited)
  • Configure DLQ for all listeners
  • Schema validation where possible

The key is having error handling configured BEFORE you encounter poison messages."

Q2: "What's the difference between retry topics and the DLQ?"

What they're looking for: Understanding of error handling layers
Strong answer: "They serve different purposes in the error handling chain:
Retry Topics:
main-topic → retry-topic-1 → retry-topic-2 → DLQ

- Temporary holding for transient failures
- Processed automatically after delay
- Each retry topic has increasing delay
- Message expected to succeed eventually
Dead Letter Queue (DLQ):
- Final destination for unprocessable messages
- Requires manual investigation
- Messages won't be auto-reprocessed
- Stores error context (why it failed)
When messages go where:

Retry topics for:

  • Network timeouts
  • Database connection issues
  • Rate limiting
  • Temporary service unavailability

DLQ for:

  • Validation failures
  • Business rule violations
  • Data format issues
  • Exceeded all retries
@RetryableTopic creates both:
JAVA(17 lines)
Code
Loading syntax highlighter...
2. Key alerts:
  • Error rate > threshold (e.g., > 1% of messages)
  • DLQ growth rate
  • Consumer lag growing + errors
  • Deserialization errors (schema issue)
  • Retries exhausted rate
3. Dashboards:
  • Error rate by topic
  • DLQ depth and growth
  • Retry distribution (how many retries before success/failure)
  • Error type breakdown
4. Logging:
  • Structured logs with correlation IDs
  • Error context (key, partition, offset)
  • Stack traces for investigation
  • Don't log message payload (PII)
5. DLQ management:
  • Store failed messages in database
  • UI for investigation and reprocessing
  • Automatic retry for specific error types
  • Aging alerts (old unresolved messages)

The goal is to know about problems before users do, with enough context to diagnose quickly."

Q4: "How do you design a system that can recover from extended downstream outages?"

What they're looking for: Resilience thinking
Strong answer: "Several patterns combined:
1. Circuit breaker - Stop trying when service is known-down:
JAVA(3 lines)
Code
Loading syntax highlighter...
2. Retry with backoff - Increasing delays:
JAVA(2 lines)
Code
Loading syntax highlighter...
3. Non-blocking retry topics - Don't block partition:
main-topic → (fail) → retry-topic-30s → (fail) → retry-topic-5m → DLQ

Other messages continue while failed ones wait.

4. Bulkhead pattern - Isolate impacts:
  • Separate consumer groups for different criticality
  • Different retry policies per topic
  • Critical messages have more retries
5. Graceful degradation:
  • Store for later processing (database)
  • Partial processing (do what you can)
  • Async notification when service recovers
6. Recovery automation:
JAVA(6 lines)
Code
Loading syntax highlighter...

The key is accepting that downstream services WILL fail and designing for it upfront."

Q5: "How do you handle errors in exactly-once processing?"

What they're looking for: Transaction semantics understanding
Strong answer: "Exactly-once complicates error handling because we can't just retry blindly:
Challenge:
1. Read from input topic
2. Process (may fail here)
3. Write to output topic + commit offset (atomic)
If processing fails:
Transient error:
JAVA(4 lines)
Code
Loading syntax highlighter...
Permanent error:
JAVA(15 lines)
Code
Loading syntax highlighter...
Idempotency key pattern:
JAVA(6 lines)
Code
Loading syntax highlighter...

The key insight: exactly-once guarantees atomicity of read-process-write, but you still need strategies for what to do when processing fails."

Summary & Key Takeaways

Error Handling Summary

┌──────────────────────────────────────────────────────────────────────────┐
│                    ERROR HANDLING SUMMARY                                │
├────────────────────────────────┬─────────────────────────────────────────┤
│ Error Type                     │ Handling Strategy                       │
├────────────────────────────────┼─────────────────────────────────────────┤
│ Transient (timeout, etc.)      │ Retry with exponential backoff          │
│ Permanent (validation, etc.)   │ Send to DLQ immediately                 │
│ Poison (deserialization)       │ Quarantine, alert                       │
│ Unknown                        │ DLQ + alert, investigate                │
├────────────────────────────────┼─────────────────────────────────────────┤
│ Component                      │ Configuration                           │
├────────────────────────────────┼─────────────────────────────────────────┤
│ Spring Kafka                   │ @RetryableTopic, DefaultErrorHandler    │
│ Kafka Streams                  │ DeserializationExceptionHandler         │
│ Deserialization                │ ErrorHandlingDeserializer               │
│ Monitoring                     │ Metrics + alerts + DLQ dashboard        │
└────────────────────────────────┴─────────────────────────────────────────┘

Essential Takeaways

  1. Classify errors - Transient errors should retry, permanent errors should DLQ.
  2. Always have retry limits - Infinite retries = stuck consumer.
  3. Handle deserialization separately - Use ErrorHandlingDeserializer or you'll have poison pills.
  4. Preserve error context - Include original topic, error message, stack trace in DLQ records.
  5. Monitor and alert - Know about DLQ growth before it becomes a problem.
  6. Have a DLQ strategy - UI for investigation, reprocessing capability, aging alerts.

Quick Reference

┌───────────────────────────────────────────────────────────────────────────┐
│                    ERROR HANDLING QUICK REFERENCE                         │
├────────────────────────────────┬──────────────────────────────────────────┤
│ Configuration                  │ Code/Property                            │
├────────────────────────────────┼──────────────────────────────────────────┤
│ Spring retry topic             │ @RetryableTopic(attempts=4, backoff=...) │
│ DLQ handler                    │ @DltHandler                              │
│ Error deserializer             │ ErrorHandlingDeserializer                │
│ Error classifier               │ DefaultErrorHandler.addNotRetryable()    │
│ Exponential backoff            │ ExponentialBackOff(initial, multiplier)  │
├────────────────────────────────┼──────────────────────────────────────────┤
│ Kafka Streams                  │                                          │
├────────────────────────────────┼──────────────────────────────────────────┤
│ Deserialize handler            │ DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER│
│ Production handler             │ DEFAULT_PRODUCTION_EXCEPTION_HANDLER     │
│ Options                        │ LogAndContinue, LogAndFail, Custom       │
└────────────────────────────────┴──────────────────────────────────────────┘

Series Navigation

Series Overview: Kafka Compendium Series