Architecture & Storage Engine

📋 At a Glance

Aspect           Details
Difficulty       🟡 Intermediate
Prerequisites    Basic Kafka concepts (topics, partitions)
Key Concepts     Log, segments, zero-copy, page cache, message format
Time Investment  32 minutes read + 45 minutes practice
Payoff           Understand why Kafka handles millions of messages/sec

🎯 What You'll Learn

After this article, you'll be able to:

  1. Explain the log abstraction and why it's perfect for messaging
  2. Understand segment files and how Kafka stores data on disk
  3. Describe zero-copy transfer and why it eliminates overhead
  4. Leverage the page cache for performance tuning
  5. Parse message format including headers, keys, and timestamps

🔥 Production Story: The Page Cache Mystery

The Setup: An e-commerce company ran a 5-broker Kafka cluster handling 500K messages/second. Performance was excellent until they deployed a new monitoring system.
The Symptoms:
Before monitoring: Producer latency p99 = 5ms
After monitoring:  Producer latency p99 = 150ms (30x worse!)

Throughput dropped from 500K to 50K msg/sec. No code changes to Kafka. Same hardware.

The Investigation:
BASH(8 lines)

Read I/O jumped 50x! But Kafka consumers hadn't changed.

The Root Cause: The new monitoring system ran heavy queries against the OS metrics, consuming memory. This evicted Kafka's data from the page cache.

Normally, Kafka serves reads directly from page cache (RAM). When data was evicted, Kafka had to read from disk, turning a memory operation into a disk operation.

The Fix:
BASH(3 lines)
Lesson Learned: Kafka's performance depends heavily on the page cache. Anything that competes for memory affects Kafka. This is why Kafka servers should be dedicated.

🧠 Mental Model: Kafka's Storage Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                      KAFKA STORAGE ARCHITECTURE                      │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   PRODUCER                                                           │
│      │                                                               │
│      ▼                                                               │
│   ┌──────────────────────────────────────────────────────────────┐   │
│   │                    KAFKA BROKER                              │   │
│   │                                                              │   │
│   │  ┌────────────────────────────────────────────────────────┐  │   │
│   │  │              PAGE CACHE (RAM)                          │  │   │
│   │  │  ┌─────────┐ ┌─────────┐ ┌─────────┐                   │  │   │
│   │  │  │ Recent  │ │ Recent  │ │ Recent  │                   │  │   │
│   │  │  │ Segment │ │ Segment │ │ Segment │                   │  │   │
│   │  │  │   P0    │ │   P1    │ │   P2    │                   │  │   │
│   │  │  └─────────┘ └─────────┘ └─────────┘                   │  │   │
│   │  └────────────────────────────────────────────────────────┘  │   │
│   │                         │                                    │   │
│   │                Async flush (writes)                          │   │
│   │                Read on cache miss                            │   │
│   │                         │                                    │   │
│   │  ┌────────────────────────────────────────────────────────┐  │   │
│   │  │                    DISK                                │  │   │
│   │  │                                                        │  │   │
│   │  │  Topic: orders                                         │  │   │
│   │  │  ┌─────────────────────────────────────────────────┐   │  │   │
│   │  │  │ Partition 0                                     │   │  │   │
│   │  │  │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │   │  │   │
│   │  │  │ │ Segment │ │ Segment │ │ Segment │ │ Active  │ │   │  │   │
│   │  │  │ │  .log   │ │  .log   │ │  .log   │ │ Segment │ │   │  │   │
│   │  │  │ │  0-999  │ │1000-1999│ │2000-2999│ │  3000+  │ │   │  │   │
│   │  │  │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │   │  │   │
│   │  │  └─────────────────────────────────────────────────┘   │  │   │
│   │  └────────────────────────────────────────────────────────┘  │   │
│   └──────────────────────────────────────────────────────────────┘   │
│                         │                                            │
│                    Zero-Copy                                         │
│                    sendfile()                                        │
│                         │                                            │
│                         ▼                                            │
│                      CONSUMER                                        │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Key Insight: Data flows from Producer → Page Cache → Disk
            Consumer reads: Page Cache (fast) or Disk (slow)
            Recent data is almost always in page cache!

🔬 Deep Dive

1. The Log Abstraction

At its core, Kafka is a distributed commit log. What does that mean?
A log is an append-only, ordered sequence of records:
┌─────────────────────────────────────────────────────────────────┐
│                         KAFKA LOG                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Offset:  0      1      2      3      4      5      6     ...  │
│          ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐       │
│          │ M0 │ │ M1 │ │ M2 │ │ M3 │ │ M4 │ │ M5 │ │ M6 │ →     │
│          └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘       │
│                                                                 │
│   Properties:                                                   │
│   • Append-only (writes go to the end)                          │
│   • Ordered (offset determines order)                           │
│   • Immutable (existing records never change)                   │
│   • Persistent (survives restarts)                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Why is a log perfect for messaging?
  1. Sequential writes: Appending to the end is O(1), regardless of data size
  2. Sequential reads: Consumers read in order, which is disk-friendly
  3. Simplicity: No complex data structures, just files
  4. Durability: Once written, data stays until explicitly deleted
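The four properties above can be sketched as a toy in-memory log. This is only an illustration: the class name `AppendOnlyLog` and its methods are hypothetical, and a real Kafka partition keeps its records in segment files on disk rather than in a Java list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy in-memory log: append-only, ordered by offset, records never modified. */
final class AppendOnlyLog {
    private final List<String> records = new ArrayList<>();

    /** Appends a record at the end and returns the offset assigned to it. */
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    /** Returns up to maxRecords records starting at fromOffset, in offset order. */
    List<String> read(long fromOffset, int maxRecords) {
        if (fromOffset < 0 || fromOffset > records.size() || maxRecords < 0) {
            throw new IllegalArgumentException("bad range: " + fromOffset + ", " + maxRecords);
        }
        int start = (int) fromOffset;
        int end = (int) Math.min((long) records.size(), (long) start + maxRecords);
        // Copy the slice so callers cannot modify the log through the returned list.
        return Collections.unmodifiableList(new ArrayList<>(records.subList(start, end)));
    }

    /** Offset that the next appended record will receive. */
    long endOffset() {
        return records.size();
    }
}
```

Because a read only needs a starting offset, any number of consumers can read the same log independently, and re-reading from an older offset is all that "replay" requires.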
Traditional message brokers vs Kafka:
Aspect              Traditional (e.g., RabbitMQ)    Kafka
Data structure      Queue (FIFO, delete on read)    Log (append-only)
Message delivery    Push to consumer                Consumer pulls
Message retention   Until acknowledged              Time or size based
Random access       No                              Yes (by offset)
Replay capability   No                              Yes

2. Segments: How Data Lives on Disk

A partition isn't stored as one giant file. It's split into segments:
┌─────────────────────────────────────────────────────────────────┐
│                    PARTITION SEGMENTS                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   /var/kafka/data/orders-0/                                     │
│   │                                                             │
│   ├── 00000000000000000000.log      (offsets 0-999)             │
│   ├── 00000000000000000000.index    (offset → position)         │
│   ├── 00000000000000000000.timeindex (timestamp → offset)       │
│   │                                                             │
│   ├── 00000000000000001000.log      (offsets 1000-1999)         │
│   ├── 00000000000000001000.index                                │
│   ├── 00000000000000001000.timeindex                            │
│   │                                                             │
│   ├── 00000000000000002000.log      (offsets 2000-2999)         │
│   ├── 00000000000000002000.index                                │
│   ├── 00000000000000002000.timeindex                            │
│   │                                                             │
│   └── 00000000000000003000.log      ← ACTIVE SEGMENT            │
│       00000000000000003000.index      (writes go here)          │
│       00000000000000003000.timeindex                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Why segments?
  1. Efficient deletion: Delete whole segment files, not individual messages
  2. Efficient compaction: Process segment by segment
  3. Parallel I/O: Different segments can be read simultaneously
  4. Memory mapping: Each segment's index files stay small, which makes them cheap to mmap
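The file names in the listing are the segment's base offset, zero-padded to 20 digits, so finding the segment that holds a given offset is a "greatest base offset not above the target" lookup. A minimal sketch, assuming a hypothetical `SegmentLookup` helper (this is not Kafka's internal class):

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Maps base offsets to segment file names and finds the segment for an offset. */
final class SegmentLookup {
    // Base offset -> segment file name, kept sorted by base offset.
    private final TreeMap<Long, String> segments = new TreeMap<>();

    /** Segment files are named after their base offset, zero-padded to 20 digits. */
    static String fileName(long baseOffset) {
        return String.format(Locale.ROOT, "%020d.log", baseOffset);
    }

    void addSegment(long baseOffset) {
        segments.put(baseOffset, fileName(baseOffset));
    }

    /** Segment with the greatest base offset <= offset, or null if there is none. */
    String segmentFor(long offset) {
        Map.Entry<Long, String> entry = segments.floorEntry(offset);
        return entry == null ? null : entry.getValue();
    }
}
```

With segments at base offsets 0, 1000, 2000 and 3000, a request for offset 1500 resolves to `00000000000000001000.log`. Retention works on the same unit: deleting old data means removing whole files from the front of this map.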
Segment configuration:
PROPERTIES(7 lines)
The index files:
┌─────────────────────────────────────────────────────────────────┐
│                    INDEX FILE STRUCTURE                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   .index file (offset → file position)                          │
│   ┌──────────────────────────────────────────────────────────┐  │
│   │  Offset  │  Position in .log file                        │  │
│   ├──────────┼───────────────────────────────────────────────┤  │
│   │  0       │  0                                            │  │
│   │  50      │  4096                                         │  │
│   │  100     │  8192                                         │  │
│   │  150     │  12288                                        │  │
│   └──────────┴───────────────────────────────────────────────┘  │
│                                                                 │
│   Index is sparse! Not every offset is indexed.                 │
│   To find offset 75:                                            │
│   1. Binary search index → find entry ≤ 75 (offset 50)          │
│   2. Seek to position 4096 in .log                              │
│   3. Scan forward to find offset 75                             │
│                                                                 │
│   .timeindex file (timestamp → offset)                          │
│   ┌──────────────────────────────────────────────────────────┐  │
│   │  Timestamp         │  Offset                             │  │
│   ├────────────────────┼─────────────────────────────────────┤  │
│   │  1705312800000     │  0                                  │  │
│   │  1705312860000     │  1000                               │  │
│   │  1705312920000     │  2000                               │  │
│   └────────────────────┴─────────────────────────────────────┘  │
│                                                                 │
│   Used for: offsetsForTimes() API                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
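The sparse lookup in steps 1 and 2 is a floor search over the indexed offsets. The sketch below uses the example values from the diagram; `SparseIndex` and `floorPosition` are hypothetical names, and the index is held in plain arrays rather than in a memory-mapped file.

```java
import java.util.Arrays;

/** Sparse offset index: only some offsets are indexed, each with a byte position. */
final class SparseIndex {
    private final long[] offsets;    // indexed offsets, ascending
    private final long[] positions;  // byte position in the .log file per indexed offset

    SparseIndex(long[] offsets, long[] positions) {
        if (offsets.length != positions.length) {
            throw new IllegalArgumentException("offsets and positions must have equal length");
        }
        this.offsets = offsets.clone();
        this.positions = positions.clone();
    }

    /**
     * Byte position of the last index entry whose offset is <= target,
     * or -1 if the target lies before the first indexed offset.
     */
    long floorPosition(long target) {
        int i = Arrays.binarySearch(offsets, target);
        if (i < 0) {
            // Not found: binarySearch returns -(insertionPoint) - 1,
            // and the floor entry sits just before the insertion point.
            i = -i - 2;
        }
        return i < 0 ? -1 : positions[i];
    }
}
```

For the table above, `floorPosition(75)` yields 4096. Step 3, scanning forward through the `.log` file from that position until offset 75 is reached, is left out of the sketch.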

3. Zero-Copy Transfer

This is one of Kafka's biggest performance secrets.

Traditional data transfer (4 copies!):
┌─────────────────────────────────────────────────────────────────┐
│                 TRADITIONAL DATA TRANSFER                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   DISK                                                          │
│     │                                                           │
│     │ 1. read() - DMA copy to kernel buffer                     │
│     ▼                                                           │
│   KERNEL BUFFER (Page Cache)                                    │
│     │                                                           │
│     │ 2. CPU copy to application buffer                         │
│     ▼                                                           │
│   APPLICATION BUFFER (JVM Heap)                                 │
│     │                                                           │
│     │ 3. CPU copy to socket buffer                              │
│     ▼                                                           │
│   SOCKET BUFFER (Kernel)                                        │
│     │                                                           │
│     │ 4. DMA copy to NIC                                        │
│     ▼                                                           │
│   NETWORK                                                       │
│                                                                 │
│   Total: 4 copies, 4 context switches (read + write)            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Kafka's zero-copy transfer (sendfile):
┌─────────────────────────────────────────────────────────────────┐
│                    ZERO-COPY TRANSFER                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   DISK                                                          │
│     │                                                           │
│     │ 1. DMA copy to kernel buffer                              │
│     ▼                                                           │
│   KERNEL BUFFER (Page Cache)                                    │
│     │                                                           │
│     │ 2. sendfile() - DMA scatter/gather to NIC                 │
│     ▼                                                           │
│   NETWORK                                                       │
│                                                                 │
│   Total: 2 copies (both DMA, no CPU involved)                   │
│          2 context switches (one sendfile call)                 │
│          Data never enters JVM heap!                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
The Java code that enables this:
JAVA(6 lines)
Why this matters:
Metric            Traditional      Zero-Copy
CPU usage         High             Minimal
Memory copies     4                2
Context switches  4                2
Throughput        Limited by CPU   Limited by NIC
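In Java, the zero-copy path is reached through `FileChannel.transferTo`, which the JDK can implement with `sendfile()` on Linux when the target is a socket channel. The following is a sketch of that call pattern, not Kafka's actual broker code; `ZeroCopySend` and `send` are hypothetical names.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Streams a byte range of a file to a channel without copying it through a heap buffer. */
final class ZeroCopySend {
    /** Sends up to count bytes of file, starting at position, and returns the bytes sent. */
    static long send(Path file, long position, long count, WritableByteChannel target)
            throws IOException {
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = in.size();
            long end = Math.min(size, position + Math.min(count, size));
            long pos = position;
            while (pos < end) {
                // transferTo may move fewer bytes than requested, so loop until done.
                long n = in.transferTo(pos, end - pos, target);
                if (n <= 0) {
                    break; // target cannot accept more right now
                }
                pos += n;
            }
            return Math.max(0, pos - position);
        }
    }
}
```

The code never allocates a `byte[]` for the payload, which is the point: the bytes are handed from the file to the target by the kernel. Whether the transfer is truly zero-copy depends on the operating system and on the target channel type; with an in-memory target, for example, the JDK falls back to an ordinary copy.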

4. The Page Cache: Kafka's Secret Weapon

The page cache is OS-managed memory that caches disk data:

┌──────────────────────────────────────────────────────────────────┐
│                      PAGE CACHE OPERATION                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   WRITE PATH (Producer → Broker)                                 │
│                                                                  │
│   Producer sends message                                         │
│         │                                                        │
│         ▼                                                        │
│   Broker writes to page cache  ← NOT directly to disk!           │
│         │                                                        │
│         ├── Acknowledgment sent to producer                      │
│         │   (acks=1: now; acks=all: once all ISR have it)        │
│         │                                                        │
│         ▼                                                        │
│   OS flushes to disk asynchronously (later)                      │
│                                                                  │
│   ─────────────────────────────────────────────────────────────  │
│                                                                  │
│   READ PATH (Broker → Consumer)                                  │
│                                                                  │
│   Case 1: Data in page cache (FAST)                              │
│   ┌─────────────────────────────────────────────────────────┐    │
│   │  Consumer requests offset 1000                          │    │
│   │         │                                               │    │
│   │         ▼                                               │    │
│   │  Check page cache → HIT! (microseconds)                 │    │
│   │         │                                               │    │
│   │         ▼                                               │    │
│   │  Zero-copy to socket                                    │    │
│   └─────────────────────────────────────────────────────────┘    │
│                                                                  │
│   Case 2: Data not in page cache (SLOW)                          │
│   ┌─────────────────────────────────────────────────────────┐    │
│   │  Consumer requests offset 1000                          │    │
│   │         │                                               │    │
│   │         ▼                                               │    │
│   │  Check page cache → MISS                                │    │
│   │         │                                               │    │
│   │         ▼                                               │    │
│   │  Read from disk (milliseconds)                          │    │
│   │         │                                               │    │
│   │         ▼                                               │    │
│   │  Load into page cache                                   │    │
│   │         │                                               │    │
│   │         ▼                                               │    │
│   │  Send to consumer                                       │    │
│   └─────────────────────────────────────────────────────────┘    │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
Why Kafka loves the page cache:
  1. Hot data stays in RAM: Recent messages (most commonly read) stay cached
  2. No JVM heap pressure: Page cache is outside JVM, no GC impact
  3. Survives broker restarts: Restarting the JVM doesn't empty the cache, because the OS owns it (a machine reboot does)
  4. Automatic management: The OS handles caching and eviction, so there is no in-process cache to size or tune
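How much "recent" data stays hot follows from simple arithmetic: page cache size divided by the broker's write rate. The sketch below is a back-of-the-envelope estimate only. The class name is hypothetical, and it ignores everything else that competes for the cache (replication traffic, index files, other processes).

```java
/** Rough estimate of how many seconds of recent writes fit in the page cache. */
final class PageCacheWindow {
    static double secondsCached(long pageCacheBytes, long messagesPerSecond, long avgMessageBytes) {
        long bytesPerSecond = messagesPerSecond * avgMessageBytes;
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("write rate must be positive");
        }
        return (double) pageCacheBytes / bytesPerSecond;
    }
}
```

As an example with assumed numbers: a broker with 24 GiB of page cache taking 100K messages/second at an average of 1 KiB each gets `secondsCached(24L * 1024 * 1024 * 1024, 100_000, 1024)`, roughly 250 seconds. A consumer lagging by more than about four minutes on that broker would start reading from disk.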
Monitoring page cache:
BASH(14 lines)
Page cache best practices:
BASH(14 lines)
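On Linux, the current page cache size is reported in `/proc/meminfo` as the `Cached` field, in kB. If you want to track it from a JVM-based monitoring agent, a small parser is enough. This is a sketch with hypothetical names (`MemInfo`, `fieldKb`); it takes the file content as a string so it can be tried without a Linux host.

```java
import java.util.OptionalLong;

/** Extracts numeric fields (in kB) from the text of /proc/meminfo. */
final class MemInfo {
    /** Value in kB for a field such as "Cached", or empty if the field is absent. */
    static OptionalLong fieldKb(String meminfo, String field) {
        String prefix = field + ":";
        for (String line : meminfo.split("\n")) {
            if (line.startsWith(prefix)) {
                // Lines look like "Cached:         25165824 kB".
                String[] parts = line.substring(prefix.length()).trim().split("\\s+");
                return OptionalLong.of(Long.parseLong(parts[0]));
            }
        }
        return OptionalLong.empty();
    }
}
```

On a broker host the input would come from reading `/proc/meminfo` itself. A sudden drop in `Cached` together with a rise in disk reads is the pattern described in the production story above.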

5. Message Format (Record Batch)

Messages in Kafka are stored in record batches:
┌─────────────────────────────────────────────────────────────────┐
│                      RECORD BATCH FORMAT                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Record Batch (v2, Kafka 0.11+)                                │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ BATCH HEADER (61 bytes)                                 │   │
│   │  ┌─────────────────────────────────────────────────────┐│   │
│   │  │ baseOffset         (8 bytes) - First offset         ││   │
│   │  │ batchLength        (4 bytes) - Size of batch        ││   │
│   │  │ partitionLeaderEpoch (4 bytes) - Leader version     ││   │
│   │  │ magic              (1 byte)  - Format version (2)   ││   │
│   │  │ crc                (4 bytes) - Checksum             ││   │
│   │  │ attributes         (2 bytes) - Compression, etc.    ││   │
│   │  │ lastOffsetDelta    (4 bytes) - Offset range         ││   │
│   │  │ firstTimestamp     (8 bytes) - Batch start time     ││   │
│   │  │ maxTimestamp       (8 bytes) - Batch end time       ││   │
│   │  │ producerId         (8 bytes) - For idempotence      ││   │
│   │  │ producerEpoch      (2 bytes) - Producer version     ││   │
│   │  │ baseSequence       (4 bytes) - For ordering         ││   │
│   │  │ recordCount        (4 bytes) - Number of records    ││   │
│   │  └─────────────────────────────────────────────────────┘│   │
│   │                                                         │   │
│   │  RECORDS (variable length, compressed together)         │   │
│   │  ┌─────────────────────────────────────────────────────┐│   │
│   │  │  Record 0                                           ││   │
│   │  │  ├── length (varint)                                ││   │
│   │  │  ├── attributes (1 byte)                            ││   │
│   │  │  ├── timestampDelta (varint)                        ││   │
│   │  │  ├── offsetDelta (varint)                           ││   │
│   │  │  ├── keyLength (varint)                             ││   │
│   │  │  ├── key (bytes)                                    ││   │
│   │  │  ├── valueLength (varint)                           ││   │
│   │  │  ├── value (bytes)                                  ││   │
│   │  │  ├── headersCount (varint)                          ││   │
│   │  │  └── headers[] (key-value pairs)                    ││   │
│   │  ├─────────────────────────────────────────────────────┤│   │
│   │  │  Record 1                                           ││   │
│   │  │  ...                                                ││   │
│   │  └─────────────────────────────────────────────────────┘│   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Key insights about the format:
  1. Batching: Multiple records in one batch = fewer I/O operations
  2. Compression: Applied to whole batch, not individual records
  3. Varints: Variable-length integers save space
  4. Headers: Key-value metadata (tracing, routing, etc.)
  5. Idempotence fields: producerId, producerEpoch, baseSequence
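To see why varints save space, here is a sketch of zigzag varint encoding for 32-bit integers, the Protocol Buffers style scheme that Kafka's record format uses for its per-record fields. The class `Varint` is a hypothetical stand-in for illustration, not Kafka's own utility class.

```java
import java.io.ByteArrayOutputStream;

/** Zigzag varint codec for signed 32-bit integers. */
final class Varint {
    /** Encodes value in 1 to 5 bytes; small magnitudes take fewer bytes. */
    static byte[] encode(int value) {
        // Zigzag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
        int v = (value << 1) ^ (value >> 31);
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((v & 0xFFFFFF80) != 0) {
            out.write((v & 0x7F) | 0x80); // low 7 bits, continuation bit set
            v >>>= 7;
        }
        out.write(v); // final byte, continuation bit clear
        return out.toByteArray();
    }

    /** Decodes a value produced by encode. */
    static int decode(byte[] bytes) {
        int v = 0;
        int shift = 0;
        for (byte b : bytes) {
            v |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // last byte of this varint
            }
            shift += 7;
        }
        return (v >>> 1) ^ -(v & 1); // undo zigzag
    }
}
```

Fields such as `offsetDelta` and `timestampDelta` are small numbers relative to the batch header, so most of them fit in a single byte instead of a fixed 4 or 8.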
Spring Kafka: Working with headers:
JAVA(38 lines)

6. Log Compaction

Besides time-based retention, Kafka supports log compaction for special use cases:
┌─────────────────────────────────────────────────────────────────┐
│                      LOG COMPACTION                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   BEFORE COMPACTION (all records kept)                          │
│                                                                 │
│   Offset: 0    1    2    3    4    5    6    7    8    9        │
│   Key:    A    B    A    C    B    A    C    A    B    C        │
│   Value:  v1   v1   v2   v1   v2   v3   v2   v4   v3   v3       │
│                                                                 │
│   ───────────────────────────────────────────────────────────── │
│                                                                 │
│   AFTER COMPACTION (only latest per key)                        │
│                                                                 │
│   Offset: 7    8    9                                           │
│   Key:    A    B    C                                           │
│   Value:  v4   v3   v3                                          │
│                                                                 │
│   • Keeps latest value for each key                             │
│   • Offsets preserved (not renumbered)                          │
│   • Tombstones (null value) delete keys                         │
│                                                                 │
│   Use cases:                                                    │
│   • Changelog topics (Kafka Streams state)                      │
│   • Database CDC (keep current state)                           │
│   • Configuration/reference data                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
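The before/after picture can be reproduced with a small in-memory model. This is only a sketch of the idea, not Kafka's log cleaner: `CompactionSketch`, `Rec`, and `diagramLog` are hypothetical names, and the sketch drops tombstones immediately, whereas a real broker keeps them for a configurable period first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CompactionSketch {
    /** One log record; a null value models a tombstone. */
    static final class Rec {
        final long offset; final String key; final String value;
        Rec(long offset, String key, String value) {
            this.offset = offset; this.key = key; this.value = value;
        }
        @Override public String toString() { return offset + ":" + key + "=" + value; }
    }

    /** Keeps only the newest record per key; surviving records keep their original offsets. */
    static List<Rec> compact(List<Rec> log) {
        Map<String, Rec> latest = new LinkedHashMap<>();
        for (Rec r : log) {
            latest.remove(r.key);      // re-insert so iteration order follows the newest offset
            latest.put(r.key, r);
        }
        List<Rec> out = new ArrayList<>();
        for (Rec r : latest.values()) {
            if (r.value != null) out.add(r);   // a key whose newest record is a tombstone disappears
        }
        return out;
    }

    /** The ten records from the diagram above. */
    static List<Rec> diagramLog() {
        String[] keys = {"A", "B", "A", "C", "B", "A", "C", "A", "B", "C"};
        String[] vals = {"v1", "v1", "v2", "v1", "v2", "v3", "v2", "v4", "v3", "v3"};
        List<Rec> log = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) log.add(new Rec(i, keys[i], vals[i]));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(compact(diagramLog()));   // [7:A=v4, 8:B=v3, 9:C=v3]
    }
}
```

Appending a record with key B and a null value before compacting would leave only offsets 7 and 9, which is the tombstone behavior listed in the diagram.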
Compaction configuration: compaction is enabled per topic by setting cleanup.policy=compact.
Spring Kafka topic configuration:

⚠️ Common Mistakes

Mistake 1: Over-allocating JVM Heap

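A rough way to see the cost of an oversized heap is to estimate how many seconds of freshly written data the page cache can hold. The sketch below is back-of-the-envelope arithmetic only; the 2 GiB OS reserve and the 100 MiB/s ingress rate are assumed figures, and `PageCacheBudget` is a hypothetical helper, not a Kafka tool.

```java
class PageCacheBudget {
    /** RAM left for the page cache after the JVM heap and a fixed OS/off-heap reserve. */
    static long pageCacheBytes(long totalRam, long jvmHeap, long osReserve) {
        return Math.max(0, totalRam - jvmHeap - osReserve);
    }

    /** Roughly how many seconds of newly written data fit in the cache at a given write rate. */
    static long cachedSeconds(long pageCacheBytes, long writeBytesPerSec) {
        return pageCacheBytes / writeBytesPerSec;
    }

    public static void main(String[] args) {
        long GiB = 1L << 30, MiB = 1L << 20;
        long total = 32 * GiB;       // broker RAM
        long reserve = 2 * GiB;      // assumed OS and off-heap overhead
        long ingress = 100 * MiB;    // assumed write rate per broker
        for (long heapGiB : new long[] {6, 24}) {
            long cache = pageCacheBytes(total, heapGiB * GiB, reserve);
            System.out.printf("heap=%dGiB cache=%dGiB window=%ds%n",
                    heapGiB, cache / GiB, cachedSeconds(cache, ingress));
        }
        // heap=6GiB  cache=24GiB window=245s
        // heap=24GiB cache=6GiB  window=61s
    }
}
```

Under these assumptions, a consumer that lags by more than about a minute starts reading from disk on the 24 GiB-heap broker, while the 6 GiB-heap broker tolerates roughly four minutes of lag.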

Mistake 2: Ignoring Disk I/O Metrics


Mistake 3: Wrong Segment Size


Mistake 4: Running Other Apps on Kafka Servers


Mistake 5: Misunderstanding Flush Behavior

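The difference between "written" and "flushed" can be felt with plain Java file I/O. The sketch below is not broker code; `FlushSketch` and its methods are hypothetical names. It appends records either into the page cache only, or with an fsync after every record, which is roughly what forcing a flush per message would cost.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class FlushSketch {
    /** Appends a record; returns once the bytes are handed to the OS (page cache), without fsync. */
    static void append(FileChannel log, String record) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8));
        while (buf.hasRemaining()) log.write(buf);
    }

    /** Appends `count` records, optionally fsyncing after each one; returns elapsed nanoseconds. */
    static long writeAll(Path file, int count, boolean fsyncEach) throws IOException {
        long start = System.nanoTime();
        try (FileChannel log = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < count; i++) {
                append(log, "record-" + i + "\n");
                if (fsyncEach) log.force(false);   // block until the data reaches the storage device
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path cached = Files.createTempFile("cache-only", ".log");
        Path synced = Files.createTempFile("fsync-each", ".log");
        System.out.println("page cache only:  " + writeAll(cached, 1000, false) / 1_000_000 + " ms");
        System.out.println("fsync per record: " + writeAll(synced, 1000, true) / 1_000_000 + " ms");
        Files.delete(cached);
        Files.delete(synced);
    }
}
```

Timings depend heavily on the disk and filesystem, so no expected numbers are given. The per-record fsync run is typically much slower, which is why Kafka leaves flushing to the OS and relies on replication for durability.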

🐛 Debug This

Your Kafka cluster shows these symptoms:

Consumer lag: Increasing slowly
Producer latency: p50=2ms, p99=500ms (should be <10ms)
Disk I/O: Read throughput spiking periodically
Memory: 32GB total, 24GB page cache, 6GB JVM heap

Consumers are keeping up (lag not exploding), but producer p99 latency is terrible. What's happening?

Analysis:
  1. Consumer lag increasing slowly: Consumers slightly behind but not badly
  2. p50=2ms, p99=500ms: Median is fine, but tail latency is awful
  3. Disk reads spiking: Something is evicting page cache periodically
  4. Memory looks OK: 24GB cache should be plenty
The clue: Disk read spikes are periodic, not constant.
Most likely cause: Consumer falling behind and requesting old data

Here's what's happening:

  1. Normal consumers read from page cache (fast)
  2. One slow consumer falls behind
  3. Slow consumer requests old data not in cache
  4. Old data must be read from disk
  5. Disk reads evict recent data from cache
  6. Reads of recent data (tail consumers, follower replication) now miss the cache and hit the disk too
  7. Producer p99 spikes during these disk reads
Investigation:
Fix:
  1. Find and fix the slow consumer
  2. Or use quotas to limit its fetch rate
  3. Or increase partition count to distribute load
Key insight: One slow consumer can affect all producers by thrashing the page cache.
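The thrashing sequence above can be imitated with a toy cache. This is a deliberately simplified model: it assumes strict LRU eviction and a 100-page cache, whereas the real Linux page cache uses a more elaborate policy. `CacheThrashSketch` and `LruPages` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CacheThrashSketch {
    /** Tiny LRU cache of page numbers; a stand-in for the OS page cache. */
    static final class LruPages extends LinkedHashMap<Long, Boolean> {
        private final int capacity;
        LruPages(int capacity) {
            super(16, 0.75f, true);   // access-order mode gives LRU iteration order
            this.capacity = capacity;
        }
        @Override protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
            return size() > capacity;
        }
        /** Returns true on a cache hit; on a miss the page is loaded "from disk" into the cache. */
        boolean read(long page) {
            if (get(page) != null) return true;
            put(page, Boolean.TRUE);
            return false;
        }
    }

    /** Counts misses when readers at the tail of the log re-read the newest pages. */
    static int tailMisses(LruPages cache, long firstHot, long lastHot) {
        int misses = 0;
        for (long p = firstHot; p <= lastHot; p++) {
            if (!cache.read(p)) misses++;
        }
        return misses;
    }

    public static void main(String[] args) {
        LruPages cache = new LruPages(100);
        for (long p = 900; p < 1000; p++) cache.read(p);   // recent writes warm the cache
        System.out.println("misses before scan: " + tailMisses(cache, 900, 999));   // 0
        for (long p = 0; p < 100; p++) cache.read(p);      // lagging consumer scans old pages
        System.out.println("misses after scan:  " + tailMisses(cache, 900, 999));   // 100
    }
}
```

One pass over old data by a single lagging reader is enough to turn every tail read into a miss in this model, which mirrors the periodic disk-read spikes in the symptoms.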

💻 Exercises

Exercise 1: Examine Segment Files


Exercise 2: Monitor Page Cache


Exercise 3: Measure Zero-Copy Impact

Compare sending a large file via traditional read/write vs sendfile:

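One possible starting point for this comparison, using only the JDK: the sketch copies a file once through a user-space buffer and once with `FileChannel.transferTo()`. It copies file to file to stay self-contained, whereas Kafka's case is file to socket, and whether the JDK actually uses `sendfile()` underneath depends on the OS and JVM. `ZeroCopySketch` is a hypothetical name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ZeroCopySketch {
    /** Traditional path: bytes are copied into a user-space buffer and written back out. */
    static long copyViaBuffer(Path src, Path dst) throws IOException {
        long total = 0;
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = Files.newOutputStream(dst)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
        }
        return total;
    }

    /** transferTo path: the copy is delegated to the OS where the platform supports it. */
    static long copyViaTransferTo(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = 0, size = in.size();
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);   // may move fewer bytes than requested
            }
            return pos;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("segment", ".log");
        Files.write(src, new byte[64 * 1024 * 1024]);         // 64 MiB of zeros as test data
        Path a = Files.createTempFile("buffer", ".out");
        Path b = Files.createTempFile("transfer", ".out");
        long t0 = System.nanoTime();
        copyViaBuffer(src, a);
        long t1 = System.nanoTime();
        copyViaTransferTo(src, b);
        long t2 = System.nanoTime();
        System.out.printf("buffer: %d ms, transferTo: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        Files.delete(src);
        Files.delete(a);
        Files.delete(b);
    }
}
```

For a fairer measurement, run each copy several times and discard the first run, since the source file sits in the page cache after it is first read.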

Exercise 4: Explore Record Format


Exercise 5: Configure Compaction

Create a compacted topic and observe behavior:


🎀 Interview Questions

Q1: Why is Kafka faster than traditional message brokers like RabbitMQ?

Answer: Kafka's performance comes from several architectural choices:
  1. Append-only log: All writes are sequential appends, O(1) regardless of data size. No random I/O for writes.
  2. Zero-copy transfer: Data goes directly from page cache to network socket via sendfile(). Never enters JVM heap, no serialization/deserialization by broker.
  3. Page cache reliance: Kafka leverages OS page cache instead of in-process caching. Recent data stays in RAM automatically.
  4. Batching everywhere: Records batched at producer, transferred in batches, written in batches. Amortizes overhead.
  5. Sequential reads: Consumers read in order, which is optimal for disk and cache.
  6. No per-message index: Traditional brokers track each message for deletion. Kafka just appends and deletes whole segments.

The combination means Kafka can handle millions of messages/second with modest hardware.


Q2: Explain the role of page cache in Kafka's architecture.

Answer: Page cache is OS-managed memory that caches disk data. Kafka relies on it heavily:
Write path:
  • Producer sends message → Broker writes to page cache → Acknowledgment sent
  • Actual disk write happens asynchronously by OS
  • This is safe because Kafka has replication for durability
Read path:
  • Consumer requests data → Check page cache
  • If hit: Zero-copy from cache to socket (microseconds)
  • If miss: Read from disk into cache, then send (milliseconds)
Why it's brilliant:
  1. No duplicate caching: JVM heap cache would duplicate page cache
  2. Survives restarts: Page cache persists across broker (JVM) restarts, though not across machine reboots
  3. Automatic management: OS handles eviction based on access patterns
  4. Works with zero-copy: sendfile() transfers directly from cache
Key implication: Kafka performance depends on page cache availability. Competition for memory (large heap, other processes) hurts Kafka.

Q3: What are segment files and why does Kafka use them?

Answer: Segments are the physical files that store partition data:
partition-0/
├── 00000000000000000000.log    (messages offset 0-1000)
├── 00000000000000001001.log    (messages offset 1001-2000)
└── 00000000000000002001.log    (active segment)
Why segments?
  1. Efficient deletion: When retention expires, delete entire segment file. No need to rewrite data.
  2. Efficient compaction: Process one segment at a time. Completed segments are immutable.
  3. Parallel operations: Can read old segments while writing to active segment.
  4. Memory mapping: Kafka memory-maps each segment's index files; bounded segments keep those mappings small and easy to manage.
  5. Recovery: On restart, only need to validate active segment. Old segments are known-good.
Segment boundaries:
  • Roll to new segment when size exceeds log.segment.bytes (default 1GB; topic-level segment.bytes)
  • Or when segment age exceeds log.roll.ms / log.roll.hours (default 168 hours, i.e. 7 days; topic-level segment.ms)
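Because each file is named after the first offset it contains, finding the segment for a given offset is a "greatest base offset not above the target" lookup. A minimal sketch, assuming a sorted map keyed by base offset (`SegmentLookup` and its methods are hypothetical names, not Kafka internals):

```java
import java.util.Map;
import java.util.TreeMap;

class SegmentLookup {
    /** Segment files are named after their base offset, zero-padded to 20 digits. */
    static String fileName(long baseOffset) {
        return String.format("%020d.log", baseOffset);
    }

    /** Returns the file holding `offset`, or null if the offset is below the first segment. */
    static String segmentFor(TreeMap<Long, String> segments, long offset) {
        Map.Entry<Long, String> entry = segments.floorEntry(offset);   // greatest base offset <= offset
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        TreeMap<Long, String> segments = new TreeMap<>();
        for (long base : new long[] {0L, 1001L, 2001L}) {
            segments.put(base, fileName(base));
        }
        System.out.println(segmentFor(segments, 1500));   // 00000000000000001001.log
        System.out.println(segmentFor(segments, 2001));   // 00000000000000002001.log
    }
}
```

The same ordering is what makes retention cheap: expired data always sits in the lowest-numbered files, so whole files can be deleted from the front of the map.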

Q4: What is log compaction and when would you use it?

Answer: Log compaction retains only the latest value for each key:
Before: key=A:v1, key=B:v1, key=A:v2, key=B:v2, key=A:v3
After:  key=A:v3, key=B:v2
Use cases:
  1. Changelog topics: Kafka Streams stores state as compacted topics. On restart, replays latest state per key.
  2. CDC (Change Data Capture): Database changes streamed to Kafka. Keep current row state, not full history.
  3. Configuration data: Store current config per key. Consumers get latest values.
  4. User profiles: Keep latest profile state without unbounded growth.
Key properties:
  • Offsets preserved (not renumbered)
  • Tombstones (null value) delete keys
  • Only applied to closed segments (active segment untouched)
  • cleanup.policy=compact enables it
Not suitable for: Event logs where you need full history (orders, transactions).

Q5: How does zero-copy transfer work and why is it important for Kafka?

Answer: Zero-copy uses sendfile() syscall to transfer data without CPU copying:
Traditional transfer (4 copies):
  1. Disk → Kernel buffer (DMA)
  2. Kernel buffer → User buffer (CPU copy)
  3. User buffer → Socket buffer (CPU copy)
  4. Socket buffer → NIC (DMA)
Zero-copy (2 copies):
  1. Disk → Kernel buffer (DMA)
  2. Kernel buffer → NIC (DMA via scatter/gather)
Why it matters for Kafka:
  1. No CPU copies: The payload is moved by DMA, so the CPU only issues the syscall instead of copying bytes
  2. No JVM involvement: Data bypasses JVM heap, no GC pressure
  3. No serialization: Broker doesn't parse messages, just moves bytes
  4. Higher throughput: Limited by NIC speed, not CPU
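Implementation: Java's FileChannel.transferTo() uses the native sendfile() on Linux.
Requirement: Only works when data on disk is already in the format sent over the network, which is why Kafka stores messages in the same format they're sent over the network. It also does not apply to TLS connections, where the broker must encrypt the bytes in user space.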

📝 Summary & Key Takeaways

Core Concepts

  1. Log abstraction: Append-only, ordered, immutable sequence
  2. Segments: Partition split into files for efficient management
  3. Zero-copy: sendfile() moves data from page cache to socket without user-space copies
  4. Page cache: OS-managed RAM cache for disk data
  5. Record batch: Multiple messages grouped (and optionally compressed) together

Performance Principles

Principle | Implementation
Sequential I/O | Append-only writes, ordered reads
Batching | Record batches reduce I/O calls
No parsing | Broker moves bytes, doesn't understand content
OS leverage | Page cache instead of JVM heap
Zero-copy | sendfile() for network transfer

Key Configurations


📋 Quick Reference


📅 Review Schedule

  • Day 1: Read and understand log abstraction
  • Day 3: Explore segment files on disk
  • Day 7: Monitor page cache in production
  • Day 14: Review zero-copy and performance implications
  • Day 30: Explain architecture without notes

📚 Series Navigation