Architecture & Storage Engine
π At a Glance
| Aspect | Details |
|---|---|
| Difficulty | π‘ Intermediate |
| Prerequisites | Basic Kafka concepts (topics, partitions) |
| Key Concepts | Log, segments, zero-copy, page cache, message format |
| Time Investment | 32 minutes read + 45 minutes practice |
| Payoff | Understand why Kafka handles millions of messages/sec |
π― What You'll Learn
After this article, you'll be able to:
- Explain the log abstraction and why it's perfect for messaging
- Understand segment files and how Kafka stores data on disk
- Describe zero-copy transfer and why it eliminates overhead
- Leverage the page cache for performance tuning
- Parse message format including headers, keys, and timestamps
π₯ Production Story: The Page Cache Mystery
Before monitoring: Producer latency p99 = 5ms After monitoring: Producer latency p99 = 150ms (30x worse!)
Throughput dropped from 500K to 50K msg/sec. No code changes to Kafka. Same hardware.
BASH(8 lines)CodeLoading syntax highlighter...
Read I/O jumped 50x! But Kafka consumers hadn't changed.
Normally, Kafka serves reads directly from page cache (RAM). When data was evicted, Kafka had to read from diskβturning a memory operation into a disk operation.
BASH(3 lines)CodeLoading syntax highlighter...
π§ Mental Model: Kafka's Storage Architecture
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β KAFKA STORAGE ARCHITECTURE β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β PRODUCER β β β β β βΌ β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β KAFKA BROKER β β β β β β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β β PAGE CACHE (RAM) β β β β β β βββββββββββ βββββββββββ βββββββββββ β β β β β β β Recent β β Recent β β Recent β β β β β β β β Segment β β Segment β β Segment β β β β β β β β P0 β β P1 β β P2 β β β β β β β βββββββββββ βββββββββββ βββββββββββ β β β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β β β β β β Zero-Copy β β β β sendfile() β β β β β β β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β β DISK β β β β β β β β β β β β Topic: orders β β β β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β β β β β Partition 0 β β β β β β β β ββββββββββ ββββββββββ ββββββββββ ββββββββββ β β β β β β β β βSegment β βSegment β βSegment β βActive β β β β β β β β β β .log β β .log β β .log β βSegment β β β β β β β β β β0-1000 β β1001-2000β2001-3000β β 3001+ β β β β β β β β β ββββββββββ ββββββββββ ββββββββββ ββββββββββ β β β β β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β β Zero-Copy β β β β β βΌ β β CONSUMER β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ Key Insight: Data flows from Producer β Page Cache β Disk Consumer reads: Page Cache (fast) or Disk (slow) Recent data is almost always in page cache!
π¬ Deep Dive
1. The Log Abstraction
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β KAFKA LOG β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β Offset: 0 1 2 3 4 5 6 ... β β ββββββ ββββββ ββββββ ββββββ ββββββ ββββββ ββββββ β β β M0 β β M1 β β M2 β β M3 β β M4 β β M5 β β M6 β β β β ββββββ ββββββ ββββββ ββββββ ββββββ ββββββ ββββββ β β β β Properties: β β β’ Append-only (writes go to the end) β β β’ Ordered (offset determines order) β β β’ Immutable (existing records never change) β β β’ Persistent (survives restarts) β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
- Sequential writes: Appending to the end is O(1), regardless of data size
- Sequential reads: Consumers read in order, which is disk-friendly
- Simplicity: No complex data structures, just files
- Durability: Once written, data stays until explicitly deleted
| Aspect | Traditional (e.g., RabbitMQ) | Kafka |
|---|---|---|
| Data structure | Queue (FIFO, delete on read) | Log (append-only) |
| Message delivery | Push to consumer | Consumer pulls |
| Message retention | Until acknowledged | Time or size based |
| Random access | No | Yes (by offset) |
| Replay capability | No | Yes |
2. Segments: How Data Lives on Disk
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β PARTITION SEGMENTS β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β /var/kafka/data/orders-0/ β β β β β βββ 00000000000000000000.log (offsets 0-999) β β βββ 00000000000000000000.index (offset β position) β β βββ 00000000000000000000.timeindex (timestamp β offset) β β β β β βββ 00000000000000001000.log (offsets 1000-1999) β β βββ 00000000000000001000.index β β βββ 00000000000000001000.timeindex β β β β β βββ 00000000000000002000.log (offsets 2000-2999) β β βββ 00000000000000002000.index β β βββ 00000000000000002000.timeindex β β β β β βββ 00000000000000003000.log β ACTIVE SEGMENT β β 00000000000000003000.index (writes go here) β β 00000000000000003000.timeindex β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
- Efficient deletion: Delete whole segment files, not individual messages
- Efficient compaction: Process segment by segment
- Parallel I/O: Different segments can be read simultaneously
- Memory mapping: Smaller files are easier to mmap
PROPERTIES(7 lines)CodeLoading syntax highlighter...
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β INDEX FILE STRUCTURE β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β .index file (offset β file position) β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β Offset β Position in .log file β β β ββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β 0 β 0 β β β β 50 β 4096 β β β β 100 β 8192 β β β β 150 β 12288 β β β ββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββ β β β β Index is sparse! Not every offset is indexed. β β To find offset 75: β β 1. Binary search index β find entry β€ 75 (offset 50) β β 2. Seek to position 4096 in .log β β 3. Scan forward to find offset 75 β β β β .timeindex file (timestamp β offset) β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β Timestamp β Offset β β β ββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββ€ β β β 1705312800000 β 0 β β β β 1705312860000 β 1000 β β β β 1705312920000 β 2000 β β β ββββββββββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββ β β β β Used for: offsetsForTimes() API β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
3. Zero-Copy Transfer
This is one of Kafka's biggest performance secrets.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β TRADITIONAL DATA TRANSFER β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β DISK β β β β β β 1. read() - DMA copy to kernel buffer β β βΌ β β KERNEL BUFFER (Page Cache) β β β β β β 2. CPU copy to application buffer β β βΌ β β APPLICATION BUFFER (JVM Heap) β β β β β β 3. CPU copy to socket buffer β β βΌ β β SOCKET BUFFER (Kernel) β β β β β β 4. DMA copy to NIC β β βΌ β β NETWORK β β β β Total: 4 copies, 2 kernel-user context switches β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β ZERO-COPY TRANSFER β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β DISK β β β β β β 1. DMA copy to kernel buffer β β βΌ β β KERNEL BUFFER (Page Cache) β β β β β β 2. sendfile() - DMA scatter/gather to NIC β β βΌ β β NETWORK β β β β Total: 2 copies (both DMA, no CPU involved) β β 0 kernel-user context switches β β Data never enters JVM heap! β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
JAVA(6 lines)CodeLoading syntax highlighter...
| Metric | Traditional | Zero-Copy |
|---|---|---|
| CPU usage | High | Minimal |
| Memory copies | 4 | 2 |
| Context switches | 2 | 0 |
| Throughput | Limited by CPU | Limited by NIC |
4. The Page Cache: Kafka's Secret Weapon
The page cache is OS-managed memory that caches disk data:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β PAGE CACHE OPERATION β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β WRITE PATH (Producer β Broker) β β β β Producer sends message β β β β β βΌ β β Broker writes to page cache β NOT directly to disk! β β β β β βββ Acknowledgment sent to producer β β β (if acks=1 or acks=all with ISR) β β β β β βΌ β β OS flushes to disk asynchronously (later) β β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β READ PATH (Broker β Consumer) β β β β Case 1: Data in page cache (FAST) β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β Consumer requests offset 1000 β β β β β β β β β βΌ β β β β Check page cache β HIT! (microseconds) β β β β β β β β β βΌ β β β β Zero-copy to socket β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β Case 2: Data not in page cache (SLOW) β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β Consumer requests offset 1000 β β β β β β β β β βΌ β β β β Check page cache β MISS β β β β β β β β β βΌ β β β β Read from disk (milliseconds) β β β β β β β β β βΌ β β β β Load into page cache β β β β β β β β β βΌ β β β β Send to consumer β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
- Hot data stays in RAM: Recent messages (most commonly read) stay cached
- No JVM heap pressure: Page cache is outside JVM, no GC impact
- Survives restarts: JVM restart doesn't lose cached data (OS manages it)
- Automatic management: OS handles eviction, no tuning needed
BASH(14 lines)CodeLoading syntax highlighter...
BASH(14 lines)CodeLoading syntax highlighter...
5. Message Format (Record Batch)
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β RECORD BATCH FORMAT β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β Record Batch (v2, Kafka 0.11+) β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β BATCH HEADER (61 bytes) β β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β baseOffset (8 bytes) - First offset ββ β β β β batchLength (4 bytes) - Size of batch ββ β β β β partitionLeaderEpoch (4 bytes) - Leader version ββ β β β β magic (1 byte) - Format version (2) ββ β β β β crc (4 bytes) - Checksum ββ β β β β attributes (2 bytes) - Compression, etc ββ β β β β lastOffsetDelta (4 bytes) - Offset range ββ β β β β firstTimestamp (8 bytes) - Batch start time ββ β β β β maxTimestamp (8 bytes) - Batch end time ββ β β β β producerId (8 bytes) - For idempotence ββ β β β β producerEpoch (2 bytes) - Producer version ββ β β β β baseSequence (4 bytes) - For ordering ββ β β β β recordCount (4 bytes) - Number of records ββ β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β β β β RECORDS (variable length, compressed together) β β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β Record 0 ββ β β β β βββ length (varint) ββ β β β β βββ attributes (1 byte) ββ β β β β βββ timestampDelta (varint) ββ β β β β βββ offsetDelta (varint) ββ β β β β βββ keyLength (varint) ββ β β β β βββ key (bytes) ββ β β β β βββ valueLength (varint) ββ β β β β βββ value (bytes) ββ β β β β βββ headersCount (varint) ββ β β β β βββ headers[] (key-value pairs) ββ β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββ-β€β β β β β Record 1 ββ β β β β ... ββ β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
- Batching: Multiple records in one batch = fewer I/O operations
- Compression: Applied to whole batch, not individual records
- Varints: Variable-length integers save space
- Headers: Key-value metadata (tracing, routing, etc.)
- Idempotence fields: producerId, producerEpoch, baseSequence
JAVA(38 lines)CodeLoading syntax highlighter...
6. Log Compaction
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β LOG COMPACTION β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β BEFORE COMPACTION (all records kept) β β β β Offset: 0 1 2 3 4 5 6 7 8 9 β β Key: A B A C B A C A B C β β Value: v1 v1 v2 v1 v2 v3 v2 v4 v3 v3 β β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β β AFTER COMPACTION (only latest per key) β β β β Offset: 7 8 9 β β Key: A B C β β Value: v4 v3 v3 β β β β β’ Keeps latest value for each key β β β’ Offsets preserved (not renumbered) β β β’ Tombstones (null value) delete keys β β β β Use cases: β β β’ Changelog topics (Kafka Streams state) β β β’ Database CDC (keep current state) β β β’ Configuration/reference data β β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
PROPERTIES(11 lines)CodeLoading syntax highlighter...
JAVA(13 lines)CodeLoading syntax highlighter...
β οΈ Common Mistakes
Mistake 1: Over-allocating JVM Heap
BASH(11 lines)CodeLoading syntax highlighter...
Mistake 2: Ignoring Disk I/O Metrics
BASH(8 lines)CodeLoading syntax highlighter...
Mistake 3: Wrong Segment Size
PROPERTIES(12 lines)CodeLoading syntax highlighter...
Mistake 4: Running Other Apps on Kafka Servers
BASH(5 lines)CodeLoading syntax highlighter...
Mistake 5: Misunderstanding Flush Behavior
PROPERTIES(8 lines)CodeLoading syntax highlighter...
π Debug This
Your Kafka cluster shows these symptoms:
Consumer lag: Increasing slowly Producer latency: p50=2ms, p99=500ms (should be <10ms) Disk I/O: Read throughput spiking periodically Memory: 32GB total, 24GB page cache, 6GB JVM heap
Consumers are keeping up (lag not exploding), but producer p99 latency is terrible. What's happening?
Click to reveal analysis
- Consumer lag increasing slowly: Consumers slightly behind but not badly
- p50=2ms, p99=500ms: Median is fine, but tail latency is awful
- Disk reads spiking: Something is evicting page cache periodically
- Memory looks OK: 24GB cache should be plenty
Here's what's happening:
- Normal consumers read from page cache (fast)
- One slow consumer falls behind
- Slow consumer requests old data not in cache
- Old data must be read from disk
- Disk reads evict recent data from cache
- Now even current producers' writes miss cache
- Producer p99 spikes during these disk reads
BASH(7 lines)CodeLoading syntax highlighter...
- Find and fix the slow consumer
- Or use quotas to limit its fetch rate
- Or increase partition count to distribute load
π» Exercises
Exercise 1: Examine Segment Files
BASH(7 lines)CodeLoading syntax highlighter...
Exercise 2: Monitor Page Cache
BASH(7 lines)CodeLoading syntax highlighter...
Exercise 3: Measure Zero-Copy Impact
Compare sending a large file via traditional read/write vs sendfile:
JAVA(8 lines)CodeLoading syntax highlighter...
Exercise 4: Explore Record Format
BASH(8 lines)CodeLoading syntax highlighter...
Exercise 5: Configure Compaction
Create a compacted topic and observe behavior:
JAVA(12 lines)CodeLoading syntax highlighter...
π€ Interview Questions
Q1: Why is Kafka faster than traditional message brokers like RabbitMQ?
-
Append-only log: All writes are sequential appends, O(1) regardless of data size. No random I/O for writes.
-
Zero-copy transfer: Data goes directly from page cache to network socket via
sendfile(). Never enters JVM heap, no serialization/deserialization by broker. -
Page cache reliance: Kafka leverages OS page cache instead of in-process caching. Recent data stays in RAM automatically.
-
Batching everywhere: Records batched at producer, transferred in batches, written in batches. Amortizes overhead.
-
Sequential reads: Consumers read in order, which is optimal for disk and cache.
-
No per-message index: Traditional brokers track each message for deletion. Kafka just appends and deletes whole segments.
The combination means Kafka can handle millions of messages/second with modest hardware.
Q2: Explain the role of page cache in Kafka's architecture.
- Producer sends message β Broker writes to page cache β Acknowledgment sent
- Actual disk write happens asynchronously by OS
- This is safe because Kafka has replication for durability
- Consumer requests data β Check page cache
- If hit: Zero-copy from cache to socket (microseconds)
- If miss: Read from disk into cache, then send (milliseconds)
- No duplicate caching: JVM heap cache would duplicate page cache
- Survives restarts: Page cache persists across JVM restarts
- Automatic management: OS handles eviction based on access patterns
- Works with zero-copy:
sendfile()transfers directly from cache
Q3: What are segment files and why does Kafka use them?
partition-0/ βββ 00000000000000000000.log (messages offset 0-1000) βββ 00000000000000001001.log (messages offset 1001-2000) βββ 00000000000000002001.log (active segment)
-
Efficient deletion: When retention expires, delete entire segment file. No need to rewrite data.
-
Efficient compaction: Process one segment at a time. Completed segments are immutable.
-
Parallel operations: Can read old segments while writing to active segment.
-
Memory mapping: Smaller files are easier to mmap and manage in page cache.
-
Recovery: On restart, only need to validate active segment. Old segments are known-good.
- Roll to new segment when size exceeds
log.segment.bytes(default 1GB) - Or time exceeds
log.segment.ms(default 7 days)
Q4: What is log compaction and when would you use it?
Before: key=A:v1, key=B:v1, key=A:v2, key=B:v2, key=A:v3 After: key=A:v3, key=B:v2
-
Changelog topics: Kafka Streams stores state as compacted topics. On restart, replays latest state per key.
-
CDC (Change Data Capture): Database changes streamed to Kafka. Keep current row state, not full history.
-
Configuration data: Store current config per key. Consumers get latest values.
-
User profiles: Keep latest profile state without unbounded growth.
- Offsets preserved (not renumbered)
- Tombstones (null value) delete keys
- Only applied to closed segments (active segment untouched)
cleanup.policy=compactenables it
Q5: How does zero-copy transfer work and why is it important for Kafka?
sendfile() syscall to transfer data without CPU copying:- Disk β Kernel buffer (DMA)
- Kernel buffer β User buffer (CPU copy)
- User buffer β Socket buffer (CPU copy)
- Socket buffer β NIC (DMA)
- Disk β Kernel buffer (DMA)
- Kernel buffer β NIC (DMA via scatter/gather)
-
No CPU overhead: Data never touches CPU, freeing it for other work
-
No JVM involvement: Data bypasses JVM heap, no GC pressure
-
No serialization: Broker doesn't parse messages, just moves bytes
-
Higher throughput: Limited by NIC speed, not CPU
FileChannel.transferTo() maps to native sendfile().π Summary & Key Takeaways
Core Concepts
- Log abstraction: Append-only, ordered, immutable sequence
- Segments: Partition split into files for efficient management
- Zero-copy:
sendfile()bypasses CPU for data transfer - Page cache: OS-managed RAM cache for disk data
- Record batch: Multiple messages compressed together
Performance Principles
| Principle | Implementation |
|---|---|
| Sequential I/O | Append-only writes, ordered reads |
| Batching | Record batches reduce I/O calls |
| No parsing | Broker moves bytes, doesn't understand content |
| OS leverage | Page cache instead of JVM heap |
| Zero-copy | sendfile() for network transfer |
Key Configurations
PROPERTIES(10 lines)CodeLoading syntax highlighter...
π Quick Reference
BASH(15 lines)CodeLoading syntax highlighter...
π Review Schedule
- Day 1: Read and understand log abstraction
- Day 3: Explore segment files on disk
- Day 7: Monitor page cache during production
- Day 14: Review zero-copy and performance implications
- Day 30: Explain architecture without notes
π Series Navigation
- Previous: Part 0 - How to Use This Series
- Next: Part 2 - Partitions & Replication
- Index: Kafka Compendium Series