Image Anatomy: Layers, Manifests & Registry

When you docker pull an image, it downloads in layers. But what are layers, really? How does Docker know which parts to download? This article dissects Docker images, from the overlay filesystem to multi-architecture manifests to content-addressable storage.

📋 At a Glance

Aspect | Details
Topic | Image layers, content-addressable storage, registries, multi-arch
Complexity | Advanced
Prerequisites | Part 1 (Container Internals), basic Docker usage
Key Insight | Images are just tarballs with metadata, layers are deduplicated filesystem diffs
Time to Master | 3-4 hours

🎯 What You'll Learn

  • Layer mechanics - how images are built from filesystem diffs
  • Content-addressable storage - why layer hashes matter
  • Image manifests - the metadata that ties layers together
  • Multi-architecture images - how one tag serves ARM and AMD64
  • Registry protocol - what happens during push/pull

🔥 Production Story: 50GB of Dangling Layers

A CI server's disk filled up every week. The team blamed "too many builds" and added cron jobs to clean up. But the real problem was worse.

Investigation revealed:
  • 847 dangling images consuming 47GB
  • Each build pulled the base image and built, but never cleaned up intermediates
  • Layer deduplication wasn't working: the same base layers were stored multiple times
Root cause: They were using docker build without --rm and never running docker image prune. But deeper: they didn't understand that every filesystem-changing instruction (RUN, COPY, ADD) creates a layer, and that layers persist until explicitly removed.
The fix:
[Bash code block (8 lines) not preserved in this copy]
What they learned: Image storage isn't magic. Every layer is real data on disk. Understanding layer structure means understanding disk usage.

🧠 Mental Model: Images as Layer Stacks

┌─────────────────────────────────────────────────────────────────┐
│                        IMAGE STRUCTURE                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Image: nginx:1.25                                              │
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ Manifest (JSON)                                            │ │
│  │ - Config blob reference                                    │ │
│  │ - Layer references (in order)                              │ │
│  │ - Media types                                              │ │
│  └────────────────────────────────────────────────────────────┘ │
│                           │                                     │
│                           ▼                                     │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ Config (JSON)                                              │ │
│  │ - Environment variables                                    │ │
│  │ - CMD, ENTRYPOINT                                          │ │
│  │ - Exposed ports                                            │ │
│  │ - History (build steps)                                    │ │
│  └────────────────────────────────────────────────────────────┘ │
│                           │                                     │
│                           ▼                                     │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ LAYER 4: sha256:a1b2... (12MB)  ← nginx config             │ │
│  │ Changes: /etc/nginx/nginx.conf, /usr/share/nginx/html/     │ │
│  ├────────────────────────────────────────────────────────────┤ │
│  │ LAYER 3: sha256:c3d4... (25MB)  ← nginx binary             │ │
│  │ Changes: /usr/sbin/nginx, /usr/lib/nginx/                  │ │
│  ├────────────────────────────────────────────────────────────┤ │
│  │ LAYER 2: sha256:e5f6... (45MB)  ← apt packages             │ │
│  │ Changes: /usr/bin/*, /usr/lib/*                            │ │
│  ├────────────────────────────────────────────────────────────┤ │
│  │ LAYER 1: sha256:7890... (80MB)  ← debian:bookworm-slim     │ │
│  │ Changes: /bin/*, /lib/*, /etc/*                            │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                 │
│  Total size: 162MB (but layers shared with other images!)       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Key insight: Layers are filesystem diffs, not complete snapshots. Each layer contains only what changed from the layer below.

🔬 Deep Dive

How Layers Work: Union Filesystem

Docker uses the overlay filesystem (overlayfs) to stack layers:
┌─────────────────────────────────────────────────────────────────┐
│                     CONTAINER FILESYSTEM VIEW                   │
│                                                                 │
│  What container sees: /bin /etc /usr /var /app ...              │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐  Container layer (read-write)                 │
│  │   UPPER      │  - Container's changes go here                │
│  │   (rw)       │  - Created fresh for each container           │
│  └──────────────┘                                               │
│         │                                                       │
│         │ overlay merge                                         │
│         ▼                                                       │
│  ┌──────────────┐                                               │
│  │   LAYER 4    │  Image layers (read-only)                     │
│  │   (ro)       │                                               │
│  ├──────────────┤                                               │
│  │   LAYER 3    │                                               │
│  │   (ro)       │                                               │
│  ├──────────────┤                                               │
│  │   LAYER 2    │                                               │
│  │   (ro)       │                                               │
│  ├──────────────┤                                               │
│  │   LAYER 1    │                                               │
│  │   (ro)       │                                               │
│  └──────────────┘                                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

File resolution (top-down):
1. Check UPPER - if file exists, use it
2. Check LAYER 4 - if file exists, use it
3. Check LAYER 3...
4. Continue down until found

Write operations:
- New file → write to UPPER
- Modify existing → copy to UPPER, modify (copy-on-write)
- Delete → create "whiteout" marker in UPPER
See it in action:
[Bash code block (14 lines) not preserved in this copy]
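The resolution and write rules above can be modeled in a few lines. This is a toy sketch, not how overlayfs is implemented: each layer is a plain Python dict mapping paths to contents, and the names resolve, write, delete and WHITEOUT are hypothetical helpers made up for illustration.

```python
WHITEOUT = object()  # sentinel standing in for an overlayfs whiteout marker


def resolve(path, upper, lowers):
    """Return the content visible at `path`, or None if it is absent.

    `upper` is the writable container layer; `lowers` lists the read-only
    image layers from topmost to bottommost.
    """
    for layer in [upper, *lowers]:
        if path in layer:
            entry = layer[path]
            return None if entry is WHITEOUT else entry
    return None


def write(path, data, upper):
    """New files and modified files both land in the upper layer."""
    upper[path] = data


def delete(path, upper, lowers):
    """Hide a file: mask it with a whiteout if a lower layer still has it."""
    if any(path in layer for layer in lowers):
        upper[path] = WHITEOUT
    else:
        upper.pop(path, None)


layer1 = {"/etc/os-release": "debian", "/bin/sh": "dash"}
layer2 = {"/etc/nginx/nginx.conf": "default config"}
lowers = [layer2, layer1]  # topmost image layer first
upper = {}

write("/etc/nginx/nginx.conf", "tuned config", upper)  # modified copy lives in UPPER
delete("/bin/sh", upper, lowers)                        # whiteout in UPPER

print(resolve("/etc/nginx/nginx.conf", upper, lowers))  # tuned config
print(resolve("/bin/sh", upper, lowers))                # None
print(layer2["/etc/nginx/nginx.conf"])                  # default config
```

The last line is the point of the model: the image layer is never touched, which is why many containers can share the same read-only layers.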

Content-Addressable Storage

Every piece of image data is identified by its SHA256 hash:
[Bash code block (13 lines) not preserved in this copy]
Why content-addressable storage matters:
  1. Deduplication: Same layer content = same hash = store once
  2. Integrity: Downloaded data must match expected hash
  3. Caching: Already have this hash? Don't download again
  4. Immutability: Can't modify a layer without changing its hash
[Bash code block (5 lines) not preserved in this copy]
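A minimal sketch of the idea, assuming an in-memory dict in place of Docker's on-disk blob store. BlobStore and its methods are hypothetical names; only the "sha256:" plus hex digest format mirrors what Docker displays.

```python
import hashlib


class BlobStore:
    """Toy content-addressable store: the SHA-256 digest is the key."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # identical content is stored once
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blobs[digest]
        # Integrity check: re-hash on read, as a client does after a download.
        if "sha256:" + hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"digest mismatch for {digest}")
        return data

    def __len__(self):
        return len(self._blobs)


store = BlobStore()
base_a = store.put(b"debian base layer")
base_b = store.put(b"debian base layer")  # same bytes arriving via another image
app = store.put(b"application layer")

print(base_a == base_b)  # True
print(len(store))        # 2
```

Three put calls leave two blobs in the store, which is deduplication in miniature. Immutability follows from the same property: changing a single byte yields a different digest, so the old address can never point at new content.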

Image Manifest Structure

The manifest ties everything together:

[Bash code block (2 lines) not preserved in this copy]
[JSON code block (21 lines) not preserved in this copy]
Manifest components:
Field | Purpose
config.digest | Points to image config JSON
layers[].digest | SHA256 of compressed layer tarball
layers[].size | Size in bytes (for progress display)
mediaType | Format identifier
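To make the fields in the table concrete, here is a sketch that walks a manifest and reports what a client would need to fetch. The manifest below is illustrative only: the media types are the Docker schema 2 ones, but every digest and size is made up, and placeholder_digest and summarize are hypothetical helpers.

```python
def placeholder_digest(char: str) -> str:
    """Build a syntactically valid but made-up digest (64 hex characters)."""
    return "sha256:" + char * 64


LAYER_TYPE = "application/vnd.docker.image.rootfs.diff.tar.gzip"

manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
    "config": {
        "mediaType": "application/vnd.docker.container.image.v1+json",
        "size": 7023,
        "digest": placeholder_digest("c"),
    },
    "layers": [
        {"mediaType": LAYER_TYPE, "size": 29150000, "digest": placeholder_digest("1")},
        {"mediaType": LAYER_TYPE, "size": 41360000, "digest": placeholder_digest("2")},
        {"mediaType": LAYER_TYPE, "size": 625, "digest": placeholder_digest("3")},
    ],
}


def summarize(manifest: dict) -> dict:
    """Collect the facts a client needs before downloading anything."""
    layers = manifest["layers"]
    return {
        "config": manifest["config"]["digest"],
        "layer_count": len(layers),
        "compressed_bytes": sum(layer["size"] for layer in layers),
        "pull_order": [layer["digest"] for layer in layers],  # bottom layer first
    }


summary = summarize(manifest)
print(summary["layer_count"])       # 3
print(summary["compressed_bytes"])  # 70510625
```

Note that the sizes are compressed sizes, so the total here is what travels over the network, not what the image occupies once unpacked.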

Image Config Deep Dive

[Bash code block (6 lines) not preserved in this copy]
[JSON code block (31 lines) not preserved in this copy]
The history section of the config records the build steps, which is useful for understanding how an image was built.

Multi-Architecture Images (Manifest Lists)

One tag can serve multiple architectures:

[Bash code block (2 lines) not preserved in this copy]
[JSON code block (32 lines) not preserved in this copy]
How it works:
┌─────────────────────────────────────────────────────────────────┐
│                   docker pull nginx:1.25                        │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Registry Response                            │
│                                                                 │
│  Manifest List (nginx:1.25)                                     │
│  ├─ linux/amd64 → sha256:amd64manifest                          │
│  ├─ linux/arm64 → sha256:arm64manifest                          │
│  └─ linux/arm/v7 → sha256:armv7manifest                         │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           │ Client selects based on
                           │ local architecture
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│  On M1 Mac: Pull sha256:arm64manifest                           │
│  On x86 PC: Pull sha256:amd64manifest                           │
└─────────────────────────────────────────────────────────────────┘
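The selection step in the diagram can be sketched as a lookup over the manifest list. This is a simplified model, assuming exact matching on os, architecture and (optionally) variant; real clients apply more nuanced fallback rules. The digests are the placeholders from the diagram, and select_manifest and MACHINE_TO_ARCH are hypothetical names.

```python
import platform

# Map common platform.machine() values to the architecture names used in manifests.
MACHINE_TO_ARCH = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
    "armv7l": "arm",
}

manifest_list = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {"digest": "sha256:amd64manifest",
         "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:arm64manifest",
         "platform": {"os": "linux", "architecture": "arm64"}},
        {"digest": "sha256:armv7manifest",
         "platform": {"os": "linux", "architecture": "arm", "variant": "v7"}},
    ],
}


def select_manifest(index: dict, os_name: str, arch: str, variant=None) -> str:
    """Return the digest of the first entry matching the requested platform."""
    for entry in index["manifests"]:
        plat = entry["platform"]
        if plat["os"] != os_name or plat["architecture"] != arch:
            continue
        if variant is None or plat.get("variant") == variant:
            return entry["digest"]
    raise LookupError(f"no manifest for {os_name}/{arch}")


print(select_manifest(manifest_list, "linux", "arm64"))  # sha256:arm64manifest
print(select_manifest(manifest_list, "linux", "amd64"))  # sha256:amd64manifest

# What this machine would ask for (containers on macOS still run Linux images).
local_arch = MACHINE_TO_ARCH.get(platform.machine().lower())
print(local_arch)
```

The digest returned here is then fetched like any other manifest, so the rest of the pull is identical for single-arch and multi-arch images.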
Creating multi-arch images:
[Bash code block (11 lines) not preserved in this copy]

Registry Protocol (OCI Distribution)

What happens during docker pull:
┌─────────────────────────────────────────────────────────────────┐
│                    docker pull nginx:1.25                       │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 1: Resolve tag to manifest                                 │
│                                                                 │
│ GET /v2/library/nginx/manifests/1.25                            │
│ Accept: application/vnd.docker.distribution.manifest.v2+json    │
│                                                                 │
│ Response: Manifest JSON with config + layer digests             │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 2: Download config blob                                    │
│                                                                 │
│ GET /v2/library/nginx/blobs/sha256:configdigest                 │
│                                                                 │
│ Response: Image config JSON                                     │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 3: Download layers (parallel, if not cached)               │
│                                                                 │
│ For each layer:                                                 │
│   - Check if sha256:layerdigest exists locally                  │
│   - If not: GET /v2/library/nginx/blobs/sha256:layerdigest      │
│   - Verify downloaded hash matches                              │
│   - Decompress and store                                        │
└─────────────────────────────────────────────────────────────────┘
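The three steps can be traced end to end against a fake registry. Everything here is a simulation under stated assumptions: FakeRegistry is a hypothetical in-memory stand-in that maps URL paths to bytes, layers are raw bytes instead of gzipped tarballs, and downloads run sequentially instead of in parallel. Only the URL shapes match the diagram.

```python
import hashlib
import json


def digest_of(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


class FakeRegistry:
    """In-memory stand-in for a registry; get() mimics an HTTP GET by path."""

    def __init__(self):
        self.routes = {}
        self.requests = []  # log of every path requested

    def add_image(self, name, tag, config: bytes, layers: list):
        for blob in [config, *layers]:
            self.routes[f"/v2/{name}/blobs/{digest_of(blob)}"] = blob
        manifest = {
            "config": {"digest": digest_of(config)},
            "layers": [{"digest": digest_of(b), "size": len(b)} for b in layers],
        }
        self.routes[f"/v2/{name}/manifests/{tag}"] = json.dumps(manifest).encode()

    def get(self, path: str) -> bytes:
        self.requests.append(path)
        return self.routes[path]


def pull(registry, name, tag, local_blobs: dict) -> dict:
    # Step 1: resolve the tag to a manifest.
    manifest = json.loads(registry.get(f"/v2/{name}/manifests/{tag}"))
    # Steps 2 and 3: fetch the config, then each layer that is not cached.
    wanted = [manifest["config"]["digest"]]
    wanted += [layer["digest"] for layer in manifest["layers"]]
    for digest in wanted:
        if digest in local_blobs:
            continue  # already present locally: no network request
        data = registry.get(f"/v2/{name}/blobs/{digest}")
        if digest_of(data) != digest:
            raise ValueError(f"digest mismatch for {digest}")
        local_blobs[digest] = data
    return manifest


base_layer = b"debian base layer"
app_layer = b"nginx layer"
registry = FakeRegistry()
registry.add_image("library/nginx", "1.25", b'{"Cmd": ["nginx"]}', [base_layer, app_layer])

cache = {digest_of(base_layer): base_layer}  # base layer cached by an earlier pull
pull(registry, "library/nginx", "1.25", cache)

print(len(registry.requests))  # 3
print(len(cache))              # 3
```

Three requests went out (manifest, config, one layer) even though the image has two layers, because the cached base layer was skipped by digest.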
What happens during docker push:
[Bash code block (5 lines) not preserved in this copy]

Analyzing Image Layers

[Bash code block (16 lines) not preserved in this copy]

Layer Caching Mechanics

Why does rebuild sometimes use cache, sometimes not?

[Dockerfile code block (7 lines) not preserved in this copy]
Cache decision for each instruction:
Instruction | Cache Key
FROM | Base image digest
WORKDIR | Previous layer + instruction
COPY | Previous layer + file content hashes
RUN | Previous layer + command string
CMD | Previous layer + command
Cache invalidation cascade:
If package.json changes:
  ├─ COPY package.json . → MISS (file changed)
  ├─ RUN npm install → MISS (parent changed)
  ├─ COPY . . → MISS (parent changed)
  └─ All subsequent layers rebuild

If only src/index.js changes:
  ├─ COPY package.json . → HIT
  ├─ RUN npm install → HIT
  ├─ COPY . . → MISS (file changed)
  └─ Only this and following rebuild
This is why order matters in Dockerfiles (covered in Part 5).
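The cascade falls out naturally if each cache key is a hash that includes the parent's key. The sketch below is a simplified model of that chaining, not the builder's actual algorithm: the instruction list is a guess at a typical Node.js Dockerfile (node:20 and the CMD are assumptions), FROM is keyed by its text instead of a real base image digest, and cache_keys and first_miss are hypothetical helpers.

```python
import hashlib


def chain_hash(*parts: str) -> str:
    """Hash several strings together, with a separator to avoid ambiguity."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode())
        h.update(b"\0")
    return h.hexdigest()


def cache_keys(steps, files):
    """Key for each step = hash(parent key, instruction, copied file contents)."""
    keys, parent = [], ""
    for instruction, sources in steps:
        contents = [files[path] for path in sorted(sources)]
        parent = chain_hash(parent, instruction, *contents)
        keys.append(parent)
    return keys


def first_miss(old_keys, new_keys):
    """Index of the first step whose key changed, or None if all hit."""
    for i, (old, new) in enumerate(zip(old_keys, new_keys)):
        if old != new:
            return i
    return None


steps = [
    ("FROM node:20", []),
    ("WORKDIR /app", []),
    ("COPY package.json .", ["package.json"]),
    ("RUN npm install", []),
    ("COPY . .", ["package.json", "src/index.js"]),
    ('CMD ["node", "src/index.js"]', []),
]

files = {"package.json": '{"deps": {}}', "src/index.js": "console.log(1)"}
baseline = cache_keys(steps, files)

code_change = dict(files, **{"src/index.js": "console.log(2)"})
deps_change = dict(files, **{"package.json": '{"deps": {"x": "1"}}'})

print(first_miss(baseline, cache_keys(steps, code_change)))  # 4
print(first_miss(baseline, cache_keys(steps, deps_change)))  # 2
```

Index 4 is COPY . . and index 2 is COPY package.json ., matching the two scenarios above. Because every key folds in its parent, a miss at step N changes every key after N as well.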

Where Images Live on Disk

[Bash code block (16 lines) not preserved in this copy]
Cleanup commands:
[Bash code block (12 lines) not preserved in this copy]

⚠️ Common Mistakes

Mistake 1: Not Understanding Layer Accumulation

[Dockerfile code block (17 lines) not preserved in this copy]

Mistake 2: Ignoring Image Provenance

[Bash code block (14 lines) not preserved in this copy]

Mistake 3: Using :latest in Production

[Bash code block (11 lines) not preserved in this copy]

🐛 Debug This: The Missing Layer Mystery

A developer reports: "I pushed my image, then pulled on another machine, but it's different. Some files are missing!"

[Bash code block (13 lines) not preserved in this copy]
Why are files missing on the second machine?

✅ Solution:
The issue is likely layer caching with different base layers.
Scenario 1: Old cached layers

The second machine had an old version of a base layer cached:

[Bash code block (10 lines) not preserved in this copy]
Scenario 2: Build cache leak

The build machine used cached layers from a previous build that weren't pushed:

[Bash code block (7 lines) not preserved in this copy]
Scenario 3: Multi-stage build artifact

Files were in build stage but not copied to final stage:

[Dockerfile code block (9 lines) not preserved in this copy]
Debug steps:
[Bash code block (10 lines) not preserved in this copy]

💻 Exercises

Exercise 1: Dissect an Image

⭐ Difficulty: Easy | ⏱️ Time: 15 minutes

[Bash code block (23 lines) not preserved in this copy]

Exercise 2: Measure Layer Deduplication

⭐⭐ Difficulty: Medium | ⏱️ Time: 20 minutes

[Bash code block (19 lines) not preserved in this copy]

Exercise 3: Build Multi-Architecture Image

⭐⭐ Difficulty: Medium | ⏱️ Time: 25 minutes

[Bash code block (25 lines) not preserved in this copy]

Exercise 4: Registry Protocol Deep Dive

⭐⭐⭐ Difficulty: Hard | ⏱️ Time: 30 minutes

[Bash code block (26 lines) not preserved in this copy]

Exercise 5: Analyze Image Efficiency

⭐⭐⭐⭐ Difficulty: Expert | ⏱️ Time: 30 minutes

[Bash code block (40 lines) not preserved in this copy]

🎤 Senior-Level Interview Questions

Q1: Explain the difference between image ID, digest, and tag.

Strong Answer:

"These are three different ways to reference images:

Tag is a human-readable name like nginx:1.25. It's mutable - pushing a new image with the same tag overwrites the reference. Never use :latest in production because it can change.
Digest is the SHA256 hash of the image manifest, like sha256:abc123.... It's immutable - this exact hash always refers to exactly this image. Format: nginx@sha256:abc123...
Image ID is the SHA256 hash of the image config JSON (not the manifest). It's what you see in docker images. Two images with different tags can have the same ID if they're identical.

In practice:

  • Use tags for development convenience
  • Use digests for production deployments (immutability)
  • Image IDs are mainly for local identification

The relationship: Tag → Manifest (has digest) → Config (has ID) + Layers"
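The three identifiers can be derived by hand to see why they differ. This is a sketch with made-up JSON: real configs and manifests carry many more fields, and the tags dict is a hypothetical stand-in for the registry's tag table. What carries over is that the image ID hashes the config bytes, the digest hashes the manifest bytes, and a tag is just a movable pointer.

```python
import hashlib
import json


def sha256_digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


def build_image(cmd):
    """Return (image_id, manifest_digest) for a tiny made-up image."""
    config = json.dumps({"os": "linux", "config": {"Cmd": cmd}}, sort_keys=True).encode()
    image_id = sha256_digest(config)  # image ID = hash of the config JSON
    manifest = json.dumps(
        {"schemaVersion": 2, "config": {"digest": image_id, "size": len(config)}, "layers": []},
        sort_keys=True,
    ).encode()
    return image_id, sha256_digest(manifest)  # digest = hash of the manifest


image_id, manifest_digest = build_image(["nginx", "-g", "daemon off;"])

# Two tags can point at the same manifest.
tags = {"nginx:1.25": manifest_digest, "nginx:stable": manifest_digest}

# Pushing a rebuilt image under an existing tag silently moves the tag.
new_id, new_digest = build_image(["nginx", "-g", "daemon off;", "-q"])
tags["nginx:1.25"] = new_digest

print(image_id != manifest_digest)              # True
print(tags["nginx:stable"] == manifest_digest)  # True
print(tags["nginx:1.25"] == manifest_digest)    # False
```

After the second push, anyone deploying nginx:1.25 gets different content, while a reference by the original digest still resolves to exactly the original manifest.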

Q2: How does layer caching work and why does COPY order matter?

Strong Answer:

"Layer caching is Docker's optimization to avoid rebuilding unchanged layers.

For each instruction, Docker checks if it can reuse a cached layer:

  • FROM: Cache hit if base image digest matches
  • RUN: Cache hit if parent layer AND command string match
  • COPY/ADD: Cache hit if parent layer AND all source file contents match
The key insight: cache invalidation cascades. If layer N misses cache, all layers N+1, N+2... must rebuild.

This is why COPY order matters:

[Dockerfile code block (8 lines) not preserved in this copy]
In the good version, npm install only reruns if package.json changes. The COPY . . for code changes doesn't affect it because it comes after.

Same principle applies to any slow step: put dependencies before code, rarely-changing before frequently-changing."

Q3: What happens during docker pull at the network level?

Strong Answer:

"Docker pull follows the OCI Distribution protocol:

  1. Resolve tag to manifest: GET to /v2/<name>/manifests/<tag>. Registry returns manifest JSON with config and layer digests.
  2. Check for multi-arch: If it's a manifest list, Docker selects the manifest matching local architecture.
  3. Download config: GET to /v2/<name>/blobs/<config-digest>. This is the image configuration JSON.
  4. Download layers (parallel):
    • For each layer, check local storage for matching digest
    • If missing: GET to /v2/<name>/blobs/<layer-digest>
    • Response is gzipped tarball
    • Verify SHA256 matches expected digest
    • Extract to local storage
  5. Assemble: Create local image metadata linking config to layers.

Key optimizations:

  • Layers download in parallel
  • Already-cached layers skip network entirely
  • Registries support range requests for resumable downloads
  • CDN acceleration for popular images"

Q4: How would you debug an image that's larger than expected?

Strong Answer:

"My debugging process:

  1. Quick size check:
    [Bash code block (2 lines) not preserved in this copy]
  2. Layer analysis with dive:
    [Bash code block not preserved in this copy]

    This shows exactly which files are in each layer and flags wasted space.

  3. Common issues I look for:
    • Build artifacts in final image (node_modules dev deps, .git, test files)
    • Package manager cache not cleaned
    • Multiple RUN statements that could combine
    • Missing .dockerignore
    • Wrong base image (ubuntu vs alpine vs distroless)
  4. Multi-stage check:
    • Are we copying only needed artifacts?
    • Any unnecessary COPY commands?
  5. Concrete fixes:
    • Add .dockerignore for build context
    • Combine RUN commands with cleanup
    • Use smaller base image
    • Multi-stage build for compiled languages
    • Remove dev dependencies

I'd also check if the team has image size targets in CI to prevent regression."
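The "wasted space" idea behind such an analysis can be illustrated with plain tar archives. This is a rough sketch, not how dive works internally: it builds two tiny uncompressed layers in memory, then counts bytes that an earlier layer ships but a later layer overwrites or deletes. It relies on the image layer convention that a deletion is recorded as an empty file named .wh.<name>; it handles file whiteouts only, not whole directories. make_layer and wasted_bytes are hypothetical helpers.

```python
import io
import tarfile


def make_layer(files: dict) -> bytes:
    """Pack {path: content} into an uncompressed tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def wasted_bytes(layers) -> int:
    """Sum bytes that an earlier layer ships but a later layer hides."""
    visible = {}  # path -> size of the copy currently visible
    wasted = 0
    for blob in layers:  # bottom layer first
        with tarfile.open(fileobj=io.BytesIO(blob)) as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue
                directory, _, base = member.name.rpartition("/")
                if base.startswith(".wh."):
                    # Whiteout: the file named after the prefix is deleted.
                    target = f"{directory}/{base[4:]}" if directory else base[4:]
                    wasted += visible.pop(target, 0)
                else:
                    # Overwrite: the older copy still ships in its layer.
                    wasted += visible.get(member.name, 0)
                    visible[member.name] = member.size
    return wasted


layer1 = make_layer({
    "var/cache/apt/pkg.bin": b"x" * 1000,  # package cache left behind
    "app/main.js": b"v1",
})
layer2 = make_layer({
    "var/cache/apt/.wh.pkg.bin": b"",      # a later RUN deleted the cache
    "app/main.js": b"v2!",                 # file rewritten
})

print(wasted_bytes([layer1, layer2]))  # 1002
```

The 1000-byte cache was deleted in the second layer, yet it still travels with the image. That is why cleanup has to happen in the same RUN instruction that created the files.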

Q5: Explain content-addressable storage and why it matters.

Strong Answer:

"Content-addressable storage means every blob is identified by the SHA256 hash of its contents. The hash IS the address.

Why this matters:

  1. Deduplication: If two images share a layer (same content = same hash), it's stored once. A host running 100 containers might only have 10 unique layers on disk.
  2. Integrity: When downloading sha256:abc123, you hash the received data. If it doesn't match, something went wrong (corruption or a man-in-the-middle attack) and the data is rejected.
  3. Immutability: You can't modify a layer without changing its hash. This enables safe caching - if you have sha256:abc123, you know exactly what it contains, forever.
  4. Efficient distribution: Registries and clients can skip transferring layers they already have. docker pull is essentially 'sync these hashes'.
The tradeoff: tags become problematic. nginx:latest is content-addressed under the hood, but the tag can point to different digests over time. Production should use digest references: nginx@sha256:abc..."

📝 Summary & Key Takeaways

Core Concepts

Concept | Key Point
Layers | Filesystem diffs stacked with overlay FS
Content-addressable | SHA256 hash = identity, enables deduplication
Manifest | JSON listing config + layers, ties image together
Multi-arch | Manifest list points to arch-specific manifests
Registry protocol | Standard HTTP API for push/pull

The Image Equation

Image = Manifest + Config + Layers

Manifest: "Here's what this image contains"
  → Config digest (how to run)
  → Layer digests (filesystem content)

Config: "Here's how to run this image"
  → CMD, ENV, EXPOSE, etc.
  → Build history

Layers: "Here's the filesystem"
  → Ordered tarballs
  → Each is diff from previous
  → Stacked by overlay filesystem

What You Can Do Now

  1. Analyze images: Use docker history and dive to understand composition
  2. Debug size issues: Identify which layers contribute most
  3. Understand caching: Know why builds are slow or fast
  4. Use digests: Reference immutable images in production

📋 Quick Reference

Image Inspection Commands

[Bash code block (14 lines) not preserved in this copy]

Registry API Endpoints

Endpoint | Purpose
GET /v2/_catalog | List repositories
GET /v2/<name>/tags/list | List tags
GET /v2/<name>/manifests/<ref> | Get manifest
GET /v2/<name>/blobs/<digest> | Get layer/config
HEAD /v2/<name>/blobs/<digest> | Check if blob exists

Disk Usage Commands

[Bash code block (10 lines) not preserved in this copy]

📅 Review Schedule

Day | Task | Time
Day 1 | Draw layer stacking diagram from memory | 10 min
Day 3 | Do Exercise 1 (dissect an image) | 15 min
Day 7 | Explain content-addressable storage to colleague | 5 min
Day 14 | Do Exercise 4 (registry protocol) | 30 min
Day 30 | Analyze a production image with dive | 20 min

📚 Series Navigation

Previous | Current | Next
Part 1: Container Internals | Part 2: Image Anatomy | Part 3: Build Process
Docker Compendium Series: