HashSet Internals

HashSet is Java's primary implementation of a Set - a collection that contains no duplicate elements. But here's the secret: HashSet is just a HashMap in disguise. Understanding this relationship unlocks both Set and Map knowledge simultaneously.

This article explores how HashSet enforces uniqueness, why iteration order is unpredictable, and the common pitfalls that cause "duplicates" to appear in what should be unique collections.

📋 At a Glance

Aspect	Details
Internal Implementation	Backed by `HashMap<E, Object>`
Uniqueness Mechanism	Uses HashMap's key uniqueness
Time Complexity	O(1) average for add/remove/contains
Iteration Order	Undefined (hash-based, not predictable)
Null Elements	Allows one null element
Thread Safety	Not thread-safe

🎯 What You'll Learn

After reading this article, you will be able to:

Explain the HashMap backing: How HashSet uses HashMap internally
Understand uniqueness: Why equals() and hashCode() are critical
Perform set operations: Union, intersection, difference efficiently
Avoid common traps: Mutable elements, incorrect equals/hashCode
Choose the right Set: HashSet vs TreeSet vs LinkedHashSet

🔥 Production Story: The Duplicate Detection Failure

An e-commerce platform needed to prevent duplicate orders from being processed. The solution seemed simple:

JAVA(14 lines)
Code
Loading syntax highlighter...

The bug: Duplicate orders were being processed anyway!

JAVA(23 lines)
Code
Loading syntax highlighter...

What happened:

JAVA(9 lines)
Code
Loading syntax highlighter...

The fix:

JAVA(17 lines)
Code
Loading syntax highlighter...

Lesson: HashSet uniqueness depends entirely on your equals/hashCode implementation. Use only immutable, identifying fields.

🧠 Mental Model: HashSet = HashMap with Dummy Values

The simplest way to understand HashSet is to see what it actually is:

TEXT(31 lines)
Code
Loading syntax highlighter...

Key insight: HashSet is a thin wrapper that delegates everything to HashMap. If you understand HashMap (Part 12-13), you already understand HashSet.

🔬 Deep Dive

1. The Internal Structure

JAVA(28 lines)
Code
Loading syntax highlighter...

2. Core Operations

Every HashSet operation is a HashMap operation:

JAVA(39 lines)
Code
Loading syntax highlighter...

3. Why PRESENT and Not Null?

You might wonder: why use a dummy PRESENT object instead of null?

JAVA(10 lines)
Code
Loading syntax highlighter...

JAVA(7 lines)
Code
Loading syntax highlighter...

4. Iteration Order: Why It's Unpredictable

HashSet iteration follows HashMap's bucket order:

JAVA(12 lines)
Code
Loading syntax highlighter...

Why? The iterator walks through the bucket array sequentially, returning elements from each non-empty bucket. The order depends on:

Hash codes of elements
Current capacity of the map
Hash function's bit mixing

TEXT(12 lines)
Code
Loading syntax highlighter...

If you need predictable order:

Insertion order: Use LinkedHashSet
Sorted order: Use TreeSet

5. Set Algebra Operations

HashSet supports standard set operations through Collection methods:

JAVA(25 lines)
Code
Loading syntax highlighter...

Performance note: These operations are O(n) where n is the size of the smaller set (for intersection/difference) or both sets (for union).

6. Initial Capacity Planning

Like HashMap, HashSet benefits from proper initial capacity:

JAVA(10 lines)
Code
Loading syntax highlighter...

Formula: capacity = (int)(expectedSize / loadFactor) + 1

With default load factor 0.75:

1,000 elements → capacity ~1,334
10,000 elements → capacity ~13,334
1,000,000 elements → capacity ~1,333,334

⚠️ Common Mistakes

Mistake 1: Mutable Elements in HashSet

❌ Problem:

JAVA(12 lines)
Code
Loading syntax highlighter...

✅ Solution:

JAVA(16 lines)
Code
Loading syntax highlighter...

Mistake 2: Incorrect equals/hashCode Leading to "Duplicates"

❌ Problem:

JAVA(19 lines)
Code
Loading syntax highlighter...

✅ Solution:

JAVA(22 lines)
Code
Loading syntax highlighter...

Mistake 3: Expecting Iteration Order

❌ Problem:

JAVA(8 lines)
Code
Loading syntax highlighter...

✅ Solution:

JAVA(15 lines)
Code
Loading syntax highlighter...

Mistake 4: Inefficient Duplicate Detection

❌ Problem:

JAVA(10 lines)
Code
Loading syntax highlighter...

✅ Solution:

JAVA(11 lines)
Code
Loading syntax highlighter...

Mistake 5: Using HashSet with Non-Hashable Types

❌ Problem:

JAVA(6 lines)
Code
Loading syntax highlighter...

✅ Solution:

JAVA(19 lines)
Code
Loading syntax highlighter...

🐛 Debug This: The Mystery Duplicates

A developer reports: "I'm using HashSet but still getting duplicates!"

JAVA(31 lines)
Code
Loading syntax highlighter...

Why are there duplicates? Find and fix the bug.

✅ Solution:

The developer implemented equals() but not hashCode(). The toString() method is irrelevant for Set behavior.

Without a proper hashCode():

Each CacheKey uses Object.hashCode() (memory address)
Two equal CacheKeys have different hash codes
They land in different buckets
HashSet never calls equals() because they're in different buckets

JAVA(36 lines)
Code
Loading syntax highlighter...

Better approach - use Record:

JAVA(3 lines)
Code
Loading syntax highlighter...

💻 Exercises

Exercise 1: Efficient Deduplication

⭐⭐ Difficulty: Medium | ⏱️ Time: 10 minutes

Task: Remove duplicates from a list while preserving the order of first occurrences.

JAVA(10 lines)
Code
Loading syntax highlighter...

✅ Solution:

JAVA(36 lines)
Code
Loading syntax highlighter...

Exercise 2: Find Duplicates Between Lists

⭐⭐ Difficulty: Medium | ⏱️ Time: 15 minutes

Task: Find elements that appear in both lists efficiently.

JAVA(11 lines)
Code
Loading syntax highlighter...

✅ Solution:

JAVA(46 lines)
Code
Loading syntax highlighter...

Exercise 3: Set Algebra Operations

⭐⭐⭐ Difficulty: Medium | ⏱️ Time: 15 minutes

Task: Implement all standard set operations.

JAVA(9 lines)
Code
Loading syntax highlighter...

✅ Solution:

JAVA(78 lines)
Code
Loading syntax highlighter...

Exercise 4: Custom Hash Distribution Analyzer

⭐⭐⭐ Difficulty: Medium-Hard | ⏱️ Time: 20 minutes

Task: Create a utility to analyze how well elements are distributed across HashSet buckets.

JAVA(9 lines)
Code
Loading syntax highlighter...

✅ Solution:

JAVA(83 lines)
Code
Loading syntax highlighter...

Exercise 5: EnumSet Comparison

⭐⭐ Difficulty: Medium | ⏱️ Time: 10 minutes

Task: Compare HashSet and EnumSet for enum values.

JAVA(11 lines)
Code
Loading syntax highlighter...

✅ Solution:

JAVA(62 lines)
Code
Loading syntax highlighter...

📝 Summary & Key Takeaways

TEXT(12 lines)
Code
Loading syntax highlighter...

Performance Characteristics

Operation	Average	Worst Case	Notes
add(E)	O(1)	O(n)	Worst with many collisions
remove(E)	O(1)	O(n)	Worst with many collisions
contains(E)	O(1)	O(n)	Worst with many collisions
size()	O(1)	O(1)	Just returns field
isEmpty()	O(1)	O(1)	Checks size == 0
iterator.next()	O(1)*	O(n)	*Amortized across full iteration

When to Use Which Set

Use Case	Best Set	Why
Default, no ordering	HashSet	O(1) operations
Maintain insertion order	LinkedHashSet	O(1) + predictable iteration
Sorted iteration	TreeSet	O(log n) but sorted
Enum values	EnumSet	Bit vector, extremely fast
Thread-safe	ConcurrentHashMap.newKeySet()	Concurrent access
Immutable	Set.of()	No modification possible

Critical Rules

Always override both equals() AND hashCode() for elements
Use immutable elements or never modify after adding
Pre-size for large sets to avoid resize overhead
Don't rely on iteration order - use LinkedHashSet if needed
Use EnumSet for enum values - much more efficient

🎤 Senior-Level Interview Questions

Question #1: How does HashSet ensure uniqueness?

What interviewers want to hear: Understanding that HashSet is backed by HashMap.

Strong answer: HashSet uses a HashMap internally where elements are keys and values are a constant dummy object (PRESENT). When you call add(element), it actually calls map.put(element, PRESENT). HashMap ensures key uniqueness - if the key already exists, put() returns the old value (PRESENT), and HashSet's add() returns false. If the key is new, put() returns null, and add() returns true. The uniqueness check uses hashCode() to find the bucket, then equals() to check for duplicates in that bucket.

Question #2: What is the time complexity of add/remove/contains in HashSet?

What interviewers want to hear: Understanding of average vs worst case.

Strong answer: Average case is O(1) for all three operations because they're backed by HashMap's O(1) bucket lookup. However, worst case is O(n) when all elements hash to the same bucket (hash collision attack scenario). In Java 8+, this worst case improves to O(log n) due to tree bins that activate when a bucket has 8+ entries. The O(1) depends on a good hash function distributing elements across buckets evenly.

Question #3: Why doesn't HashSet maintain insertion order?

What interviewers want to hear: Understanding of hash-based storage.

Strong answer: HashSet stores elements in buckets determined by hash code, not insertion order. The iterator walks through the bucket array sequentially, returning elements from each non-empty bucket. This order depends on hash codes and current capacity - adding more elements can trigger resize, which redistributes elements to different buckets, changing iteration order. If you need insertion order, use LinkedHashSet, which maintains a doubly-linked list through all entries in addition to the hash table.

Question #4: What happens if you modify an object after adding it to a HashSet?

What interviewers want to hear: Understanding of the mutable element trap.

Strong answer: If modification changes the object's hashCode(), the object becomes "lost" in the HashSet. It's still physically present in some bucket, but that bucket no longer matches its current hashCode. Operations like contains(), remove(), and even iteration may behave unexpectedly. The object might appear during iteration (walking all buckets) but contains() returns false (looking in wrong bucket). This is why HashSet elements should be immutable, or at least their fields used in hashCode() should never change after insertion.

Question #5: What's the difference between HashSet and TreeSet?

What interviewers want to hear: Understanding of different Set implementations.

Strong answer: HashSet uses a hash table (HashMap internally), providing O(1) average operations but no ordering guarantees. TreeSet uses a red-black tree (TreeMap internally), providing O(log n) operations but maintaining sorted order. HashSet requires hashCode() and equals(); TreeSet requires Comparable or a Comparator. Choose HashSet for fastest operations when order doesn't matter; choose TreeSet when you need sorted iteration or NavigableSet methods like floor(), ceiling(), subSet().

Question #6: When should you use EnumSet instead of HashSet?

What interviewers want to hear: Knowledge of specialized collections.

Strong answer: Always use EnumSet for sets of enum values. EnumSet uses a bit vector representation - each enum value maps to a bit position, so a set of up to 64 enum values fits in a single long. This makes operations like add, remove, and contains simple bit operations - O(1) with tiny constants. Memory usage is minimal (8 bytes vs hundreds for HashSet). EnumSet also provides useful factory methods like EnumSet.allOf(), complementOf(), and range(). The only reason to use HashSet is when you need the exact HashSet class for some API requirement.

🏁 Conclusion

HashSet is deceptively simple - it's just a HashMap wrapper. But understanding this relationship unlocks deeper knowledge:

Every HashSet operation is a HashMap operation
Uniqueness depends on equals() AND hashCode()
Mutable elements are dangerous
Iteration order is unpredictable

Key takeaways:

HashSet = HashMap with dummy values - understand one, understand both
Override both equals() AND hashCode() - missing either breaks uniqueness
Use immutable elements - mutation causes "lost" objects
Pre-size large sets - inherits HashMap's resize behavior
Choose the right Set - LinkedHashSet for order, TreeSet for sorting, EnumSet for enums

Your next step: Continue to Part 10 (TreeSet) to understand sorted sets and NavigableSet operations, or Part 11 (Specialized Sets) to explore EnumSet and CopyOnWriteArraySet.

📅 Review Schedule for This Article

Day	Task	Time
Day 1	Review HashSet = HashMap relationship	5 min
Day 3	Redo Exercise 2 (Find Duplicates)	15 min
Day 7	Answer interview questions without looking	10 min
Day 14	Redo Debug This exercise	10 min
Day 30	Review Set comparison table	5 min

Next in series: [Part 10: TreeSet & NavigableSet Operations]

HashSet Internals

📋 At a Glance

🎯 What You'll Learn

🔥 Production Story: The Duplicate Detection Failure

🧠 Mental Model: HashSet = HashMap with Dummy Values

🔬 Deep Dive

1. The Internal Structure

2. Core Operations

3. Why PRESENT and Not Null?

4. Iteration Order: Why It's Unpredictable

5. Set Algebra Operations

6. Initial Capacity Planning

⚠️ Common Mistakes

Mistake 1: Mutable Elements in HashSet

Mistake 2: Incorrect equals/hashCode Leading to "Duplicates"

Mistake 3: Expecting Iteration Order

Mistake 4: Inefficient Duplicate Detection

Mistake 5: Using HashSet with Non-Hashable Types

🐛 Debug This: The Mystery Duplicates

💻 Exercises

Exercise 1: Efficient Deduplication

Exercise 2: Find Duplicates Between Lists

Exercise 3: Set Algebra Operations

Exercise 4: Custom Hash Distribution Analyzer

Exercise 5: EnumSet Comparison

📝 Summary & Key Takeaways