Java

HashSet Internals

HashSet is Java's primary implementation of a Set - a collection that contains no duplicate elements. But here's the secret: HashSet is just a HashMap in disguise. Understanding this relationship unlocks both Set and Map knowledge simultaneously.

This article explores how HashSet enforces uniqueness, why iteration order is unpredictable, and the common pitfalls that cause "duplicates" to appear in what should be unique collections.

📋 At a Glance

AspectDetails
Internal ImplementationBacked by HashMap<E, Object>
Uniqueness MechanismUses HashMap's key uniqueness
Time ComplexityO(1) average for add/remove/contains
Iteration OrderUndefined (hash-based, not predictable)
Null ElementsAllows one null element
Thread SafetyNot thread-safe

🎯 What You'll Learn

After reading this article, you will be able to:

  • Explain the HashMap backing: How HashSet uses HashMap internally
  • Understand uniqueness: Why equals() and hashCode() are critical
  • Perform set operations: Union, intersection, difference efficiently
  • Avoid common traps: Mutable elements, incorrect equals/hashCode
  • Choose the right Set: HashSet vs TreeSet vs LinkedHashSet

🔥 Production Story: The Duplicate Detection Failure

An e-commerce platform needed to prevent duplicate orders from being processed. The solution seemed simple:

JAVA(14 lines)
Code
Loading syntax highlighter...
The bug: Duplicate orders were being processed anyway!
JAVA(23 lines)
Code
Loading syntax highlighter...
What happened:
JAVA(9 lines)
Code
Loading syntax highlighter...
The fix:
JAVA(17 lines)
Code
Loading syntax highlighter...
Lesson: HashSet uniqueness depends entirely on your equals/hashCode implementation. Use only immutable, identifying fields.

🧠 Mental Model: HashSet = HashMap with Dummy Values

The simplest way to understand HashSet is to see what it actually is:

TEXT(31 lines)
Code
Loading syntax highlighter...
Key insight: HashSet is a thin wrapper that delegates everything to HashMap. If you understand HashMap (Part 12-13), you already understand HashSet.

🔬 Deep Dive

1. The Internal Structure

JAVA(28 lines)
Code
Loading syntax highlighter...

2. Core Operations

Every HashSet operation is a HashMap operation:

JAVA(39 lines)
Code
Loading syntax highlighter...

3. Why PRESENT and Not Null?

You might wonder: why use a dummy PRESENT object instead of null?
JAVA(10 lines)
Code
Loading syntax highlighter...
JAVA(7 lines)
Code
Loading syntax highlighter...

4. Iteration Order: Why It's Unpredictable

HashSet iteration follows HashMap's bucket order:

JAVA(12 lines)
Code
Loading syntax highlighter...
Why? The iterator walks through the bucket array sequentially, returning elements from each non-empty bucket. The order depends on:
  1. Hash codes of elements
  2. Current capacity of the map
  3. Hash function's bit mixing
TEXT(12 lines)
Code
Loading syntax highlighter...
If you need predictable order:
  • Insertion order: Use LinkedHashSet
  • Sorted order: Use TreeSet

5. Set Algebra Operations

HashSet supports standard set operations through Collection methods:

JAVA(25 lines)
Code
Loading syntax highlighter...
Performance note: These operations are O(n) where n is the size of the smaller set (for intersection/difference) or both sets (for union).

6. Initial Capacity Planning

Like HashMap, HashSet benefits from proper initial capacity:

JAVA(10 lines)
Code
Loading syntax highlighter...
Formula: capacity = (int)(expectedSize / loadFactor) + 1

With default load factor 0.75:

  • 1,000 elements → capacity ~1,334
  • 10,000 elements → capacity ~13,334
  • 1,000,000 elements → capacity ~1,333,334

⚠️ Common Mistakes

Mistake 1: Mutable Elements in HashSet

Problem:
JAVA(12 lines)
Code
Loading syntax highlighter...
Solution:
JAVA(16 lines)
Code
Loading syntax highlighter...

Mistake 2: Incorrect equals/hashCode Leading to "Duplicates"

Problem:
JAVA(19 lines)
Code
Loading syntax highlighter...
Solution:
JAVA(22 lines)
Code
Loading syntax highlighter...

Mistake 3: Expecting Iteration Order

Problem:
JAVA(8 lines)
Code
Loading syntax highlighter...
Solution:
JAVA(15 lines)
Code
Loading syntax highlighter...

Mistake 4: Inefficient Duplicate Detection

Problem:
JAVA(10 lines)
Code
Loading syntax highlighter...
Solution:
JAVA(11 lines)
Code
Loading syntax highlighter...

Mistake 5: Using HashSet with Non-Hashable Types

Problem:
JAVA(6 lines)
Code
Loading syntax highlighter...
Solution:
JAVA(19 lines)
Code
Loading syntax highlighter...

🐛 Debug This: The Mystery Duplicates

A developer reports: "I'm using HashSet but still getting duplicates!"

JAVA(31 lines)
Code
Loading syntax highlighter...
Why are there duplicates? Find and fix the bug.

✅ Solution:
The developer implemented equals() but not hashCode(). The toString() method is irrelevant for Set behavior.
Without a proper hashCode():
  1. Each CacheKey uses Object.hashCode() (memory address)
  2. Two equal CacheKeys have different hash codes
  3. They land in different buckets
  4. HashSet never calls equals() because they're in different buckets
JAVA(36 lines)
Code
Loading syntax highlighter...
Better approach - use Record:
JAVA(3 lines)
Code
Loading syntax highlighter...

💻 Exercises

Exercise 1: Efficient Deduplication

⭐⭐ Difficulty: Medium | ⏱️ Time: 10 minutes

Task: Remove duplicates from a list while preserving the order of first occurrences.
JAVA(10 lines)
Code
Loading syntax highlighter...
✅ Solution:
JAVA(36 lines)
Code
Loading syntax highlighter...

Exercise 2: Find Duplicates Between Lists

⭐⭐ Difficulty: Medium | ⏱️ Time: 15 minutes

Task: Find elements that appear in both lists efficiently.
JAVA(11 lines)
Code
Loading syntax highlighter...
✅ Solution:
JAVA(46 lines)
Code
Loading syntax highlighter...

Exercise 3: Set Algebra Operations

⭐⭐⭐ Difficulty: Medium | ⏱️ Time: 15 minutes

Task: Implement all standard set operations.
JAVA(9 lines)
Code
Loading syntax highlighter...
✅ Solution:
JAVA(78 lines)
Code
Loading syntax highlighter...

Exercise 4: Custom Hash Distribution Analyzer

⭐⭐⭐ Difficulty: Medium-Hard | ⏱️ Time: 20 minutes

Task: Create a utility to analyze how well elements are distributed across HashSet buckets.
JAVA(9 lines)
Code
Loading syntax highlighter...
✅ Solution:
JAVA(83 lines)
Code
Loading syntax highlighter...

Exercise 5: EnumSet Comparison

⭐⭐ Difficulty: Medium | ⏱️ Time: 10 minutes

Task: Compare HashSet and EnumSet for enum values.
JAVA(11 lines)
Code
Loading syntax highlighter...
✅ Solution:
JAVA(62 lines)
Code
Loading syntax highlighter...

📝 Summary & Key Takeaways

HashSet Internals

TEXT(12 lines)
Code
Loading syntax highlighter...

Performance Characteristics

OperationAverageWorst CaseNotes
add(E)O(1)O(n)Worst with many collisions
remove(E)O(1)O(n)Worst with many collisions
contains(E)O(1)O(n)Worst with many collisions
size()O(1)O(1)Just returns field
isEmpty()O(1)O(1)Checks size == 0
iterator.next()O(1)*O(n)*Amortized across full iteration

When to Use Which Set

Use CaseBest SetWhy
Default, no orderingHashSetO(1) operations
Maintain insertion orderLinkedHashSetO(1) + predictable iteration
Sorted iterationTreeSetO(log n) but sorted
Enum valuesEnumSetBit vector, extremely fast
Thread-safeConcurrentHashMap.newKeySet()Concurrent access
ImmutableSet.of()No modification possible

Critical Rules

  1. Always override both equals() AND hashCode() for elements
  2. Use immutable elements or never modify after adding
  3. Pre-size for large sets to avoid resize overhead
  4. Don't rely on iteration order - use LinkedHashSet if needed
  5. Use EnumSet for enum values - much more efficient

🎤 Senior-Level Interview Questions

Question #1: How does HashSet ensure uniqueness?

What interviewers want to hear: Understanding that HashSet is backed by HashMap.
Strong answer: HashSet uses a HashMap internally where elements are keys and values are a constant dummy object (PRESENT). When you call add(element), it actually calls map.put(element, PRESENT). HashMap ensures key uniqueness - if the key already exists, put() returns the old value (PRESENT), and HashSet's add() returns false. If the key is new, put() returns null, and add() returns true. The uniqueness check uses hashCode() to find the bucket, then equals() to check for duplicates in that bucket.

Question #2: What is the time complexity of add/remove/contains in HashSet?

What interviewers want to hear: Understanding of average vs worst case.
Strong answer: Average case is O(1) for all three operations because they're backed by HashMap's O(1) bucket lookup. However, worst case is O(n) when all elements hash to the same bucket (hash collision attack scenario). In Java 8+, this worst case improves to O(log n) due to tree bins that activate when a bucket has 8+ entries. The O(1) depends on a good hash function distributing elements across buckets evenly.

Question #3: Why doesn't HashSet maintain insertion order?

What interviewers want to hear: Understanding of hash-based storage.
Strong answer: HashSet stores elements in buckets determined by hash code, not insertion order. The iterator walks through the bucket array sequentially, returning elements from each non-empty bucket. This order depends on hash codes and current capacity - adding more elements can trigger resize, which redistributes elements to different buckets, changing iteration order. If you need insertion order, use LinkedHashSet, which maintains a doubly-linked list through all entries in addition to the hash table.

Question #4: What happens if you modify an object after adding it to a HashSet?

What interviewers want to hear: Understanding of the mutable element trap.
Strong answer: If modification changes the object's hashCode(), the object becomes "lost" in the HashSet. It's still physically present in some bucket, but that bucket no longer matches its current hashCode. Operations like contains(), remove(), and even iteration may behave unexpectedly. The object might appear during iteration (walking all buckets) but contains() returns false (looking in wrong bucket). This is why HashSet elements should be immutable, or at least their fields used in hashCode() should never change after insertion.

Question #5: What's the difference between HashSet and TreeSet?

What interviewers want to hear: Understanding of different Set implementations.
Strong answer: HashSet uses a hash table (HashMap internally), providing O(1) average operations but no ordering guarantees. TreeSet uses a red-black tree (TreeMap internally), providing O(log n) operations but maintaining sorted order. HashSet requires hashCode() and equals(); TreeSet requires Comparable or a Comparator. Choose HashSet for fastest operations when order doesn't matter; choose TreeSet when you need sorted iteration or NavigableSet methods like floor(), ceiling(), subSet().

Question #6: When should you use EnumSet instead of HashSet?

What interviewers want to hear: Knowledge of specialized collections.
Strong answer: Always use EnumSet for sets of enum values. EnumSet uses a bit vector representation - each enum value maps to a bit position, so a set of up to 64 enum values fits in a single long. This makes operations like add, remove, and contains simple bit operations - O(1) with tiny constants. Memory usage is minimal (8 bytes vs hundreds for HashSet). EnumSet also provides useful factory methods like EnumSet.allOf(), complementOf(), and range(). The only reason to use HashSet is when you need the exact HashSet class for some API requirement.

🏁 Conclusion

HashSet is deceptively simple - it's just a HashMap wrapper. But understanding this relationship unlocks deeper knowledge:

  • Every HashSet operation is a HashMap operation
  • Uniqueness depends on equals() AND hashCode()
  • Mutable elements are dangerous
  • Iteration order is unpredictable
Key takeaways:
  1. HashSet = HashMap with dummy values - understand one, understand both
  2. Override both equals() AND hashCode() - missing either breaks uniqueness
  3. Use immutable elements - mutation causes "lost" objects
  4. Pre-size large sets - inherits HashMap's resize behavior
  5. Choose the right Set - LinkedHashSet for order, TreeSet for sorting, EnumSet for enums
Your next step: Continue to Part 10 (TreeSet) to understand sorted sets and NavigableSet operations, or Part 11 (Specialized Sets) to explore EnumSet and CopyOnWriteArraySet.

📅 Review Schedule for This Article

DayTaskTime
Day 1Review HashSet = HashMap relationship5 min
Day 3Redo Exercise 2 (Find Duplicates)15 min
Day 7Answer interview questions without looking10 min
Day 14Redo Debug This exercise10 min
Day 30Review Set comparison table5 min

Next in series: [Part 10: TreeSet & NavigableSet Operations]