Java

Streams and Collectors

Master the Stream API for powerful collection transformations. Learn Collector internals, avoid toMap() pitfalls, write custom collectors, and understand when parallel streams help versus hurt performance.

📋 At a Glance

| Aspect | Details |
| --- | --- |
| Topic | Stream API, Collectors, groupingBy, toMap, parallel streams |
| Complexity | Intermediate to Advanced |
| Prerequisites | Part 1 (Collection Architecture), Part 2 (Generics) |
| Time to Master | 4-5 hours |
| Interview Frequency | Very High (functional programming, data transformation) |

🎯 What You'll Learn

After completing this article, you will be able to:

  1. Transform collections efficiently with Stream operations
  2. Use Collectors for complex aggregations
  3. Avoid common toMap() and groupingBy() pitfalls
  4. Write custom Collectors for specialized needs
  5. Decide when parallel streams improve performance

Production Story: The toMap() Crash

The Incident

Our data import service crashed during the nightly batch job. The culprit was a simple-looking stream operation:

JAVA(10 lines)

The Problem

TEXT(17 lines)

The Stream Solution

JAVA(39 lines)

The Difference

TEXT(10 lines)
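
The incident listing is not reproduced here, so the following is only a minimal sketch of the failure mode the section title points at: two-argument toMap() meeting a duplicate key. The row format, the class name ToMapCrashSketch, and the helper names are assumptions made for illustration, not the service's actual code.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class ToMapCrashSketch {

    // Hypothetical import rows in "id,name" form; id 42 appears twice.
    static final List<String> ROWS = List.of("42,Alice", "7,Bob", "42,Alice B.");

    static String idOf(String row)   { return row.split(",")[0]; }
    static String nameOf(String row) { return row.split(",")[1]; }

    // Two-argument toMap(): throws IllegalStateException on a duplicate key.
    static Map<String, String> indexStrict(List<String> rows) {
        return rows.stream()
                .collect(Collectors.toMap(ToMapCrashSketch::idOf, ToMapCrashSketch::nameOf));
    }

    // Three-argument toMap(): the merge function decides which value survives.
    static Map<String, String> indexKeepLast(List<String> rows) {
        return rows.stream()
                .collect(Collectors.toMap(
                        ToMapCrashSketch::idOf,
                        ToMapCrashSketch::nameOf,
                        (earlier, later) -> later)); // keep the later row
    }

    public static void main(String[] args) {
        try {
            indexStrict(ROWS);
        } catch (IllegalStateException e) {
            System.out.println("strict index failed: " + e.getMessage());
        }
        System.out.println(indexKeepLast(ROWS));
    }
}
```

Keeping the later row is just one policy; a merge function could equally keep the earlier value or combine both.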

Mental Model: The Assembly Line

TEXT(62 lines)

Deep Dive: Collection to Stream and Back

Creating Streams from Collections

JAVA(21 lines)

Collecting Back to Collections

JAVA(18 lines)

Deep Dive: Collectors.toMap()

Basic toMap Usage

JAVA(13 lines)

Handling Duplicates (The Critical Part!)

JAVA(33 lines)

Specifying Map Implementation

JAVA(25 lines)

Deep Dive: groupingBy and partitioningBy

Basic Grouping

JAVA(9 lines)

Downstream Collectors

JAVA(42 lines)

Nested Grouping

JAVA(17 lines)

partitioningBy (Binary Split)

JAVA(14 lines)

Deep Dive: Advanced Collectors

joining()

JAVA(18 lines)

collectingAndThen()

JAVA(28 lines)

reducing()

JAVA(19 lines)

teeing() (Java 12+)

JAVA(22 lines)

Deep Dive: Writing Custom Collectors

Collector Interface

JAVA(12 lines)

Custom Collector: ImmutableList

JAVA(15 lines)

Custom Collector: Running Statistics

JAVA(42 lines)

Custom Collector: Top N

JAVA(29 lines)

Deep Dive: Parallel Streams

When to Use Parallel Streams

JAVA(19 lines)

Parallel Stream Pitfalls

JAVA(34 lines)

Measuring Parallel Performance

JAVA(21 lines)

⚠️ Common Mistakes

Mistake 1: toMap() Without Merge Function

JAVA(15 lines)

Mistake 2: Modifying Source During Stream

JAVA(14 lines)

Mistake 3: Assuming Parallel is Faster

JAVA(15 lines)

Mistake 4: Using peek() for Side Effects

JAVA(14 lines)

Mistake 5: Ignoring Optional in Collectors

JAVA(14 lines)

🐛 Debug This

Challenge 1: The Empty Map

JAVA(12 lines)
✅ Answer:
You get {cat=2, dog=1, bird=1} - it works correctly! The merge function Integer::sum adds up values for duplicate keys.
This is actually the RIGHT way to count occurrences with toMap().

However, there is a simpler approach:

JAVA(5 lines)
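
To make the answer concrete, here is a small sketch of both counting styles. The input list is an assumption, chosen only so that the counts match the ones quoted in the answer; groupingBy() with counting() is shown as one common alternative, which may or may not be the alternative the original listing used.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class WordCountSketch {

    // Assumed input, chosen to match the counts quoted in the answer.
    static final List<String> WORDS = List.of("cat", "dog", "cat", "bird");

    // toMap(): every word maps to 1, and Integer::sum merges duplicate keys.
    static Map<String, Integer> countWithToMap(List<String> words) {
        return words.stream()
                .collect(Collectors.toMap(Function.identity(), w -> 1, Integer::sum));
    }

    // groupingBy() + counting(): same idea, but the values are Long.
    static Map<String, Long> countWithGroupingBy(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countWithToMap(WORDS));
        System.out.println(countWithGroupingBy(WORDS));
    }
}
```

Both maps are HashMap instances by default, so the printed key order is not something to rely on.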

Challenge 2: The Lost Elements

JAVA(8 lines)
✅ Answer:
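Whether the result is deterministic depends on ordering. On an unordered parallel stream (an unordered source such as a HashSet, or after calling unordered()), the result could be [6, 7, 8] or [8, 9, 10] or [6, 9, 10], etc.
In that case limit() doesn't guarantee which elements are kept, just that 3 are kept. On an ordered parallel stream, limit(3) still returns the first three elements in encounter order, at some cost in speed.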
Fix:
JAVA(6 lines)
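
As a hedged illustration of how ordering interacts with limit() in a parallel pipeline, the sketch below contrasts an ordered source with an explicitly unordered one. The range 1..10 and the filter n > 5 are assumptions picked to resemble the values in the answer; the class name is hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class ParallelLimitSketch {

    // Ordered source: limit(3) keeps the first three matches in encounter
    // order, even though the pipeline runs in parallel.
    static List<Integer> firstThreeOrdered() {
        return IntStream.rangeClosed(1, 10).boxed()
                .parallel()
                .filter(n -> n > 5)
                .limit(3)
                .collect(Collectors.toList());
    }

    // Unordered stream: limit(3) is free to keep any three matches.
    static List<Integer> anyThreeUnordered() {
        return IntStream.rangeClosed(1, 10).boxed()
                .parallel()
                .unordered()
                .filter(n -> n > 5)
                .limit(3)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("ordered:   " + firstThreeOrdered());
        System.out.println("unordered: " + anyThreeUnordered());
    }
}
```

If the first three matches are what you need, keeping the stream ordered (or simply sequential) is the safer choice; unordered() trades that guarantee for potentially cheaper parallel execution.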

Challenge 3: The Mysterious Null

JAVA(7 lines)
✅ Answer:
NullPointerException! groupingBy doesn't allow null keys.
Fix:
JAVA(10 lines)
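
A minimal sketch of the null-key failure and one possible fix follows. The (name, city) pairs and the "UNKNOWN" sentinel are assumptions for illustration; filtering out the null entries before grouping would be an equally valid fix.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

final class NullKeyGroupingSketch {

    // Hypothetical (name, city) pairs; Ben's city is null.
    static final List<String[]> PEOPLE = List.of(
            new String[] {"Ann", "Oslo"},
            new String[] {"Ben", null},
            new String[] {"Cy", "Oslo"});

    // The classifier returns null for Ben, so groupingBy throws NullPointerException.
    static Map<String, List<String>> groupStrict(List<String[]> people) {
        return people.stream()
                .collect(Collectors.groupingBy(
                        p -> p[1],
                        Collectors.mapping(p -> p[0], Collectors.toList())));
    }

    // Replace null keys with a sentinel value before grouping.
    static Map<String, List<String>> groupWithDefault(List<String[]> people) {
        return people.stream()
                .collect(Collectors.groupingBy(
                        p -> Objects.requireNonNullElse(p[1], "UNKNOWN"),
                        Collectors.mapping(p -> p[0], Collectors.toList())));
    }

    public static void main(String[] args) {
        try {
            groupStrict(PEOPLE);
        } catch (NullPointerException e) {
            System.out.println("grouping failed: " + e.getMessage());
        }
        System.out.println(groupWithDefault(PEOPLE));
    }
}
```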

💻 Exercises

Exercise 1: Multi-level Aggregation

Create a report showing average salary by department and seniority level:

JAVA(4 lines)
✅ Solution:
JAVA(26 lines)
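
One possible shape of a solution, sketched under assumptions: the Employee class below (department, level, salary) is hypothetical, since the exercise's own types are not shown. The report is built with a nested groupingBy() whose innermost downstream collector is averagingDouble().

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class SalaryReportSketch {

    // Hypothetical Employee shape; the exercise's own class is not shown.
    static final class Employee {
        private final String department;
        private final String level;
        private final double salary;

        Employee(String department, String level, double salary) {
            this.department = department;
            this.level = level;
            this.salary = salary;
        }

        String department() { return department; }
        String level()      { return level; }
        double salary()     { return salary; }
    }

    static final List<Employee> SAMPLE = List.of(
            new Employee("Eng", "Senior", 100.0),
            new Employee("Eng", "Senior", 120.0),
            new Employee("Eng", "Junior", 60.0),
            new Employee("Ops", "Junior", 50.0));

    // department -> seniority level -> average salary
    static Map<String, Map<String, Double>> averageSalary(List<Employee> employees) {
        return employees.stream()
                .collect(Collectors.groupingBy(
                        Employee::department,
                        Collectors.groupingBy(
                                Employee::level,
                                Collectors.averagingDouble(Employee::salary))));
    }

    public static void main(String[] args) {
        averageSalary(SAMPLE).forEach((dept, byLevel) ->
                byLevel.forEach((level, avg) ->
                        System.out.println(dept + " / " + level + ": " + avg)));
    }
}
```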

Exercise 2: Custom Collector - Distinct Count Per Group

Write a collector that counts distinct values per group:

JAVA(11 lines)
✅ Solution:
JAVA(40 lines)
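
A hedged sketch of one way to approach this exercise: rather than implementing Collector from scratch, it composes existing collectors (groupingBy, mapping, toSet, collectingAndThen). The method name distinctCountBy and the (customer, product) sample data are assumptions; the original solution may well differ.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collector;
import java.util.stream.Collectors;

final class DistinctCountSketch {

    // Groups elements by classifier and counts the distinct extracted values per group.
    static <T, K, V> Collector<T, ?, Map<K, Integer>> distinctCountBy(
            Function<? super T, ? extends K> classifier,
            Function<? super T, ? extends V> valueExtractor) {
        // Collect each group's values into a Set, then report the Set's size.
        Collector<T, ?, Integer> distinctCount = Collectors.collectingAndThen(
                Collectors.mapping(valueExtractor, Collectors.<V>toSet()),
                Set::size);
        return Collectors.groupingBy(classifier, distinctCount);
    }

    // Hypothetical (customer, product) orders; "ann" buys "pen" twice.
    static final List<String[]> ORDERS = List.of(
            new String[] {"ann", "pen"},
            new String[] {"ann", "pen"},
            new String[] {"ann", "ink"},
            new String[] {"bob", "pen"});

    static Map<String, Integer> distinctProductsPerCustomer(List<String[]> orders) {
        Function<String[], String> customer = o -> o[0];
        Function<String[], String> product = o -> o[1];
        return orders.stream().collect(distinctCountBy(customer, product));
    }

    public static void main(String[] args) {
        System.out.println(distinctProductsPerCustomer(ORDERS));
    }
}
```

Distinctness here relies on equals() and hashCode() of the extracted values, as with any HashSet.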

Exercise 3: Pagination Collector

Create a collector that paginates results:

JAVA(5 lines)
✅ Solution:
JAVA(40 lines)
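
The exercise statement leaves room for interpretation; the sketch below assumes "paginate" means splitting the stream into consecutive fixed-size pages. The name toPages and its signature are hypothetical. It is built with Collector.of, buffering all elements and slicing them in the finisher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class PaginationSketch {

    // Collects elements into consecutive pages of at most pageSize elements.
    static <T> Collector<T, ?, List<List<T>>> toPages(int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        return Collector.<T, List<T>, List<List<T>>>of(
                ArrayList::new,                                  // supplier: buffer for all elements
                List::add,                                       // accumulator: append one element
                (left, right) -> { left.addAll(right); return left; }, // combiner: merge partial buffers
                all -> {                                         // finisher: slice buffer into pages
                    List<List<T>> pages = new ArrayList<>();
                    for (int from = 0; from < all.size(); from += pageSize) {
                        int to = Math.min(from + pageSize, all.size());
                        pages.add(new ArrayList<>(all.subList(from, to)));
                    }
                    return pages;
                });
    }

    static List<List<Integer>> pagesOfRange(int lastInclusive, int pageSize) {
        return IntStream.rangeClosed(1, lastInclusive).boxed().collect(toPages(pageSize));
    }

    public static void main(String[] args) {
        System.out.println(pagesOfRange(7, 3));
    }
}
```

Because the finisher needs the complete buffer, this collector holds every element in memory; for very large inputs, paging at the source (for example with skip() and limit()) may be a better fit.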

🎤 Senior-Level Interview Questions

Question 1: toMap vs groupingBy

Q: When would you use toMap() vs groupingBy()? What's the key difference?
A:
| Aspect | toMap() | groupingBy() |
| --- | --- | --- |
| Result | Map<K, V> | Map<K, List<V>> |
| Duplicates | Must handle explicitly | Naturally groups duplicates |
| Use case | Unique key per element | Multiple elements per key |
JAVA(7 lines)
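
The contrast in the table can be sketched with a small, assumed word list: toMap() when each element yields its own key, groupingBy() when several elements share one. Names and data are illustrative only.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class ToMapVsGroupingBySketch {

    static final List<String> WORDS = List.of("apple", "avocado", "banana");

    // Unique key per element: word -> length. A repeated word would throw
    // here, because no merge function is supplied.
    static Map<String, Integer> lengthByWord(List<String> words) {
        return words.stream()
                .collect(Collectors.toMap(w -> w, String::length));
    }

    // Multiple elements per key: first letter -> all words starting with it.
    static Map<Character, List<String>> wordsByInitial(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w.charAt(0)));
    }

    public static void main(String[] args) {
        System.out.println(lengthByWord(WORDS));
        System.out.println(wordsByInitial(WORDS));
    }
}
```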

Question 2: Collector Components

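Q: Explain the four functional methods of the Collector interface (supplier, accumulator, combiner, finisher) and when each is called. A fifth method, characteristics(), only supplies optimization hints.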
A:
JAVA(15 lines)
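
As a rough sketch of how the pieces fit together, the collector below concatenates strings using Collector.of. It is a toy stand-in for joining(), written only to label each component; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.stream.Collector;

final class CollectorPartsSketch {

    static Collector<String, StringBuilder, String> concatenating() {
        return Collector.of(
                () -> new StringBuilder(),            // supplier: creates an empty container (one per chunk in parallel)
                (sb, s) -> sb.append(s),              // accumulator: folds one element into a container
                (left, right) -> left.append(right),  // combiner: merges two partial containers (parallel runs)
                sb -> sb.toString());                 // finisher: converts the container into the final result
        // No Characteristics are passed, so characteristics() returns an empty set.
    }

    static String concatSequential(List<String> parts) {
        return parts.stream().collect(concatenating());
    }

    static String concatParallel(List<String> parts) {
        return parts.parallelStream().collect(concatenating());
    }

    public static void main(String[] args) {
        List<String> parts = List.of("a", "b", "c");
        System.out.println(concatSequential(parts));
        System.out.println(concatParallel(parts));
    }
}
```

In a sequential run the combiner is typically never invoked; it matters once the source is split across threads.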

Question 3: Parallel Stream Overhead

Q: Why can parallel streams be slower than sequential for small collections?
A:

Parallel streams have overhead:

  1. Splitting: Source must be split into chunks
  2. Thread management: ForkJoinPool coordination
  3. Combining: Results from threads must be merged
  4. Memory: Each thread needs its own accumulator
JAVA(15 lines)
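
The overheads listed above can be observed, very roughly, with a timing sketch like the one below. It is not a rigorous benchmark: there is no warm-up or statistical treatment, the sizes are arbitrary assumptions, and results vary by machine. A harness such as JMH is the usual tool for real measurements.

```java
import java.util.function.LongSupplier;
import java.util.stream.LongStream;

final class ParallelOverheadSketch {

    static long sumOfSquaresSequential(int n) {
        return LongStream.rangeClosed(1, n).map(x -> x * x).sum();
    }

    static long sumOfSquaresParallel(int n) {
        return LongStream.rangeClosed(1, n).parallel().map(x -> x * x).sum();
    }

    // Crude single-shot timing; good enough to see orders of magnitude only.
    static long nanosFor(LongSupplier task) {
        long start = System.nanoTime();
        task.getAsLong();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        for (int n : new int[] {100, 1_000_000}) {
            long seq = nanosFor(() -> sumOfSquaresSequential(n));
            long par = nanosFor(() -> sumOfSquaresParallel(n));
            System.out.println("n=" + n + " sequential=" + seq + "ns parallel=" + par + "ns");
        }
    }
}
```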

Question 4: Stream Reuse

Q: Can you reuse a stream? What happens if you try?
A:

No, streams can only be consumed once:

JAVA(18 lines)
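
A minimal sketch of the single-use rule, with assumed sample data. The Stream documentation says an implementation may throw IllegalStateException when it detects reuse; the standard JDK implementation does so. A Supplier that produces a fresh stream each time is one common workaround.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Stream;

final class StreamReuseSketch {

    // Returns true if a second terminal operation on the same stream is rejected.
    static boolean secondTerminalOperationFails() {
        Stream<String> names = Stream.of("ann", "bob", "amy");
        names.count(); // first terminal operation consumes the stream
        try {
            names.count(); // reuse attempt
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Workaround: ask a Supplier for a fresh stream per terminal operation.
    static long[] countAllAndStartingWithA(List<String> source) {
        Supplier<Stream<String>> fresh = source::stream;
        long all = fresh.get().count();
        long startingWithA = fresh.get().filter(s -> s.startsWith("a")).count();
        return new long[] {all, startingWithA};
    }

    public static void main(String[] args) {
        System.out.println("reuse rejected: " + secondTerminalOperationFails());
        long[] counts = countAllAndStartingWithA(List.of("ann", "bob", "amy"));
        System.out.println(counts[0] + " total, " + counts[1] + " starting with 'a'");
    }
}
```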

Question 5: flatMap vs map

Q: What's the difference between map() and flatMap()? Give an example.
A:
JAVA(27 lines)
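
A short sketch of the distinction, using assumed sample sentences: map() produces exactly one output per input element, while flatMap() lets each input expand into a stream of zero or more outputs that are then flattened into one stream.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class MapVsFlatMapSketch {

    static final List<String> SENTENCES = List.of("to be", "or not");

    // map(): one result per sentence (here, its word count).
    static List<Integer> wordCounts(List<String> sentences) {
        return sentences.stream()
                .map(s -> s.split(" ").length)
                .collect(Collectors.toList());
    }

    // flatMap(): each sentence becomes a stream of words, merged into one stream.
    static List<String> words(List<String> sentences) {
        return sentences.stream()
                .flatMap(s -> Arrays.stream(s.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(wordCounts(SENTENCES));
        System.out.println(words(SENTENCES));
    }
}
```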

📝 Summary & Key Takeaways

Essential Collectors

| Collector | Purpose | Example |
| --- | --- | --- |
| toList() | Collect to List | stream.collect(toList()) |
| toSet() | Collect to Set | stream.collect(toSet()) |
| toMap() | Collect to Map | stream.collect(toMap(k, v, merge)) |
| groupingBy() | Group by key | stream.collect(groupingBy(classifier)) |
| partitioningBy() | Binary split | stream.collect(partitioningBy(predicate)) |
| joining() | Concatenate strings | stream.collect(joining(", ")) |
| counting() | Count elements | groupingBy(x, counting()) |
| summingInt() | Sum values | groupingBy(x, summingInt(fn)) |
| mapping() | Transform in group | groupingBy(x, mapping(fn, toList())) |

Key Rules

  1. Always use a merge function with toMap() - duplicate keys are common
  2. groupingBy doesn't allow null keys - filter or transform nulls first
  3. Parallel streams need stateless operations - avoid shared mutable state
  4. Measure before parallelizing - overhead can exceed the benefit
  5. Streams are single-use - create a new stream for each terminal operation
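
The rule about shared mutable state can be sketched as follows. The counting task and names are assumptions for illustration: the unsafe version updates a shared array element from multiple threads and may lose increments, while the safe version lets the stream do the aggregation.

```java
import java.util.stream.IntStream;

final class SharedStateSketch {

    // Unsafe: counter[0]++ is a non-atomic read-modify-write on shared state,
    // so parallel threads can overwrite each other's updates.
    static int countEvensUnsafe(int n) {
        int[] counter = {0};
        IntStream.range(0, n)
                .parallel()
                .filter(i -> i % 2 == 0)
                .forEach(i -> counter[0]++);
        return counter[0]; // may be lower than the true count
    }

    // Safe: no shared state; the terminal operation combines per-thread results.
    static long countEvensSafe(int n) {
        return IntStream.range(0, n)
                .parallel()
                .filter(i -> i % 2 == 0)
                .count();
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("unsafe: " + countEvensUnsafe(n));
        System.out.println("safe:   " + countEvensSafe(n));
    }
}
```

The unsafe result can happen to be correct on a given run, which is part of what makes this class of bug hard to spot.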

🏁 Conclusion

Streams and Collectors provide powerful tools for collection transformation, but their complexity can lead to subtle bugs. The key insights are:

  1. toMap() is dangerous without a merge function - always specify one
  2. groupingBy with downstream collectors enables complex aggregations
  3. Parallel streams have overhead - measure before assuming they're faster
  4. Custom collectors solve specialized needs elegantly
  5. Stream pipeline order matters - filter early, transform late

In the next article, we'll explore Views, Wrappers, and Defensive Patterns - techniques for protecting your collections from unintended modification.


📅 Review Schedule

To solidify your understanding, review this material:

  • Tomorrow: Practice toMap() with merge functions
  • In 3 days: Write a groupingBy with nested downstream collectors
  • In 1 week: Implement a custom Collector
  • In 2 weeks: Benchmark sequential vs parallel for your use case