A brief explanation of Scala 2.8 collections

Published on 2011-06-04
Tagged: scala

View All Posts

The new collections in Scala 2.8.0 are awesome, but their design is kind of confusing. Check out the type for the simple function List.map:

def map[B, That](f: (A) ⇒ B)
                (implicit bf: CanBuildFrom[List[A], B, That]): That

In another language like OCaml, we would have a much simpler type:

('a -> 'b) -> 'a list -> 'b list

So why the extra complication? What does that CanBuildFrom do? Much smarter people than me have written excellent explanation already. Martin Odersky has written some very thorough documentation on collections. Daniel C. Sobral also have some amazing Stack Overflow answers like this one. I will attempt a very concise explanation though.

Motivation

Unlike OCaml, Scala has a lot of collections. You have Lists, Sets, Maps, each with pretty much every implementation you can think of. And there are mutable and immutable versions of each. These collections all support dozens of standard methods like map, flatMap, and filter. Writing an implementation of every standard method for every collection would be a pain, and it seems unnecessary since the logic in each standard method is pretty much the same.

There are two major obstacles to having a single collection agnostic implementation of each standard method. First, we lack an abstract way to construct a result. List.map needs to construct a List, Set.map needs to construct a Set, and so on. Second, we don't know what the return type should be. This is how we might naïvely define map:

trait Traversable[+A] {
  def map[B](f: A => B): Traversable[B]
}

If we called this on a List, we would get a Traversable result, so we would need to cast it back to List.

Overcoming obstacles

Let's look at the Traversable.map type closely to see how both problems are solved in the collection library. We have the following type parameters:

You'll notice that there is an implicit parameter of type CanBuildFrom[List[A], B, That]. Here's what CanBuildFrom looks like:

trait CanBuildFrom[-From, -Elem, +To] extends AnyRef {
  def apply(): Builder[Elem, To]
  def apply(from: From): Builder[Elem, To]
}

As you can see, CanBuildFrom is a factory for Builder:

trait Builder[-Elem, +To] extends Growable[Elem] {
  def +=(elem: Elem): Builder[Elem, To]
  def result(): To
  ...
}

A Builder is used to create a new collection. We just create a new Builder using CanBuildFrom.apply, add our elements using +=, and then call result. Each collection defines its own Builder class and a CanBuildFrom factory. Scala's implicit argument handling finds these for us, so we almost never have to pass them explicitly.

Since Builder handles the details of constructing the result, map can delegate all of its collection-specific work to the Builder. This allows a collection agnostic map to be implemented, which solves our first problem. The actual implementation is in TraversableLike.

Our second problem is solved by type parameterization. By simply declaring an extra type parameter for the result type (That), we can always return the correct type without actually knowing what it is. And since Scala has type inference, we rarely need to specify the type parameters explicitly.

Here's one of my favorite examples of this system in action:

scala> val s = BitSet(1, 2, 3)
s: scala.collection.BitSet = BitSet(1, 2, 3)

scala> s.map(_ + 2)
res0: scala.collection.BitSet = BitSet(3, 4, 5)

scala> s.map(_.toString)
res1: scala.collection.Set[java.lang.String] = Set(1, 2, 3)

A BitSet is a set of positive integers represented by a bit field, i.e., bit 5 is set if the integer 5 is in the set. Because of its implementation, BitSet cannot contain anything except integers, so when we map its elements to strings, the result cannot be a BitSet.

How does Scala figure out what to do here? Since we pass map a function of type Int => String, the Scala compiler knows to look for a CanBuildFrom where Elem is String. BitSet provides two instances of CanBuildFrom: one where Elem is Int for building new BitSets, and one where Elem is unspecified for building Sets of any other type. When searching for an implicit argument, the compiler will look for the value with the most specific type. So when we pass an Int => Int function, we get the first builder, but when we pass an Int => String function, we get the second.