Goroutines: the concurrency model we wanted all along

Published on 2023-07-02
Edited on 2023-07-07
Tagged: concurrency go java

I often hear criticisms of Go that seem superficial: the variable names are too short, the functions are not overloadable, the exporting-symbols-by-capitalizing thing is weird, and until recently, generics were non-existent. Go is an idiosyncratic language, and of course I have my own opinions and my own set of gripes too, but today I want to highlight one of the features I like most about the language: goroutines.

Goroutines are Go's main concurrency primitives. They look very much like threads, but they are cheap to create and manage. Go's runtime schedules goroutines onto real threads efficiently to avoid wasting resources, so you can easily create lots of goroutines (like one goroutine per request), and you can write simple, imperative, blocking code. Consequently, Go networking code tends to be straightforward and easier to reason about than the equivalent code in other languages.

For me, goroutines are the single feature that distinguishes Go from other languages. They are why I prefer Go for writing code that requires any amount of parallelism.

Before we talk more about goroutines, let's cover some history so it makes sense why you'd want them.

Forked and threaded servers

High performance servers need to handle requests from many clients at the same time. There are many ways to design a server to handle that.

The simplest (and most naïve) design is a main process that calls accept in a loop, then calls fork to create a child process to handle each request.

I learned about this pattern from the excellent Beej's Guide to Network Programming in the early 2000s when I was a student. fork is a nice pattern to use when learning network programming since you can focus on networking and not server architecture. It's hard to write an efficient server following this pattern though, and I don't think anyone uses it in practice anymore.

There are a lot of problems with fork. First is cost: a fork call on Linux appears to be fast, but it marks all your memory as copy-on-write. Each write to a copy-on-write page causes a minor page fault, a small delay that's hard to measure. Context switching between processes is also expensive. Another problem is scale: it is difficult to coordinate use of shared resources like CPU, memory, database connections, and whatever else among a large number of child processes. If you get a surge of traffic and you create too many processes, they'll contend with each other for the CPU. If you limit the number of processes you create, then a large number of slow clients can block your service for everyone while your CPU sits idle. Careful use of timeouts helps (and is necessary regardless of server architecture).

These problems are somewhat mitigated by using threads instead of processes. A thread is cheaper to create than a process since it shares memory and most other resources. It's also relatively easy to communicate between threads in a shared address space, using semaphores and other constructs to manage shared resources.

Threads do still have a substantial cost though, and you'll run into scaling problems if you create a new thread for every connection. As with processes, you need to limit the number of running threads to avoid heavy CPU contention, and you need to time out slow requests. Creating a new thread still takes time, though you can mitigate that by recycling threads across requests with a thread pool.

Whether you're using processes or threads, you still have a question that's difficult to answer: how many should you create? If you allow an unlimited number of threads, clients can use up all your memory and CPU with a small surge in traffic. If you limit your server to a maximum number of threads, then a bunch of slow clients can clog up your server. Timeouts help, but it's still difficult to use your hardware resources efficiently.

Event-driven servers

Since we can't easily predict how many threads we'll need, what happens when we try to decouple requests from threads? What if we have just one thread dedicated to application logic (or perhaps a small, fixed number of threads), and we handle all the network traffic in the background using asynchronous system calls? This is an event-driven server architecture.

Event-driven servers were designed around the select system call. (Newer mechanisms like poll, epoll, and kqueue have largely replaced select, but select is widely known, and they all serve the same conceptual purpose here.) select accepts a list of file descriptors (generally sockets) and returns which ones are ready to read or write. If none of the file descriptors are ready, select blocks until at least one is.

#include <sys/select.h>

int select(int nfds, fd_set *restrict readfds,
           fd_set *restrict writefds, fd_set *restrict exceptfds,
           struct timeval *restrict timeout);

#include <poll.h>

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

(I want to elaborate a bit on what it means for a socket to be "ready" because I had some trouble understanding this initially. Each socket has a kernel buffer for receiving and a buffer for sending. When your computer receives a packet, the kernel stores its data in the receive buffer for the appropriate socket until your program gets it with recv. Likewise, when your program calls send, the kernel stores the data in the socket's send buffer until it can be transmitted. A socket is ready to receive if there's data in the receive buffer. It's ready to send if there's available space in the send buffer. If the socket is not ready, recv and send block until it is, by default.)
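
Here's a tiny self-contained demonstration of that last point (in Go, for consistency with the rest of this post; the one-second sleep is arbitrary). The Read call doesn't return until data actually lands in the socket's receive buffer:

package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go func() {
		conn, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			log.Fatal(err)
		}
		time.Sleep(time.Second) // the receive buffer stays empty for a second...
		conn.Write([]byte("hello"))
	}()
	conn, err := ln.Accept()
	if err != nil {
		log.Fatal(err)
	}
	buf := make([]byte, 1024)
	n, err := conn.Read(buf) // ...so this blocks for about a second
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("read %q\n", buf[:n])
}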

To implement an event-driven server, you track a socket and some state for each request that's blocked on the network. The server has a single main event loop where it calls select with all those blocked sockets. When select returns, the server knows which requests can make progress, so for each request, it calls into the application logic with the stored state. When the application needs to use the network again, it adds the socket back to the "blocked" pool together with new state. The state here can be anything the application needs to resume what it was doing: a closure to be called back, or a promise to be completed.
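
To make that concrete, here's a minimal sketch of a poll-based event loop (in Go, for consistency with the rest of this post; it assumes Linux, the golang.org/x/sys/unix package, and an arbitrary port). Real servers use epoll or kqueue and carry real per-request state, but the shape is the same: one loop, one poll call, then a dispatch on whichever sockets are ready:

package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// A non-blocking listening socket on port 8080.
	lfd, err := unix.Socket(unix.AF_INET, unix.SOCK_STREAM|unix.SOCK_NONBLOCK, 0)
	if err != nil {
		log.Fatal(err)
	}
	if err := unix.Bind(lfd, &unix.SockaddrInet4{Port: 8080}); err != nil {
		log.Fatal(err)
	}
	if err := unix.Listen(lfd, 128); err != nil {
		log.Fatal(err)
	}

	// The "blocked" pool: every socket we're waiting on, listener included.
	fds := []unix.PollFd{{Fd: int32(lfd), Events: unix.POLLIN}}
	for {
		// Block until at least one socket is ready.
		if _, err := unix.Poll(fds, -1); err != nil {
			continue // interrupted by a signal; poll again
		}
		for i := range fds {
			if fds[i].Revents&unix.POLLIN == 0 {
				continue // not ready; leave it in the pool
			}
			if fds[i].Fd == int32(lfd) {
				// The listener is ready: accept and start tracking the client.
				if cfd, _, err := unix.Accept(lfd); err == nil {
					fds = append(fds, unix.PollFd{Fd: int32(cfd), Events: unix.POLLIN})
				}
			} else {
				// A client is readable: echo one chunk back.
				buf := make([]byte, 4096)
				n, err := unix.Read(int(fds[i].Fd), buf)
				if n <= 0 || err != nil {
					unix.Close(int(fds[i].Fd))
					continue // removal from fds elided for brevity
				}
				unix.Write(int(fds[i].Fd), buf[:n])
			}
		}
	}
}

An echo server's "state" is trivial, which hides the painful part: in a real application, each tracked socket carries a continuation describing what to do when it becomes ready, and that's where the trouble begins.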

This can all technically be implemented with a single thread. I can't speak to the details of any particular implementation, but languages that lack threading, like JavaScript, follow this model pretty closely. Node.js describes itself as "an event-driven JavaScript runtime ... designed to build scalable network applications." This is exactly what they mean.

Event-driven servers can generally make better use of CPU and memory than purely fork- or thread-based servers. You can spawn an application thread per core to handle requests in parallel. The threads don't contend with each other for CPU since the number of threads equals the number of cores. The threads are never idle when there are requests that can make progress. Efficient. So efficient that it's hard to justify writing a server any other way these days.

It sounds like a good idea on paper anyway, but it's a nightmare to write application code that works like this. The specific way in which it's a nightmare depends on the language and framework you're using. In JavaScript, asynchronous functions typically return a Promise, to which you attach callbacks. In Java gRPC, you're dealing with StreamObserver. If you're not careful, you end up with lots of deeply nested "arrow code" functions. If you are careful, you split up your functions and classes, obfuscating your control flow. Either way, you're in callback hell.

To show what I mean, below is an example from the Java gRPC Basics Tutorial. Does the control flow here make sense?

public void routeChat() throws Exception {
  info("*** RoutChat");
  final CountDownLatch finishLatch = new CountDownLatch(1);
  StreamObserver<RouteNote> requestObserver =
      asyncStub.routeChat(new StreamObserver<RouteNote>() {
        @Override
        public void onNext(RouteNote note) {
          info("Got message \"{0}\" at {1}, {2}", note.getMessage(), note.getLocation()
              .getLatitude(), note.getLocation().getLongitude());
        }

        @Override
        public void onError(Throwable t) {
          Status status = Status.fromThrowable(t);
          logger.log(Level.WARNING, "RouteChat Failed: {0}", status);
          finishLatch.countDown();
        }

        @Override
        public void onCompleted() {
          info("Finished RouteChat");
          finishLatch.countDown();
        }
      });

  try {
    RouteNote[] requests =
        {newNote("First message", 0, 0), newNote("Second message", 0, 1),
            newNote("Third message", 1, 0), newNote("Fourth message", 1, 1)};

    for (RouteNote request : requests) {
      info("Sending message \"{0}\" at {1}, {2}", request.getMessage(), request.getLocation()
          .getLatitude(), request.getLocation().getLongitude());
      requestObserver.onNext(request);
    }
  } catch (RuntimeException e) {
    // Cancel RPC
    requestObserver.onError(e);
    throw e;
  }
  // Mark the end of requests
  requestObserver.onCompleted();

  // Receiving happens asynchronously
  finishLatch.await(1, TimeUnit.MINUTES);
}

This is from the beginner tutorial, and it's not a complete example either. The sending code is synchronous while the receiving code is asynchronous.

In Java, you may be dealing with different asynchronous types for your HTTP server, gRPC, SQL database, cloud SDK, and whatever else, and you need adapters between all of them. It gets to be a mess very quickly. Locks are dangerous, too. You need to be careful about holding locks across network calls. It's also easy to make a mistake with locking and callbacks. For example, if a synchronized method calls a function that returns a ListenableFuture then attaches an inline callback, the callback also needs a synchronized block, even though it's nested inside the parent method.

Goroutines

What was this post about again? Oh yes, goroutines.

A goroutine is Go's version of a thread. Like threads in other languages, each goroutine has its own stack. Goroutines may execute in parallel, concurrently with other goroutines. Unlike threads, goroutines are very cheap to create: they aren't bound to an OS thread, and their stacks start out very small (2 KiB) but can grow as needed. When you create a goroutine, you're essentially allocating a closure and adding it to a queue in the runtime.
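
To put "cheap" in perspective, here's a minimal sketch that spawns 100,000 goroutines and waits for them all to finish. It's an illustration rather than a rigorous benchmark, but on a typical machine it completes in a fraction of a second, where the same experiment with OS threads would take far longer, if it finished at all:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < 100000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each goroutine begins with a ~2 KiB stack that grows on demand.
		}()
	}
	wg.Wait()
	fmt.Printf("spawned and joined 100000 goroutines in %v\n", time.Since(start))
}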

Internally, Go's runtime has a set of OS threads that execute goroutines (normally one thread per core). When a thread is available and a goroutine is ready to run, the runtime schedules the goroutine onto the thread, executing application logic. If a goroutine blocks on something like a mutex or channel, the runtime adds it to a set of blocked goroutines then schedules the next ready goroutine onto the same OS thread. This also applies to the network: when a goroutine sends or receives data on a socket that's not ready, it yields its OS thread to the scheduler.

Sound familiar? Go's scheduler acts a lot like the main loop in an event-driven server. Except instead of relying solely on select and focusing on file descriptors, the scheduler handles everything in the language that might block. You no longer need to avoid blocking calls because the scheduler makes efficient use of the CPU either way. You're free to spawn lots of goroutines (one per request!) because they're cheap to create, and the threads don't contend for the CPU (minimal context switching). You don't need to worry about thread pools and executor services because the runtime effectively has one big thread pool.
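
Here's what goroutine-per-request looks like in practice: a minimal sketch of an echo server (the port is arbitrary, and error handling is pared down). The handler is plain blocking code, and the scheduler multiplexes however many connections arrive onto a handful of OS threads:

package main

import (
	"io"
	"log"
	"net"
)

func handle(conn net.Conn) {
	defer conn.Close()
	// Plain blocking code. When Copy blocks on the network, this goroutine
	// yields its OS thread to the scheduler instead of wasting it.
	if _, err := io.Copy(conn, conn); err != nil {
		log.Print(err)
	}
}

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handle(conn) // one cheap goroutine per connection
	}
}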

In short, you can write simple blocking application code in a clean imperative style as if you were writing a thread-based server, but you keep all the efficiency advantages of an event-driven server. Best of both worlds. This kind of code composes well across frameworks. You don't need adapters between your StreamObservers and ListenableFutures.

Let's take a look at the same example from the Go gRPC Basics Tutorial. I find the control flow here easier to comprehend than the Java example because both the sending and receiving code are synchronous. In both goroutines, we're able to call stream.Recv and stream.Send in a for loop. No need for callbacks, subclasses, or executors.

stream, err := client.RouteChat(context.Background())
waitc := make(chan struct{})
go func() {
  for {
    in, err := stream.Recv()
    if err == io.EOF {
      // read done.
      close(waitc)
      return
    }
    if err != nil {
      log.Fatalf("Failed to receive a note : %v", err)
    }
    log.Printf("Got message %s at point(%d, %d)", in.Message, in.Location.Latitude, in.Location.Longitude)
  }
}()
for _, note := range notes {
  if err := stream.Send(note); err != nil {
    log.Fatalf("Failed to send a note: %v", err)
  }
}
stream.CloseSend()
<-waitc

async / await, virtual threads

Go is a fairly new language, and it was designed at a time when event-driven servers were already popular and well-understood. Java, C++, Python, and other languages are older and don't have that same luxury. So as a language designer, what features can you add to make writing asynchronous application code less painful?

async / await has become the main solution in most languages. Wikipedia tells me that it was first added to F# in 2007 (around the same time Go was being developed). It's now in C#, C++, JavaScript, Python, Rust, Swift, and a bunch of other languages.

async / await let you write code that works with asynchronous functions in a style that resembles imperative blocking code. An asynchronous function is marked with the async keyword and returns a promise (or whatever the language equivalent is). When you call an asynchronous function, you can "block" and get the value in the promise with the await keyword. If a function uses await, it must also be marked async, and the compiler rewrites it to return a promise. There are different ways to implement async and await, but the feature generally implies a language has support for cooperatively scheduled coroutines if not full support for threads. When a coroutine awaits a promise that's not completed yet, it yields to the language's runtime, which may schedule another ready coroutine on the same thread.

I'm pleased that so many languages have adopted async / await. Since I've worked primarily in Go, Java, and C++, I haven't personally had much opportunity to use it, but whenever I need to do something in JavaScript, I'm really glad it's there. await control flow is much easier to understand than asynchronous callback code. It is a bit annoying though to annotate functions with async. Bob Nystrom's What Color is Your Function is a humorous criticism of that.

Java is conspicuously absent from the list of languages above. Until now, you've had to either spawn an unreasonable number of threads or deal with Java's particular circle of callback hell. Happily, JEP 444 adds virtual threads, which sound a lot like goroutines. Virtual threads are cheap to create. The JVM schedules them onto platform threads (real threads the kernel knows about). There are a fixed number of platform threads, generally one per core. When a virtual thread performs a blocking operation, it releases its platform thread, and the JVM may schedule another virtual thread onto it. Unlike goroutines, virtual thread scheduling is cooperative: a virtual thread doesn't yield to the scheduler until it performs a blocking operation. This means that a tight loop can hold a thread indefinitely. I don't know whether this is an implementation limitation or if there's a deeper issue. Go used to have this problem until fully preemptive scheduling was implemented in 1.14.
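
For what it's worth, Go's fix is easy to observe. In the sketch below (my illustration, not from any official source), the empty loop contains no function calls or blocking operations, so before Go 1.14 it could monopolize the only OS thread and the program would hang; with asynchronous preemption it prints and exits:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // a single OS thread, to make starvation possible
	go func() {
		for {
			// A tight CPU-bound loop with no preemption points. Before Go 1.14
			// this could hold the thread forever; now the runtime preempts it.
		}
	}()
	time.Sleep(100 * time.Millisecond) // pre-1.14, this might never return
	fmt.Println("the main goroutine still gets to run")
}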

Virtual threads are available for preview in the current release and are expected to become stable in JDK 21 (due September 2023). I'm looking forward to deleting a lot of ListenableFutures, but it's unclear to me how virtual threads will interact with existing frameworks designed around a more purely event-driven model. Whenever a new language or runtime feature is introduced, there's a long cultural migration period, and I think the Java ecosystem is pretty conservative in that regard. Still, I'm encouraged by the enthusiastic adoption of async / await in other languages. I hope Java will be similar with virtual threads.