Organizing Bazel WORKSPACE files

Published on 2020-09-13
Tagged: bazel

Every Bazel project has a WORKSPACE file in its root directory. WORKSPACE has several functions, but its main purpose is to declare external dependencies using repository rules. WORKSPACE files are syntactically similar to BUILD files used to define targets in the rest of the repository, but they're evaluated very differently.

Declaring a list of dependencies seems straightforward, but when there are a huge number of direct and indirect dependencies declared in various functions, it gets to be a lot to manage. WORKSPACE has some surprising behavior when the same repository is declared more than once, which doesn't help either.

In this article, I'll explain how WORKSPACE is evaluated, then I'll give some guidelines for organizing WORKSPACE files to avoid confusion and ambiguity.

Design issues

Managing the WORKSPACE file is one of the most difficult parts of using Bazel in my opinion. This stems from three main design issues:

WORKSPACE files have very little structure. They're essentially Starlark scripts. If they were more strictly declarative, tools could manage them more easily. However, evaluating a WORKSPACE file executes repository rules, which can run arbitrary commands on the host system. Tools can't do that easily or safely.
WORKSPACE files in external repositories are not evaluated recursively, so you're responsible for declaring not only your direct dependencies but also your indirect dependencies. Bazel doesn't give you tools for listing indirect dependencies or resolving conflicts between multiple declarations.
WORKSPACE files have a surprisingly complicated evaluation model. It's difficult for users to read WORKSPACE and predict what version of each dependency will actually be used.

These design issues date back to when Bazel was open sourced. WORKSPACE hasn't changed much since then. There have been a number of attempts to improve Bazel's external dependency management, but it's an inherently difficult problem, and people tend to underestimate how much effort it will take to fix it.

How Bazel evaluates a WORKSPACE file

I first tried to answer this question on Stack Overflow. The official documentation explains the semantics to a degree, but it's light on details. So below is my understanding of how this works.

A WORKSPACE file is essentially a list of load statements, repository declarations, and function calls. Bazel evaluates the file statement by statement, from top to bottom.

A repository declaration is a call to a repository rule like http_archive or go_repository. Each repository has a name and some information on how to fetch it like URLs and SHA-256 sums. Repository rules are evaluated lazily: at the point where a repository is declared, the repository rule's code isn't actually executed.

A repository is fetched (meaning its repository rule is executed) the first time a file is loaded from it. Several things can cause this while WORKSPACE is being evaluated:

A load statement that mentions a .bzl file in the repository is evaluated. The load statement might appear in WORKSPACE or in another .bzl file loaded from WORKSPACE.
A different repository is fetched, and that repository's declaration has an attribute that refers to a file. When a repository is fetched, the labels in its attributes are resolved to files, which may cause other repositories to be fetched. Labels may be part of explicit arguments, or they may be default values for attributes.
The implementation of a different repository rule calls ctx.path to dynamically resolve a label.

The important thing to understand is that a repository isn't fetched until a label mentioning that repository is resolved to a file. It's difficult to be sure about when that happens because there are several cases where it happens implicitly within repository rule implementations.
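
Here is a rough sketch of the three cases. The repository names (my_rules, my_build_files, my_tool, my_config), the example.com URLs, and the generated_repo rule are all hypothetical placeholders. Real declarations should also pin a sha256.

```starlark
# WORKSPACE (sketch; all repository names and URLs are hypothetical)
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

# Declared here, but not fetched yet.
http_archive(
    name = "my_rules",
    urls = ["https://example.com/my_rules-1.0.tar.gz"],
)

http_archive(
    name = "my_build_files",
    urls = ["https://example.com/my_build_files-1.0.tar.gz"],
)

# Case 1: loading a .bzl file from @my_rules fetches @my_rules.
load("@my_rules//:defs.bzl", "my_rules_setup")

# Case 2: build_file is a label attribute. When @my_tool is fetched, the
# label is resolved to a file, which fetches @my_build_files.
http_archive(
    name = "my_tool",
    build_file = "@my_build_files//:my_tool.BUILD",
    urls = ["https://example.com/my_tool-2.3.tar.gz"],
)
```

The third case happens inside a repository rule implementation, so it is not visible in WORKSPACE at all:

```starlark
# tools/generated_repo.bzl (sketch; generated_repo and @my_config are hypothetical)
def _generated_repo_impl(ctx):
    # Case 3: ctx.path resolves the label to a file when the rule runs,
    # which fetches @my_config if it has not been fetched already.
    config = ctx.path(Label("@my_config//:config.json"))
    ctx.symlink(config, "config.json")
    ctx.file("BUILD.bazel", 'exports_files(["config.json"])\n')

generated_repo = repository_rule(implementation = _generated_repo_impl)
```

Whether a fetch actually happens at each of these points depends on what has already been fetched and on how the rules involved are implemented, so treat this as an illustration rather than a precise trace.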

This leads to the most confusing aspect of WORKSPACE evaluation:

A repository may be declared with the same name multiple times without error. This does not create multiple instances of the repository. When a repository is fetched, the most recent declaration evaluated before that point wins. After a repository is fetched, all following declarations are silently ignored.

It's difficult to determine when a repository is fetched, so to avoid ambiguity, you should ensure each repository is declared only once.
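
As an illustration of that rule, consider a fragment that declares the same repository three times. The name my_dep, its defs.bzl file, and the URLs are hypothetical. Under the behavior described above, the second declaration should be the one that takes effect.

```starlark
# WORKSPACE (sketch; @my_dep and its URLs are hypothetical)
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

# Declaration A. Nothing is fetched yet.
http_archive(
    name = "my_dep",
    urls = ["https://example.com/my_dep-1.0.tar.gz"],
)

# Declaration B replaces A, because @my_dep has not been fetched.
http_archive(
    name = "my_dep",
    urls = ["https://example.com/my_dep-2.0.tar.gz"],
)

# Loading from @my_dep fetches it using declaration B (version 2.0).
load("@my_dep//:defs.bzl", "my_dep_setup")

# Declaration C comes after the fetch, so it is silently ignored.
http_archive(
    name = "my_dep",
    urls = ["https://example.com/my_dep-3.0.tar.gz"],
)
```

In a real WORKSPACE, the three declarations are rarely this close together. They are usually hidden inside functions loaded from different dependencies, which is what makes the outcome hard to predict.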

How to organize a WORKSPACE file

Now we come to the practical advice. I recommend organizing statements in WORKSPACE files in the following order:

1. workspace declaration. This must appear before all other calls.
2. load statements for http_archive, git_repository, and repository rules defined in the main workspace. These symbols are needed in the rest of the file, so they must be loaded near the top.
3. Declarations for dependencies that provide repository rules needed later. For example, bazel_gazelle is needed for go_repository.
4. Declarations for direct dependencies. These may appear in the WORKSPACE file itself, or you might load and call a function from a .bzl file somewhere in your workspace.
5. Declarations for indirect dependencies. To declare these, you'll usually load and call functions from your direct dependencies. Check that these functions won't override your direct dependencies (see below).
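
A skeleton that follows this order might look like the sketch below. The workspace name my_project, the repository my_repo_rules with its my_repository rule, my_direct_dep, and all example.com URLs are hypothetical. The protobuf_deps function and its load path come from the protobuf example later in this article.

```starlark
# WORKSPACE (sketch; names marked hypothetical are placeholders)

# Workspace declaration. Must come first.
workspace(name = "my_project")

# Repository rules used in the rest of the file.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

# Dependencies that provide repository rules needed later
# (hypothetical; bazel_gazelle plays this role for go_repository).
http_archive(
    name = "my_repo_rules",
    urls = ["https://example.com/my_repo_rules-1.0.tar.gz"],
)

load("@my_repo_rules//:repo.bzl", "my_repository")

# Direct dependencies, each declared exactly once.
http_archive(
    name = "com_google_protobuf",
    # Placeholder URL: substitute a real release URL and add sha256.
    urls = ["https://example.com/protobuf.tar.gz"],
)

my_repository(
    name = "my_direct_dep",
)

# Indirect dependencies, declared by functions from direct dependencies.
load("@com_google_protobuf//:protobuf_deps.bzl", "protobuf_deps")

protobuf_deps()
```

Because every direct dependency is declared before protobuf_deps() runs, the function's existing_rule checks can skip anything that is already declared.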

Many projects declare indirect dependencies before direct dependencies (reversing the last two steps above). This causes problems because it limits your ability to depend on a specific version of a direct dependency. If a repository is declared by a function provided by one of your dependencies, that declaration may or may not take precedence over your own later declaration: your direct declaration will be silently ignored if the repository has already been fetched.
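
To make the risk concrete, here is a sketch of the problematic order. The repository my_rules, the //third_party:zlib.BUILD file, and the example.com URL are hypothetical, and the fragment assumes @com_google_protobuf and @my_rules were declared earlier in the file.

```starlark
# WORKSPACE (sketch of the problematic order; @com_google_protobuf and the
# hypothetical @my_rules are assumed to be declared above this fragment)
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
load("@com_google_protobuf//:protobuf_deps.bzl", "protobuf_deps")

# Indirect dependencies first: this declares zlib, among others.
protobuf_deps()

# If anything evaluated from here on resolves a label in @zlib, the
# repository is fetched using the declaration made by protobuf_deps().
load("@my_rules//:defs.bzl", "my_rules_setup")

my_rules_setup()

# Direct dependency declared last: silently ignored if @zlib has already
# been fetched, used otherwise.
http_archive(
    name = "zlib",
    build_file = "//third_party:zlib.BUILD",  # hypothetical BUILD file
    urls = ["https://example.com/zlib-custom.tar.gz"],  # placeholder URL
)
```

Moving the zlib declaration above the call to protobuf_deps() removes the ambiguity, since the direct declaration is then the first one Bazel sees.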

Providing dependencies for other projects

If your project can be built with Bazel, and other projects can depend on it, you should provide a function that declares your direct and indirect dependencies so that other projects can declare them without knowing the details. Let's look at the function from @com_google_protobuf//:protobuf_deps.bzl as an example:

load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

def protobuf_deps():
    """Loads common dependencies needed to compile the protobuf library."""


    if not native.existing_rule("bazel_skylib"):
        http_archive(
            name = "bazel_skylib",
            sha256 = "97e70364e9249702246c0e9444bccdc4b847bed1eb03c5a3ece4f83dfe6abc44",
            urls = [
                "https://mirror.bazel.build/github.com/bazelbuild/bazel-skylib/releases/download/1.0.2/bazel-skylib-1.0.2.tar.gz",
                "https://github.com/bazelbuild/bazel-skylib/releases/download/1.0.2/bazel-skylib-1.0.2.tar.gz",
            ],
        )

    if not native.existing_rule("zlib"):
        http_archive(
            name = "zlib",
            build_file = "@com_google_protobuf//:third_party/zlib.BUILD",
            sha256 = "629380c90a77b964d896ed37163f5c3a34f6e6d897311f1df2a7016355c45eff",
            strip_prefix = "zlib-1.2.11",
            urls = ["https://github.com/madler/zlib/archive/v1.2.11.tar.gz"],
        )
# Many more dependencies after this

There are several good lessons to learn from this file:

Name the file deps.bzl or something similar (protobuf_deps.bzl in this case), and put it in the root directory of the repository so it's easy to find.
Keep the file simple. Avoid loading .bzl files from other repositories, since that forces those repositories to be declared before your file can be loaded.
Don't override earlier declarations of the same repositories. You can check whether a dependency has been declared by calling native.existing_rule with its name, as above.

You may want to define a small function like this:

def _maybe(rule, name, **kwargs):
    if not native.existing_rule(name):
        rule(name = name, **kwargs)

Then you can declare dependencies like this:

_maybe(
    http_archive,
    name = "zlib",
    build_file = "@com_google_protobuf//:third_party/zlib.BUILD",
    sha256 = "629380c90a77b964d896ed37163f5c3a34f6e6d897311f1df2a7016355c45eff",
    strip_prefix = "zlib-1.2.11",
    urls = ["https://github.com/madler/zlib/archive/v1.2.11.tar.gz"],
)
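
Putting these lessons together, a deps.bzl for a hypothetical project might look like the sketch below. The names my_project and my_project_deps and the example.com URL are made up for illustration. Instead of a private _maybe helper, this version loads maybe from @bazel_tools//tools/build_defs/repo:utils.bzl, which ships with recent Bazel versions and takes the same arguments. The bazel_skylib attributes are copied from the protobuf example above.

```starlark
# deps.bzl in the root of the hypothetical @my_project repository.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
load("@bazel_tools//tools/build_defs/repo:utils.bzl", "maybe")

def my_project_deps():
    """Declares dependencies of my_project that are not already declared."""
    maybe(
        http_archive,
        name = "bazel_skylib",
        sha256 = "97e70364e9249702246c0e9444bccdc4b847bed1eb03c5a3ece4f83dfe6abc44",
        urls = [
            "https://mirror.bazel.build/github.com/bazelbuild/bazel-skylib/releases/download/1.0.2/bazel-skylib-1.0.2.tar.gz",
            "https://github.com/bazelbuild/bazel-skylib/releases/download/1.0.2/bazel-skylib-1.0.2.tar.gz",
        ],
    )
```

A project that depends on it would then declare my_project directly and call the function afterwards:

```starlark
# WORKSPACE of a project that depends on the hypothetical @my_project.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "my_project",
    # Placeholder URL: substitute a real release URL and add sha256.
    urls = ["https://example.com/my_project-1.0.tar.gz"],
)

load("@my_project//:deps.bzl", "my_project_deps")

my_project_deps()
```

Loading only from @bazel_tools keeps the file usable without any prior declarations, since that repository is built into Bazel.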

Conclusion

I'll wrap this up by saying dependency management is inherently complicated. When you depend on another project, you're trusting its authors to deliver something that performs well and is free of bugs and security vulnerabilities. Dependencies are often necessary; after all, we don't want to write our own crypto libraries. But taking on a dependency is dangerous, and it should be done consciously and carefully. For further reading, I strongly recommend Russ Cox's "Our Software Dependency Problem".