How GitHub's upgrade broke Bazel builds

And how we can be more resilient in the future

Published on 2023-02-08
Tagged: bazel dependency-management

Last week, GitHub upgraded the internal version of Git they use to produce repository archives. You've probably used these archives before if you've downloaded a .zip or .tar.gz file from a repository at a particular version. GitHub produces those archives on demand using git archive and caches them for a short time.

Upgrading Git regularly is generally a good idea, but this change regrettably broke a huge number of Bazel projects. What happened? Most Bazel projects fetch at least some of their dependencies using rules in their WORKSPACE files like this:

http_archive(
    name = "com_github_bazelbuild_buildtools",
    sha256 = "05eff86c1d444dde18d55ac890f766bce5e4db56c180ee86b5aacd6704a5feb9",
    strip_prefix = "buildtools-6.0.0",
    urls = ["https://github.com/bazelbuild/buildtools/archive/refs/tags/6.0.0.tar.gz"],
)

See that /archive/refs/tags/ part of the path? That's the endpoint I'm talking about.

This is bare-bones dependency management: Bazel attempts to download an archive from the first URL in the list; if that URL is not available, it tries the next one, and so on. Bazel then checks the file's SHA-256 sum against the known value, and if it's correct, extracts the archive and proceeds with the build.
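Here's a minimal Python sketch of that download-then-verify flow. It is not Bazel's implementation: `fetch_verified` and the injected `fetch` callable are hypothetical names, and failing immediately on a checksum mismatch (rather than trying the next URL) is a choice I made for the sketch.

```python
import hashlib
from typing import Callable, Iterable


def fetch_verified(urls: Iterable[str], expected_sha256: str,
                   fetch: Callable[[str], bytes]) -> bytes:
    """Return the first archive that downloads, if its SHA-256 matches.

    `fetch` is a hypothetical download function: it returns the response
    body, or raises OSError when a URL is unavailable.
    """
    errors = []
    for url in urls:
        try:
            data = fetch(url)
        except OSError as exc:
            # This URL is unavailable; fall through to the next one.
            errors.append(f"{url}: {exc}")
            continue
        actual = hashlib.sha256(data).hexdigest()
        if actual != expected_sha256:
            # The download worked, but it is not the file that was pinned.
            # (Assumption of this sketch: stop here instead of retrying.)
            raise ValueError(f"{url}: expected {expected_sha256}, got {actual}")
        return data
    raise OSError("all URLs failed: " + "; ".join(errors))
```

Passing the downloader in as a parameter keeps the sketch independent of any particular HTTP library. The important property is that nothing is returned for extraction unless the bytes match the pinned sum.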

The Git upgrade caused a change in archives' SHA-256 sums. I think there was a small change in gzip compression, but it doesn't really matter: any variation in file ordering, alignment, or compression causes the archives' SHA-256 sums to change even though the extracted contents are the same.
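To see why, here's a small Python sketch that compresses the same bytes with two different settings. The data is a made-up stand-in for an archive's contents, and the two compression levels are just a convenient way to get two different but equally valid encodings; this is not a reproduction of what actually changed in Git.

```python
import gzip
import hashlib

# Stand-in for an archive's uncompressed contents (hypothetical data).
contents = b"".join(
    b"file %d: hello from the repository\n" % i for i in range(200)
)

# Compress the same bytes with two different settings. mtime=0 fixes the
# timestamp in the gzip header, so only the compression settings differ.
fast = gzip.compress(contents, compresslevel=1, mtime=0)
best = gzip.compress(contents, compresslevel=9, mtime=0)

sum_fast = hashlib.sha256(fast).hexdigest()
sum_best = hashlib.sha256(best).hexdigest()

print("same archive bytes:     ", fast == best)
print("same SHA-256:           ", sum_fast == sum_best)
print("same extracted contents:",
      gzip.decompress(fast) == gzip.decompress(best) == contents)
```

The compressed bytes and their sums differ, while decompressing either one gives back identical contents. A pinned sha256 only ever sees the compressed bytes.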

This is at least the third time I can remember that a change like this has broken Bazel builds, and the problem has been discussed extensively before. I'm writing this with the hope that we can make our systems more resilient and avoid these kinds of problems in the future.

What could GitHub do better?

Since GitHub made the change that triggered this, they naturally get the immediate blame from the community, though I think it's mostly undeserved. Upgrading dependencies (especially Git, especially if you're GitHub) is a reasonable thing to do. To my knowledge, GitHub has not documented a guarantee that files returned by the archive endpoints have stable SHA-256 sums. It's a mistake for users to rely on a guarantee that was never made. It's tempting of course because it's easy, but it's a mistake nonetheless.

This is a classic example of Hyrum's Law.

With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody.

Since these updates have broken Bazel (and presumably others) a few times now, I'd really like to see GitHub clarify in documentation whether users should or should not depend on stable archive SHA-256 sums. A GitHub engineer commented that this is not stable, but product managers and support engineers have commented at other times that it is stable. I don't really think discussion comments count since they're not discoverable. Only official documentation is authoritative.

I haven't actually found any documentation for these release archive URLs, so I'm not sure where this clarification should go. It's not part of the REST API. GitHub's "Linking to releases" documentation page is pretty close.

If archive SHA-256 sums ARE guaranteed to be stable (now or in the future), I think documenting and testing that would let us all sleep easier at night.

If archive SHA-256 sums ARE NOT guaranteed to be stable, it wouldn't be a terrible idea to inject a little chaos to prevent people from depending on them. For example, in Go, the iteration order of elements in a map is undefined. To prevent developers from depending on iteration order (and tests from breaking when the hashing algorithm is tweaked), the Go runtime adds a random factor into the hashing algorithm, so the iteration order is different every time a program runs. Something similar could be done here with archive file order or alignment. I wouldn't suggest gratuitously breaking this API, but if it needs to change anyway for some reason in the future, it would be a good idea to add something like this.

What could the Bazel community do better?

Bazel developers should not rely on stable archive SHA-256 sums unless that stability is guaranteed and documented by GitHub. More importantly, developers should not rely on dependency artifacts being available on GitHub at all: a library author could delete their project at any time.

I'll point to Go modules as a model of a great dependency management system, designed to solve this exact problem. The Go team operates proxy.golang.org, a mirror for all publicly available Go modules. Internally, the proxy stores actual files for each module and does not need to regenerate them. The proxy protocol is open and easy to implement as an HTTP file server, so you can run your own proxy service for better availability. I'd love to see something like this happen for Bazel, especially if it's operated by Google. It is not technically difficult to build a service like this, but there are a lot of thorny issues around handling abuse and legally distributing software with unrecognized licenses, and Google has already figured out those issues for Go.

Until such a service exists, developers can protect themselves by copying their dependencies to their own mirror. A GCS or S3 bucket works fine.

Library authors can and should protect their users by providing static release artifacts (not dynamically generated archives), and mirroring those. For example, check out the http_archive boilerplate for rules_go:

http_archive(
    name = "io_bazel_rules_go",
    sha256 = "dd926a88a564a9246713a9c00b35315f54cbd46b31a26d5d8fb264c07045f05d",
    urls = [
        "https://mirror.bazel.build/github.com/bazelbuild/rules_go/releases/download/v0.38.1/rules_go-v0.38.1.zip",
        "https://github.com/bazelbuild/rules_go/releases/download/v0.38.1/rules_go-v0.38.1.zip",
    ],
)

The file rules_go-v0.38.1.zip is created by the rule authors and attached to the release; it's not dynamically generated.

It's also copied to mirror.bazel.build, which is a thin frontend on a GCS bucket, shared by many rule authors in the bazelbuild organization.
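If you run your own mirror with the same layout, building the urls list is mechanical. Here's a minimal Python sketch; `with_mirror` is a hypothetical helper, the default mirror host is a placeholder, and the path layout (mirror host, then the original host and path) is simply the one visible in the rules_go example above.

```python
from urllib.parse import urlsplit


def with_mirror(url: str, mirror: str = "https://mirror.example.com") -> list[str]:
    """Return [mirror URL, original URL], with the mirror listed first.

    The mirror path is the original host followed by the original path.
    The default mirror host is a placeholder, not a real service.
    """
    parts = urlsplit(url)
    mirrored = f"{mirror.rstrip('/')}/{parts.netloc}{parts.path}"
    return [mirrored, url]


urls = with_mirror(
    "https://github.com/bazelbuild/rules_go/releases/download/v0.38.1/rules_go-v0.38.1.zip",
    mirror="https://mirror.bazel.build",
)
for u in urls:
    print(u)
```

Listing the mirror first means the build uses your copy whenever it is available and only falls back to GitHub when it is not.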

One other tip: if you're feeling adventurous enough to use an experimental, undocumented feature (to make your build more stable! Really!), you can configure Bazel's downloader to rewrite those GitHub URLs to point to your own mirror.

Aside: SHA-256 of archives or contents?

It's unfortunate that a change to git archive that does not affect the extracted contents of an archive can still change its SHA-256 sum. Bazel absolutely does the right thing, though, by checking the sum of the downloaded file before extracting its contents.

This is the delightfully named Cryptographic Doom Principle. If Bazel only authenticated the contents of an archive, it might be possible for an attacker to exploit a vulnerability in Bazel's zip parser before the archive is authenticated. Since Bazel authenticates the archive before extracting it, the pre-authentication attack surface is very small.
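Here's a minimal Python sketch of that ordering, using Python's zipfile module as a stand-in for an archive parser. The function name and the in-memory archive are hypothetical, and this is not Bazel's code; the point is only that the hash check runs on raw bytes before any parsing happens.

```python
import hashlib
import io
import zipfile


def list_names_if_authentic(archive: bytes, expected_sha256: str) -> list[str]:
    """Hash the raw bytes first; only parse the archive if they match."""
    actual = hashlib.sha256(archive).hexdigest()
    if actual != expected_sha256:
        # Reject before parsing: the zip parser never sees untrusted bytes.
        raise ValueError("checksum mismatch; archive not parsed")
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.namelist()


# Build a tiny in-memory archive to stand in for a downloaded dependency.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("BUILD.bazel", "# hypothetical contents\n")
good = buf.getvalue()
pinned = hashlib.sha256(good).hexdigest()

# A single flipped byte stands in for a maliciously crafted archive.
tampered = bytes([good[0] ^ 0xFF]) + good[1:]
```

Calling `list_names_if_authentic(good, pinned)` parses the archive and returns its file names, while `list_names_if_authentic(tampered, pinned)` raises before zipfile is ever invoked.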

Closing thoughts

When you're designing software, think carefully about how it's going to be used. If there's a right way and a wrong way to do something, make sure the right way is easier and more obvious. Better yet, make the right way the only way.

I think this is a case where Bazel's dependency management is too limited: to use http_archive safely, you need to set up an HTTP mirror with copies of your dependencies. That's too much work for users, especially new users who aren't aware of the hazards. A more complete dependency management system should include an artifact registry or a read-through caching system with at least one public implementation. I was hoping Bazel Modules and the Bazel Central Registry would provide that, but the central registry only includes module metadata: module content is separate, specified in URLs that still frequently refer to the unstable GitHub endpoint.