Migrating to Bazel: Part 1

Published on 2017-02-21
Tagged: bazel gypsum

View All Posts

This article is part of the series "Migrating Gypsum and CodeSwitch to Bazel".

Bazel is the open source version of Google's internal build system. I transferred to a new team where I'll be working with Bazel a lot, so I figured I'd migrate Gypsum and CodeSwitch to use Bazel. I learned a lot.

Introducing Bazel

Bazel is relatively new (first published in March 2015, not 1.0 yet), so there isn't a large community yet, and there aren't a lot of projects using it. The system Bazel is derived from is quite capable and mature though. That said, the open source use case is different from Google's internal use case, and you may find some odd anachronisms and adaptations.

Bazel is configured using three sets of files. You need a WORKSPACE file in the root directory of your project. It tells Bazel where to find your external dependencies. You need one or more BUILD files (one in each directory where you want to build something). Everything that can be built, tested, or run is declared in these files. If you want to extend Bazel with new rules, macros, or helper functions, you can define those in files ending with the .bzl extension.

Bazel files are written in a language called Skylark, which is a minimal subset of Python. This is great, since you don't have to learn a new language to write your build files. Skylark lacks a lot of the powerful Python concepts like classes and exceptions. Skylark isn't even Turing complete since recursion is not allowed, and you can only loop over collections of finite length. I consider these limitations an advantage, having seen some absurdly complicated things written in Makefiles and SCons. Anything that can't be expressed in Skylark can always go in a separate script without complicating the build system.

Bazel is a cross-language, cross-platform build tool. It has native support for building executables, libraries, and tests in Java, C++, Objective C, Python, and Bash. This includes support for Android and iOS apps. There are open source extensions that add support for Go, Scala, Rust, and many other languages. I added support for Gypsum, but I'll cover that in the next article.

One of Bazel's key advantages is full support for generated code. You don't have to run a separate code generation script before starting the build or anything crazy like that. Files can be created with genrules, which execute shell commands or executables (which themselves may be generated). The outputs of a genrule may be used as sources for other targets.
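As a sketch of the shape of a genrule (the target and file names here are illustrative, not from Gypsum), `$<` expands to the single input file and `$@` to the single output file:

```python
# Hypothetical genrule: stamp a version header from a text file.
# "$$" escapes "$" so the shell, not Bazel, expands the subcommand.
genrule(
    name = "version_h",
    srcs = ["version.txt"],
    outs = ["version.h"],
    cmd = "echo \"#define VERSION \\\"$$(cat $<)\\\"\" > $@",
)
```

Any target that lists `version.h` in its srcs will cause the genrule to run first.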

Why use Bazel for Gypsum?

Previously, I was using makefiles to build and test Gypsum and CodeSwitch. These had gotten pretty difficult to maintain. I always dreaded adding a new test that needed to be built in an unusual way because it meant copying and customizing a lot of makefile boilerplate. Make also makes it really easy to forget about dependencies: you might accidentally leave something out, make a change, then wonder why some target is not getting rebuilt. Make doesn't rebuild when you change the makefiles themselves or when you change the configuration through environment variables or arguments. It also doesn't fetch external dependencies or cache test results. Make had to go.

Gypsum is a cross-language project, so I couldn't choose something that was C++-specific or Python-specific. I also wanted something that could easily be extended to work with Gypsum itself, since I need to build the standard library and a lot of tests. Bazel was a natural choice, especially since I'm already familiar with it through work.

Building the compiler with Bazel

I started the migration by creating a WORKSPACE file that imports PyYAML from PyPI. PyYAML is the only Python library Gypsum depends on that doesn't come with Python 2.7, so there was only one rule in this file.

# Bazel file for fetching external dependencies

# YAML is used to parse files in the common/ directory that contain
# information about opcodes, flags, and builtin classes and functions.
new_http_archive(
    name = "yaml",
    build_file = "BUILD-yaml",
    sha256 = "592766c6303207a20efc445587778322d7f73b161bd994f227adaa341ba212ab",
    url = ("https://pypi.python.org/packages/4a/85/" +
        "db5a2df477072b2902b0eb892feb37d88ac635d36245a72a6a69b23b383a" +
        "33/PyYAML-3.12.tar.gz"),
    strip_prefix = "PyYAML-3.12/lib/yaml",
)
new_http_archive is a rule that fetches an archive file (a .tar.gz file in this case) from a URL. The archive is checked against the given SHA-256 sum.

Bazel targets are declared in BUILD files. Since PyYAML is not built with Bazel, I had to supply my own. Fortunately, PyYAML is pure Python, so this was very simple. Here is BUILD-yaml.

# Bazel file for PyYAML external dependency

py_library(
    name = "yaml",
    srcs = glob(["*.py"]),
    visibility = ["//visibility:public"],
)
Bazel targets are referenced with labels. A label has three parts: a repository name, a package path (a package is a directory containing a build file), and a target name. These are written in the format @repo//package/path:target. So the label for YAML is @yaml//:yaml (the package path is empty since the library is declared in the root package of that repository). The repository name can be omitted for targets in the local repository. The package path can be omitted for targets in the same package. The target name can be omitted if it is the same as the last component of the package path.
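To make the shorthand rules concrete, here are several ways of writing labels for targets in this project (which form is legal depends on where the reference appears):

```
@yaml//:yaml        # external repository; empty package path, target "yaml"
//gypsum:compiler   # local repository; usable from any package in it
:compiler           # same package: repository and path omitted
//gypsum            # target name omitted: shorthand for //gypsum:gypsum
```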

With the YAML dependency in place, I can build the compiler. Here's part of the gypsum/BUILD file.

filegroup(
    name = "sources",
    srcs = glob(["*.py"], exclude=["test_*.py", "utils_test.py"]),
    visibility = ["//:__subpackages__"],
)

filegroup(
    name = "common",
    srcs = [
        "builtins.yaml",
        "flags.yaml",
        "opcodes.yaml",
    ],
    visibility = ["//:__subpackages__"],
)

GYPSUM_DEPS = ["@yaml//:yaml"]

py_binary(
    name = "compiler",
    srcs = [":sources"],
    deps = GYPSUM_DEPS,
    data = [":common"],
    main = "__main__.py",
    visibility = ["//visibility:public"],
)
:sources is the set of Python files that are part of the compiler proper. This excludes tests and test utilities in the same directory. A filegroup is not a buildable target, but it can be used to control visibility. It expands as part of the srcs of other targets. visibility = ["//:__subpackages__"] means that :sources is visible to targets in every package in this repository but not to targets outside this repository.

:common is a set of YAML files used by the compiler. I've created symlinks from the gypsum/ directory to the real files in the common/ directory to simplify packaging.

:compiler is the actual compiler executable. In this declaration, srcs refers to a list of Python source files used to build the compiler. deps is a list of library dependencies. data is a list of files which are not part of the generated executable but must be available at run-time (the .yaml files).

You can build the compiler by running bazel build //gypsum:compiler. This produces the executable bazel-bin/gypsum/compiler. This is a tree of symlinked Python files, not a single binary that can be distributed. Useful nonetheless. You can run the compiler with bazel run //gypsum:compiler args....

Being able to build and run the compiler is great, but I also need to test it. Here's the rest of gypsum/BUILD.

[py_test(
    name = test_file[:-3],
    size = "small",
    srcs = [
        test_file,
        "utils_test.py",
        ":sources",
    ],
    deps = GYPSUM_DEPS,
    data = [":common"],
) for test_file in glob(["test_*.py"])]

This is a Python-style list comprehension that declares a py_test target for each file matching the pattern test_*.py. List comprehensions are the only looping constructs allowed inside BUILD files. Each target name is test_file[:-3], which is the test file name with the ".py" extension removed (this is a string slice). size = "small" means each test is expected to run in less than 60 seconds. Bazel will time out tests that take too long. srcs, deps, and data have the same meaning as before.

You can run all of the compiler tests with bazel test //gypsum:all. You can run an individual test (the parser test for example) with bazel test //gypsum:test_parser.

Building CodeSwitch with Bazel

CodeSwitch is in its own directory, so it has its own BUILD file, codeswitch/BUILD. CodeSwitch is written in C++, but the build rules are pretty similar to the Python rules. There are two major differences that make building CodeSwitch more challenging. First, there are some platform-specific source files (glue code for native functions) that need to be conditionally included depending on the target configuration. Second, CodeSwitch relies on generated source files instead of reading and parsing the YAML files directly.

I'll show the rule for the CodeSwitch library first, then we can get into the complications mentioned above.

cc_library(
    name = "codeswitch",
    srcs = glob(["src/*.h", "src/*.cpp"]) + [
        # generated sources (see the genrules below)
        "src/builtins.h",
        "src/flags.h",
        "src/opcodes.h",
        "src/roots_builtins.cpp",
    ] + select(
        {
            # platform-specific glue code (glob patterns illustrative)
            ":linux-x64": glob([
                "src/*-linux-x64.cpp",
            ]),
            ":osx-x64": glob([
                "src/*-osx-x64.cpp",
            ]),
        },
        no_match_error = "unsupported platform",
    ),
    hdrs = glob(["include/*.h"]),
    includes = ["include"],
    defines = [
    ] + select(
        {
            ":debug": ["DEBUG"],
            "//conditions:default": [],
        },
    ),
    linkopts = ["-ldl"],
    visibility = ["//visibility:public"],
)
Let's talk about conditionally included sources first. These are brought in with a call to the select function. select takes a dictionary as its argument. Each key is a label that references a config_setting, a named pattern that can match the current build configuration. Each value is something that select should return if the config_setting matches. In this case, I've declared two config_settings that match the platforms that CodeSwitch supports (linux-x64 and osx-x64). These are passed to select, along with the source files specific to those platforms. Here are the settings:

config_setting(
    name = "linux-x64",
    values = {
        "cpu": "k8",
    },
)

config_setting(
    name = "osx-x64",
    values = {
        "cpu": "darwin",
    },
)
"cpu" refers to the value of the --cpu command line flag, which defaults to the current platform (k8 for Linux running on an amd64 processor; darwin for macOS also running on amd64). You can set this explicitly for cross compilation.
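For example, assuming a suitable cross-compilation toolchain is configured, the commands might look like this:

```shell
# Build for the host platform (--cpu defaults to the host):
bazel build //codeswitch

# Explicitly target macOS on amd64, selecting the :osx-x64 sources:
bazel build --cpu=darwin //codeswitch
```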

Next, let's talk about generated sources. You can generate files using a shell command or a custom executable with a genrule. I use some Python scripts to generate my sources. They all follow the same pattern, so I created a macro in Skylark:

def py_gen_file(name, script, data, out):
    """Generate a file using a Python script.

    Note that the script may not be used multiple times, since this macro
    creates a py_binary for it.

    Args:
        name: label for the generated file.
        script: the Python script to execute.
        data: data file provided to the script on the command line.
        out: output file.
    """
    script_name = "gen_" + name
    native.py_binary(
        name = script_name,
        srcs = [script],
        main = script,
        deps = ["@yaml//:yaml"],
    )
    native.genrule(
        name = name,
        srcs = [data],
        tools = [script_name],
        outs = [out],
        cmd = "python $(location :%s) $< $@" % script_name,
    )

Each time the macro is invoked, it creates two targets: a py_binary for the generating script, and a genrule that runs the script and produces the C++ source. Here is the macro in action:

py_gen_file(
    name = "builtins",
    script = "src/gen_builtins_h.py",
    data = "//:common/builtins.yaml",
    out = "src/builtins.h",
)

py_gen_file(
    name = "flags",
    script = "src/gen_flags_h.py",
    data = "//:common/flags.yaml",
    out = "src/flags.h",
)

py_gen_file(
    name = "opcodes",
    script = "src/gen_opcodes.py",
    data = "//:common/opcodes.yaml",
    out = "src/opcodes.h",
)

py_gen_file(
    name = "roots_builtins",
    script = "src/gen_roots_builtins_cpp.py",
    data = "//:common/builtins.yaml",
    out = "src/roots_builtins.cpp",
)

With conditional includes and generated sources out of the way, hopefully the rest of :codeswitch makes sense. The CodeSwitch command line driver is a very simple program built on top of this:

cc_binary(
    name = "codeswitch_cmd",
    srcs = ["programs/driver.cpp"],
    deps = [":codeswitch"],
    copts = ["-std=c++11"],
    visibility = ["//visibility:public"],
)

The tests are unfortunately much more complicated because they depend on compiled Gypsum bytecode. I'll discuss that more in the next article.


Conclusion

I've been frustrated by terrible build systems for a long time. Most are configured with some clunky domain-specific language or pile byzantine rules on top of a user-hostile format like JSON or XML. It's frequently difficult to use generated code or add support for a new programming language.

Bazel feels like a breath of fresh air. Its configuration language is a subset of Python, which makes it intuitive and readable. Its extension mechanism is a little difficult to learn (perhaps under-documented), but it works well once you understand it, and you can get a lot done with a small amount of code. More about that next time though.