Migrating to Bazel: Part 2

Published on 2017-03-16
Tagged: bazel gypsum

View All Posts

This article is part of the series "Migrating Gypsum and CodeSwitch to Bazel".

In the previous article, I talked about migrating Gypsum and CodeSwitch, my open source compiler and interpreter, to the Bazel build system. I focused mostly on getting Python and C++ libraries and tests to build. Bazel has built-in support for Python and C++, so this wasn't too challenging.

This time, I'll talk about adding support to Bazel for building Gypsum packages. Before we get to that though, I'll give a bit more background on the configuration language and how Bazel deals with extensions.

Skylark: differences with Python

As you'll recall, Skylark is Bazel's build configuration and extension language. Skylark is basically Python without the language constructs that make Python programs complicated.

In terms of syntax, Skylark has variables, functions, list comprehensions, if-statements, for-loops (but no while loops), and most of the expressions you normally see in Python. In terms of data structures, Skylark provides tuples, lists, dicts, and sets. There are no classes, exceptions, or iterators. Recursion is not allowed. This means Skylark is not actually Turing-complete.
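
Since recursion and while-loops are off the table, anything unbounded has to be rewritten as a for-loop over a concrete sequence. As a rough illustration (a hypothetical helper, not from Gypsum's build files, written in the Python subset that Skylark accepts), here's how a transitive-dependency walk might look without recursion:

```python
def collect_transitive_deps(direct_deps, dep_map):
    """Collects transitive deps with a bounded for-loop instead of recursion."""
    result = list(direct_deps)
    frontier = list(direct_deps)
    # Iterate a bounded number of times: each pass can only discover
    # targets that are keys of dep_map, so len(dep_map) passes suffice.
    for _ in range(len(dep_map)):
        next_frontier = []
        for dep in frontier:
            for d in dep_map.get(dep, []):
                if d not in result:
                    result.append(d)
                    next_frontier.append(d)
        frontier = next_frontier
    return result

print(collect_transitive_deps(["a"], {"a": ["b"], "b": ["c"]}))  # ['a', 'b', 'c']
```

Because every loop runs over a sequence whose length is fixed up front, Bazel can guarantee the program terminates.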

Skylark has two dialects. The build dialect is used to write BUILD files that describe what is being built. This is where the cc_binary and py_test rules are declared. These files only allow rule declarations (which look like function calls), variable definitions, and basic expressions. Function definitions and if-statements are not allowed. You can declare a group of similar rules inside a list comprehension though.

Here's a snippet from a BUILD file:

# Variable declaration
GYPSUM_DEPS = ["@yaml//:yaml"]

# Rule declaration / function call
py_binary(
    name = "compiler",
    srcs = [":sources"],
    deps = GYPSUM_DEPS,
    data = [":common"],
    main = "__main__.py",
    visibility = ["//visibility:public"],
)

# Rule declarations inside a list comprehension
[py_test(
    name = test_file[:-len(".py")],
    size = "small",
    srcs = [
        test_file,
        "utils_test.py",
        ":sources",
    ],
    deps = GYPSUM_DEPS,
    data = [":common"],
) for test_file in glob(["test_*.py"])]

The extension dialect is used to write new functions, macros, and rules in .bzl files. All Skylark syntax is allowed in these files.

# Simple function definition
def gy_test_name(file_name):
    """Computes a unit test name, given a Gypsum file name.

    For example, given "test/map-list-example.gy", this returns "MapListExample".
    """
    base_name = file_name[file_name.index("/") + 1 : file_name.rindex(".")]
    words = base_name.split("-")
    name = "".join([w.capitalize() for w in words])
    return name
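
Since this helper is plain Python as well as valid Skylark, it can be sanity-checked outside Bazel; the example from the docstring behaves as advertised:

```python
def gy_test_name(file_name):
    # Same logic as the helper above: strip the directory and extension,
    # then CamelCase the hyphen-separated words.
    base_name = file_name[file_name.index("/") + 1 : file_name.rindex(".")]
    words = base_name.split("-")
    return "".join([w.capitalize() for w in words])

print(gy_test_name("test/map-list-example.gy"))  # MapListExample
```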

Definitions in .bzl files can be loaded into BUILD files or other .bzl files with the load function. Definitions prefixed with "_" are private and cannot be loaded, but everything else is visible. When a file is loaded, its global state is recursively frozen. This means that everything you load from another file is immutable. This enables Bazel to load files in parallel and cache their contents. It also frees you from having to worry about the order in which files are loaded or weird side effects from loading unrelated files (which is a big problem with Makefiles).

load("//:build_defs.bzl", "gy_library")
load(":build_defs.bzl", "py_gen_file", "doxygen_archive", "gy_test_name")

Build phases

Bazel does its work in three phases: loading, analysis, and execution. It's important to be aware of these phases in order to understand where rules and macros fit in and what they can do.

During the loading phase, Bazel loads and evaluates BUILD and .bzl files. Bazel will only evaluate files for packages requested on the command line and files loaded with the load function. The result of the loading phase is the rule dependency graph. Each vertex in this graph is a rule (declared with cc_library, etc.). Each edge is a dependency, e.g., a cc_library depends on source files and other libraries.

The analysis phase translates the rule dependency graph into a file dependency graph. The file graph is a relation between build artifacts and their dependencies. Each build artifact has an action that specifies how to build it. Unlike the rule graph, the file graph contains intermediate results that aren't explicitly mentioned in BUILD files. For example, think about how a cc_binary is specified and built. The rule declares the resulting executable, the sources, and the dependencies. The rule implementation adds the binary and the intermediate .o files (which are never mentioned in BUILD files) to the file graph, along with the compile and link actions that produce them.

During the execution phase, actions are executed in order to produce the targets specified by the user. Bazel scans the file graph for files that are missing or out of date, then runs the associated actions in a sandboxed environment. The sandbox prevents an action from referencing dependencies that Bazel does not know about.

Macros and rules

You can extend Bazel by writing new macros and rules in a .bzl file. A macro is a function that instantiates one or more rules. This is useful for eliminating repetition in build files. For example, if you want to build several similar executables (py_binary), then generate source files with them (genrule), you can use a macro. Macros are evaluated during the loading phase.

Macros are declared like regular Python functions. When you instantiate a rule inside a macro, you must use the "native." prefix.

def py_gen_file(name, script, data, out):
    script_name = "gen_" + name
    native.py_binary(
        name = script_name,
        srcs = [script],
        main = script,
        deps = ["@yaml//:yaml"],
    )
    native.genrule(
        name = name,
        srcs = [data],
        tools = [script_name],
        outs = [out],
        cmd = "python $(location :%s) $< $@" % script_name,
    )

Macros are instantiated in BUILD files by loading them and calling them like functions.

load(":build_defs.bzl", "py_gen_file")

py_gen_file(
    name = "builtins",
    script = "src/gen_builtins_h.py",
    data = "//:common/builtins.yaml",
    out = "src/builtins.h",
)

A rule adds new functionality to Bazel by providing a new way to convert part of the rule graph into part of the file graph. The rule implementation can declare new build artifacts and can attach actions to those artifacts. Rules are declared during the loading phase, but their implementations are executed during the analysis phase. Rules are powerful, but some knowledge of Bazel is required to use them correctly.

To define a new rule, the rule function must be called with a rule implementation function, a dict of attributes, and a dict of outputs. Attributes are arguments that get passed to the rule when it is instantiated. Usually these are labels for sources and dependencies (srcs, deps), but they can be things like compiler or linker flags.

Here's a simple rule I defined to generate Doxygen documentation for CodeSwitch. It assumes Doxygen is already installed on the host machine. It substitutes the string "@@OUTPUT_DIRECTORY@@" in the given Doxyfile, executes Doxygen with the substituted file, then packages the result in a .tar.gz file.

def _doxygen_archive_impl(ctx):
    """Generate a .tar.gz archive containing documentation using Doxygen.

    Args:
        name: label for the generated rule. The archive will be "%{name}.tar.gz".
        doxyfile: configuration file for Doxygen
        srcs: source files the documentation will be generated from.
    """
    doxyfile = ctx.file.doxyfile
    out_file = ctx.outputs.out
    out_dir_path = out_file.short_path[:-len(".tar.gz")]
    commands = [
        "mkdir -p %s" % out_dir_path,
        "out_dir_path=$(cd %s; pwd)" % out_dir_path,
        "pushd %s" % doxyfile.dirname,
        ("sed -e \"s:@@OUTPUT_DIRECTORY@@:$out_dir_path/codeswitch-api/:\" <%s" +
            " | doxygen -") % doxyfile.basename,
        "popd",
        "tar czf %s -C %s codeswitch-api" % (out_file.path, out_dir_path),
    ]
    ctx.action(
        inputs = ctx.files.srcs + [doxyfile],
        outputs = [out_file],
        command = " && ".join(commands),
    )


doxygen_archive = rule(
    implementation = _doxygen_archive_impl,
    attrs = {
        "doxyfile": attr.label(
            mandatory = True,
            allow_files = True,
            single_file = True),
        "srcs": attr.label_list(
            mandatory = True,
            allow_files = True),
    },
    outputs = {
        "out": "%{name}.tar.gz",
    },
)

The implementation function does the heavy lifting. It takes a ctx argument, which provides information about the rule and its attributes. ctx.file provides access to single-file attributes like doxyfile. ctx.files provides access to multi-file attributes like srcs. ctx.outputs provides access to the outputs for the rule (this matches the outputs dict, specified in the call to the rule function). Every file in outputs must have an action that generates it. ctx.action creates a new action which is either an executable with arguments or a shell command (as in this case).

gy_library and gy_binary

Let's look at some more interesting rules: the rules that actually add Bazel support for Gypsum. gy_library builds a Gypsum package from source, using the Gypsum compiler. We'll start with the dict of attributes for this rule. These attributes are shared with gy_binary, so they are declared separately.

_gy_attrs = {
    "package_name": attr.string(default="default"),
    "package_version": attr.string(default="0"),
    "srcs": attr.label_list(allow_files=[".gy"]),
    "deps": attr.label_list(providers=[
        "gy",
        "transitive_pkg_dirs",
    ]),
    "data": attr.label_list(allow_files=True, cfg="data"),
    "native_lib": attr.label(cfg="target", providers=["cc"]),
    "flags": attr.string_list(),
    "_gy_compiler": attr.label(
        executable=True, cfg="host", default=Label("//gypsum:compiler")),
}

package_name and package_version are passed as arguments to the compiler. srcs is the list of Gypsum source files that comprise the package. deps is a list of Gypsum packages that the package being compiled depends on. native_lib is an optional dependency on a shared library that CodeSwitch can load dynamically. flags is a list of other flags that should be passed to the compiler. _gy_compiler is an implicit dependency on the Gypsum compiler. Attributes that start with "_" are implicit and must have a default value.

Note the providers argument for some of the labels. Providers pass information from a rule to other rules that depend on it. Specifying a list of providers for an attribute requires dependencies to provide that information. In the case of deps, we want other Gypsum rules, hence the "gy" provider. In the case of native_lib, we want a C++ rule, hence the "cc" provider.

Here's the call to the rule function for gy_library:

gy_library = rule(
    implementation = _gy_library_impl,
    attrs = _gy_attrs,
    outputs = {"pkg": "%{package_name}-%{package_version}.csp"},
    fragments = ["cpp"],
)

This is pretty straightforward. The package file is our only output. Its name is based on the package_name and package_version attributes. We need access to the cpp fragment, since we use platform information to determine an appropriate file extension for the native library.
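
The %{...} placeholders in the outputs dict are substituted from the rule's attributes. A tiny plain-Python sketch of the equivalent substitution (illustrative only; Bazel performs this expansion internally):

```python
def package_output_name(package_name, package_version):
    # Mirrors the outputs template "%{package_name}-%{package_version}.csp".
    return "%s-%s.csp" % (package_name, package_version)

print(package_output_name("std.io", "0"))  # std.io-0.csp
```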

Here's the implementation function and the helper functions that do most of the work.

def _gy_library_impl(ctx):
    _compile_gy_package(ctx)
    return _prepare_gy_providers(ctx)

def _compile_gy_package(ctx):
    args = [
        "--package-name", ctx.attr.package_name,
        "--package-version", ctx.attr.package_version,
        "--output", ctx.outputs.pkg.path,
    ]
    inputs = []
    for f in ctx.files.deps:
        args.append("--depends")
        args.append(f.path)
        inputs.append(f)
    args += [f.path for f in ctx.files.srcs]
    inputs += ctx.files.srcs
    args += ctx.attr.flags
    ctx.action(
        inputs = inputs,
        outputs = [ctx.outputs.pkg],
        arguments = args,
        executable = ctx.executable._gy_compiler,
        progress_message = "Compiling Gypsum package %s" % ctx.outputs.pkg.path,
    )

_gy_provider = provider()

def _prepare_gy_providers(ctx):
    files = [ctx.outputs.pkg]
    pkg_dir_name = _short_dirname(ctx.outputs.pkg)
    transitive_pkg_dirs = set([pkg_dir_name])
    symlinks = {}

    package_name = ctx.attr.package_name
    package_version = ctx.attr.package_version
    package_file = [ctx.outputs.pkg]
    native_lib_file = None
    if ctx.attr.native_lib:
        native_lib_file = [f for f in ctx.files.native_lib
                           if f.extension == "so"][0]
        files.append(native_lib_file)
        cpu = ctx.fragments.cpp.cpu
        lib_ext = "dylib" if cpu == "darwin" else "so"
        link_path = ("%s/lib%s-%s.%s" % 
                     (pkg_dir_name, package_name, package_version, lib_ext))
        if link_path != native_lib_file.short_path:
            symlinks[link_path] = native_lib_file
    gy = _gy_provider(
        package_name = package_name,
        package_version = package_version,
        package_file = package_file,
        native_lib_file = native_lib_file,
    )

    if hasattr(ctx.attr, "_codeswitch_cmd"):
        files.append(ctx.file._codeswitch_cmd)
    runfiles = ctx.runfiles(
        files = files,
        collect_data = True,
        symlinks = symlinks)

    for dep in ctx.attr.deps:
        transitive_pkg_dirs += dep.transitive_pkg_dirs
        runfiles = runfiles.merge(dep.data_runfiles)

    return struct(
        gy = gy,
        transitive_pkg_dirs = transitive_pkg_dirs,
        runfiles = runfiles,
    )

def _short_dirname(f):
    short_path = f.short_path
    i = short_path.rfind("/")
    return short_path[:i] if i != -1 else "."

_compile_gy_package inserts an action into the file graph that actually compiles the Gypsum package. This is done by calling ctx.action with an executable (the Gypsum compiler), a list of inputs (source files and dependencies), a list of outputs (the package being compiled), and a list of arguments (package name, version, flags, etc.).

_prepare_gy_providers gathers information to be passed to other rules. There are three important providers here: gy, a struct carrying the package name, version, package file, and native library; transitive_pkg_dirs, the set of directories containing this package and every package it depends on (the interpreter needs these for its --package-path flag); and runfiles, the set of files needed at run time, including data dependencies and symlinks.

Hopefully that made at least a small amount of sense. Let's look at the gy_binary rule next.

gy_binary = rule(
    implementation = _gy_binary_impl,
    executable = True,
    attrs = _gy_attrs + {
        "_codeswitch_cmd": attr.label(
            executable=True,
            allow_single_file=True,
            cfg="target",
            default=Label("//codeswitch:codeswitch_cmd")),
    },
    outputs = {
        "pkg": "%{package_name}-%{package_version}.csp",
    },
    fragments = ["cpp"],
)

def _gy_binary_impl(ctx):
    # Build package
    _compile_gy_package(ctx)
    providers = _prepare_gy_providers(ctx)

    # Generate runner script.
    args = [ctx.file._codeswitch_cmd.short_path]
    for pkg_dir in providers.transitive_pkg_dirs:
        args += ["--package-path", "'%s'" % pkg_dir]
    args += ["--package", "'%s'" % ctx.attr.package_name]
    command = "exec " + " ".join(args)
    ctx.file_action(
        output = ctx.outputs.executable,
        content = command,
        executable = True
    )

    return providers

The rule definition is pretty much the same as gy_library, but there's an additional implicit dependency on the CodeSwitch interpreter, which is a cc_binary defined elsewhere. The executable argument indicates this rule will define an action for the file ctx.outputs.executable in addition to its other outputs.

The rule implementation compiles a Gypsum package and builds providers the same way that _gy_library_impl does. It also creates a shell script in ctx.outputs.executable that starts the interpreter with the appropriate packages and paths.
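
To make the generated script concrete, here is the same command assembly as a standalone Python sketch (the interpreter path and package names below are made up for illustration):

```python
def runner_script(codeswitch_path, transitive_pkg_dirs, package_name):
    # Mirrors the command assembly in _gy_binary_impl, outside of Bazel.
    args = [codeswitch_path]
    for pkg_dir in transitive_pkg_dirs:
        args += ["--package-path", "'%s'" % pkg_dir]
    args += ["--package", "'%s'" % package_name]
    return "exec " + " ".join(args)

print(runner_script("codeswitch/codeswitch_cmd", ["std", "examples"], "hello"))
# exec codeswitch/codeswitch_cmd --package-path 'std' --package-path 'examples' --package 'hello'
```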

Gypsum rules in action

gy_library is used to build the standard library and several test cases for CodeSwitch. Here is the BUILD file for std.io:

load("//:build_defs.bzl", "gy_library")

package(default_visibility = ["//visibility:public"])

gy_library(
    name = "io",
    package_name = "std.io",
    package_version = "0",
    srcs = glob(["src/*.gy"]),
    deps = ["//std"],
    native_lib = ":std.io-native",
)

cc_library(
    name = "std.io-native",
    srcs = glob(["src/*.cpp"]),
    deps = ["//codeswitch:codeswitch"],
)

gy_binary is used to build the programs in the examples/ directory. Here's the BUILD file from there:

load("//:build_defs.bzl", "gy_binary")

[gy_binary(
    name = source_file[:source_file.rindex(".")],
    package_name = source_file[:-3].replace("-", ""),
    srcs = [source_file],
    deps = ["//std"]
) for source_file in glob(["*.gy"])]
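
The two string expressions in that comprehension derive the target name and the package name from the file name. Tracing them on a hypothetical source file:

```python
source_file = "list-example.gy"  # hypothetical example file name

# Target name: everything before the last dot.
name = source_file[:source_file.rindex(".")]
# Package name: strip the ".gy" extension, then drop the hyphens
# (Gypsum package names can't contain them).
package_name = source_file[:-3].replace("-", "")

print(name)          # list-example
print(package_name)  # listexample
```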

Conclusion

Migrating Gypsum and CodeSwitch to Bazel was a fairly significant effort, but I'm happy with the result. Bazel is a really powerful system. I think once a little more functionality is exposed in Skylark, it will be ideal for cross-language, cross-platform projects.

It took me longer than I would have liked to come up to speed on Bazel and Skylark. Community documentation (blog posts, StackOverflow questions, etc.) is a little thin, which is to be expected since Bazel hasn't been out for very long and is still pre-1.0. Maybe these articles will help with that.

If you're interested in writing your own Bazel extensions, start with the extension examples on the main Bazel site. You may also want to look at some of the extensions for other languages, like rules_go or rules_rust. I'm one of the maintainers of the Go rules, so keep an eye on that.