Migrating to Bazel: Part 2
In the previous article, I talked about migrating Gypsum and CodeSwitch, my open source compiler and interpreter, to the Bazel build system. I focused mostly on getting Python and C++ libraries and tests to build. Bazel has built-in support for Python and C++, so this wasn't too challenging.
This time, I'll talk about adding support to Bazel for building Gypsum packages. Before we get to that though, I'll give a bit more background on the configuration language and how Bazel deals with extensions.
Skylark: differences with Python
As you'll recall, Skylark is Bazel's build configuration and extension language. Skylark is basically Python without the language constructs that make Python programs complicated.
In terms of syntax, Skylark has variables, functions, list comprehensions, if-statements, for-loops (but no while loops), and most of the expressions you normally see in Python. In terms of data structures, Skylark provides tuples, lists, dicts, and sets. There are no classes, exceptions, or iterators. Recursion is not allowed. This means Skylark is not actually Turing-complete.
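To get a feel for the subset, here's a small sketch (the file names are made up) that stays within Skylark's limits:

# Variables, functions, if-statements, for-loops, and list
# comprehensions all work as they do in Python.
def test_names(files):
    names = []
    for f in files:
        if f.endswith(".gy"):
            names.append(f[:-len(".gy")] + "_test")
    return names

NAMES = test_names(["list.gy", "map.gy"])  # ["list_test", "map_test"]

# Not allowed: while loops, classes, try/except, and recursion are
# all rejected when the file is evaluated.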
Skylark has two dialects. The build dialect is used to write BUILD files that describe what is being built. This is where the cc_binary and py_test rules are declared. These files only allow rule declarations (which look like function calls), variable definitions, and basic expressions. Function definitions and if-statements are not allowed. You can declare a group of similar rules inside a list comprehension though.
Here's a snippet from a BUILD file:
# Variable declaration
GYPSUM_DEPS = ["@yaml//:yaml"]

# Rule declaration / function call
py_binary(
    name = "compiler",
    srcs = [":sources"],
    deps = GYPSUM_DEPS,
    data = [":common"],
    main = "__main__.py",
    visibility = ["//visibility:public"],
)

# Rule declarations inside a list comprehension
[py_test(
    name = test_file[:-len(".py")],
    size = "small",
    srcs = [
        test_file,
        "utils_test.py",
        ":sources",
    ],
    deps = GYPSUM_DEPS,
    data = [":common"],
) for test_file in glob(["test_*.py"])]
The extension dialect is used to define new functions, macros, and rules in .bzl files. All Skylark syntax is allowed in these files.
# Simple function definition
def gy_test_name(file_name):
    """Computes a unit test name, given a Gypsum file name.

    For example, given "test/map-list-example.gy", this returns
    "MapListExample".
    """
    base_name = file_name[file_name.index("/") + 1 : file_name.rindex(".")]
    words = base_name.split("-")
    name = "".join([w.capitalize() for w in words])
    return name
Definitions in .bzl files can be loaded into BUILD files or other .bzl files with the load function. Definitions prefixed with "_" are private and cannot be loaded, but everything else is visible. When a file is loaded, its global state is recursively frozen. This means that everything you load from another file is immutable. This enables Bazel to load files in parallel and cache their contents. It also frees you from having to worry about the order in which files are loaded or weird side effects from loading unrelated files (which is a big problem with Makefiles).
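As a quick illustration (with hypothetical file names), freezing means you build new values instead of mutating loaded ones:

# deps.bzl
COMMON_DEPS = ["@yaml//:yaml"]

# BUILD (or another .bzl file)
load(":deps.bzl", "COMMON_DEPS")

# COMMON_DEPS is frozen here; mutating it is an error:
# COMMON_DEPS.append("//:extra")  # fails: cannot modify a frozen list
MY_DEPS = COMMON_DEPS + ["//:extra"]  # create a new list instead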
load("//:build_defs.bzl", "gy_library") load(":build_defs.bzl", "py_gen_file", "doxygen_archive", "gy_test_name")
Build phases
Bazel does its work in three phases: loading, analysis, and execution. It's important to be aware of these phases in order to understand where rules and macros fit in and what they can do.
During the loading phase, Bazel loads and evaluates BUILD and .bzl files. Bazel will only evaluate files for packages requested on the command line and files loaded with the load function. The result of the loading phase is the rule dependency graph. Each vertex in this graph is a rule (declared with cc_library, etc.). Each edge is a dependency, e.g., a cc_library depends on source files and other libraries.
The analysis phase translates the rule dependency graph into a file dependency graph. The file graph is a relation between build artifacts and their dependencies. Each build artifact has an action that specifies how to build it. Unlike the rule graph, the file graph contains intermediate results that aren't explicitly mentioned in BUILD files. For example, think about how a cc_binary is specified and built. The rule declares the resulting executable, the sources, and the dependencies. The rule implementation adds the binary file and the intermediate .o files to the file graph (which are never mentioned in BUILD files), along with compile and link actions that produce them.
During the execution phase, actions are executed in order to produce the targets specified by the user. Bazel scans the file graph for files that are missing or out of date, then runs the associated actions in a sandboxed environment. The sandbox prevents an action from referencing dependencies that Bazel does not know about.
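To make the three phases concrete, here's a sketch (with made-up file and target names) of how a single cc_binary declaration flows through them:

# Loading phase: evaluating this declaration adds one vertex to the
# rule graph, with dependency edges to main.cpp and //base.
cc_binary(
    name = "tool",
    srcs = ["main.cpp"],
    deps = ["//base"],
)

# Analysis phase (conceptual result): the rule's implementation adds
# file-graph entries that never appear in any BUILD file, roughly:
#   main.o  <- compile action on main.cpp
#   tool    <- link action on main.o and //base's outputs
#
# Execution phase: Bazel runs the compile and link actions for any of
# those files that are missing or out of date, inside a sandbox.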
Macros and rules
You can extend Bazel by writing new macros and rules in a .bzl file. A macro is a function that instantiates one or more rules. This is useful for eliminating repetition in build files. For example, if you want to build several similar executables (py_binary), then generate source files with them (genrule), you can use a macro. Macros are evaluated during the loading phase.
Macros are declared like regular Python functions. When you instantiate a rule inside a macro, you must use the "native." prefix.
def py_gen_file(name, script, data, out):
    script_name = "gen_" + name
    native.py_binary(
        name = script_name,
        srcs = [script],
        main = script,
        deps = ["@yaml//:yaml"],
    )
    native.genrule(
        name = name,
        srcs = [data],
        tools = [script_name],
        outs = [out],
        cmd = "python $(location :%s) $< $@" % script_name,
    )
Macros are instantiated in BUILD files by loading them and calling them like functions.
load(":build_defs.bzl", "py_gen_file") py_gen_file( name = "builtins", script = "src/gen_builtins_h.py", data = "//:common/builtins.yaml", out = "src/builtins.h", )
A rule adds new functionality to Bazel by providing a new way to convert part of the rule graph into part of the file graph. The rule implementation can declare new build artifacts and can attach actions to those artifacts. Rules are declared during the loading phase, but their implementations are executed during the analysis phase. Rules are powerful, but some knowledge of Bazel is required to use them correctly.
To define a new rule, the rule function must be called with a rule implementation function, a dict of attributes, and a dict of outputs. Attributes are arguments that get passed to the rule when it is instantiated. Usually these are labels for sources and dependencies (srcs, deps), but they can be things like compiler or linker flags.
Here's a simple rule I defined to generate Doxygen documentation for CodeSwitch. It assumes Doxygen is already installed on the host machine. It substitutes the string "@@OUTPUT_DIRECTORY@@" in the given Doxyfile, executes Doxygen with the substituted file, then packages the result in a .tar.gz file.
def _doxygen_archive_impl(ctx):
    """Generate a .tar.gz archive containing documentation using Doxygen.

    Args:
        name: label for the generated rule. The archive will be
            "%{name}.tar.gz".
        doxyfile: configuration file for Doxygen
        srcs: source files the documentation will be generated from.
    """
    doxyfile = ctx.file.doxyfile
    out_file = ctx.outputs.out
    out_dir_path = out_file.short_path[:-len(".tar.gz")]
    commands = [
        "mkdir -p %s" % out_dir_path,
        "out_dir_path=$(cd %s; pwd)" % out_dir_path,
        "pushd %s" % doxyfile.dirname,
        ("sed -e \"s:@@OUTPUT_DIRECTORY@@:$out_dir_path/codeswitch-api/:\" <%s" +
            " | doxygen -") % doxyfile.basename,
        "popd",
        "tar czf %s -C %s codeswitch-api" % (out_file.path, out_dir_path),
    ]
    ctx.action(
        inputs = ctx.files.srcs + [doxyfile],
        outputs = [out_file],
        command = " && ".join(commands),
    )

doxygen_archive = rule(
    implementation = _doxygen_archive_impl,
    attrs = {
        "doxyfile": attr.label(
            mandatory = True,
            allow_files = True,
            single_file = True),
        "srcs": attr.label_list(
            mandatory = True,
            allow_files = True),
    },
    outputs = {
        "out": "%{name}.tar.gz",
    },
)
The implementation function does the heavy lifting. It takes a ctx argument, which provides information about the rule and its attributes. ctx.file provides access to single-file attributes like doxyfile. ctx.files provides access to multi-file attributes like srcs. ctx.outputs provides access to the outputs for the rule (this matches the outputs dict specified in the call to the rule function). Every file in outputs must have an action that generates it. ctx.action creates a new action, which is either an executable with arguments or a shell command (as in this case).
gy_library and gy_binary
Let's look at some more interesting rules: the rules that actually add Bazel support for Gypsum. gy_library builds a Gypsum package from source, using the Gypsum compiler. We'll start with the dict of attributes for this rule. These attributes are shared with gy_binary, so they are declared separately.
_gy_attrs = {
    "package_name": attr.string(default="default"),
    "package_version": attr.string(default="0"),
    "srcs": attr.label_list(allow_files=[".gy"]),
    "deps": attr.label_list(providers=[
        "gy",
        "transitive_pkg_dirs",
    ]),
    "data": attr.label_list(allow_files=True, cfg="data"),
    "native_lib": attr.label(cfg="target", providers=["cc"]),
    "flags": attr.string_list(),
    "_gy_compiler": attr.label(
        executable=True,
        cfg="host",
        default=Label("//gypsum:compiler")),
}
package_name and package_version are passed as arguments to the compiler. srcs is the list of Gypsum source files that comprise the package. deps is a list of Gypsum packages that the package being compiled depends on. native_lib is an optional dependency on a shared library that CodeSwitch can load dynamically. flags is a list of other flags that should be passed to the compiler. _gy_compiler is an implicit dependency on the Gypsum compiler. Attributes that start with "_" are implicit and must have a default value.
Note the providers argument for some of the labels. Providers pass information from a rule to other rules that depend on it. Specifying a list of providers for an attribute requires dependencies to provide that information. In the case of deps, we want other Gypsum rules, hence the "gy" provider. In the case of native_lib, we want a C++ rule, hence the "cc" provider.
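As a sketch of how that flows (rule names are hypothetical, using the same legacy struct-style providers as the rules below), a rule advertises a provider by returning it, and a depending rule reads it off the attribute:

# A rule advertises the "gy" provider by returning a struct with a
# "gy" field from its implementation function:
def _dep_impl(ctx):
    return struct(gy = struct(package_name = ctx.attr.package_name))

# A depending rule declares that its deps must carry that provider,
# and can then rely on the field being present on each dependency:
_consumer_attrs = {
    "deps": attr.label_list(providers = ["gy"]),
}

def _consumer_impl(ctx):
    names = [dep.gy.package_name for dep in ctx.attr.deps]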
Here's the call to the rule function for gy_library:
gy_library = rule(
    implementation = _gy_library_impl,
    attrs = _gy_attrs,
    outputs = {"pkg": "%{package_name}-%{package_version}.csp"},
    fragments = ["cpp"],
)
This is pretty straightforward. The package file is our only output. Its name is based on the package_name and package_version attributes; for example, package_name = "std.io" with package_version = "0" produces std.io-0.csp. We need access to the cpp fragment, since we use platform information to determine an appropriate file extension for the native library.
Here's the implementation function and the helper functions that do most of the work.
def _gy_library_impl(ctx):
    _compile_gy_package(ctx)
    return _prepare_gy_providers(ctx)

def _compile_gy_package(ctx):
    # Build the compiler command line: package metadata, output path,
    # dependencies, sources, then any extra flags.
    args = [
        "--package-name", ctx.attr.package_name,
        "--package-version", ctx.attr.package_version,
        "--output", ctx.outputs.pkg.path,
    ]
    inputs = []
    for f in ctx.files.deps:
        args.append("--depends")
        args.append(f.path)
        inputs.append(f)
    args += [f.path for f in ctx.files.srcs]
    inputs += ctx.files.srcs
    args += ctx.attr.flags
    ctx.action(
        inputs = inputs,
        outputs = [ctx.outputs.pkg],
        arguments = args,
        executable = ctx.executable._gy_compiler,
        progress_message = "Compiling Gypsum package %s" % ctx.outputs.pkg.path,
    )

_gy_provider = provider()

def _prepare_gy_providers(ctx):
    files = [ctx.outputs.pkg]
    pkg_dir_name = _short_dirname(ctx.outputs.pkg)
    transitive_pkg_dirs = set([pkg_dir_name])
    symlinks = {}
    package_name = ctx.attr.package_name
    package_version = ctx.attr.package_version
    package_file = [ctx.outputs.pkg]
    native_lib_file = None
    if ctx.attr.native_lib:
        # CodeSwitch expects the native library next to the package,
        # named after the package, so symlink the .so into place.
        native_lib_file = [f for f in ctx.files.native_lib
                           if f.extension == "so"][0]
        files.append(native_lib_file)
        cpu = ctx.fragments.cpp.cpu
        lib_ext = "dylib" if cpu == "darwin" else "so"
        link_path = ("%s/lib%s-%s.%s" %
            (pkg_dir_name, package_name, package_version, lib_ext))
        if link_path != native_lib_file.short_path:
            symlinks[link_path] = native_lib_file
    gy = _gy_provider(
        package_name = package_name,
        package_version = package_version,
        package_file = package_file,
        native_lib_file = native_lib_file,
    )
    if hasattr(ctx.attr, "_codeswitch_cmd"):
        files += ctx.files._codeswitch_cmd
    runfiles = ctx.runfiles(
        files = files,
        collect_data = True,
        symlinks = symlinks)
    for dep in ctx.attr.deps:
        transitive_pkg_dirs += dep.transitive_pkg_dirs
        runfiles = runfiles.merge(dep.data_runfiles)
    return struct(
        gy = gy,
        transitive_pkg_dirs = transitive_pkg_dirs,
        runfiles = runfiles,
    )

def _short_dirname(f):
    short_path = f.short_path
    i = short_path.rfind("/")
    return short_path[:i] if i != -1 else "."
_compile_gy_package inserts an action into the file graph that actually compiles the Gypsum package. This is done by calling ctx.action with an executable (the Gypsum compiler), a list of inputs (source files and dependencies), a list of outputs (the package being compiled), and a list of arguments (package name, version, flags, etc.).
_prepare_gy_providers gathers information to be passed to other rules. There are three important providers here.
- gy - information about the Gypsum package being built. Includes name, version, path, and native library file (if present). This isn't actually used anywhere, but I expect it will be useful for other tools.
- transitive_pkg_dirs - a set of all directories that CodeSwitch may need to search to load packages by name. This includes directories of transitive dependencies.
- runfiles - information about files that must be present when a program that depends on a Gypsum package is run. This includes Gypsum packages, the CodeSwitch interpreter, and native libraries (and appropriate symlinks to them). It also contains files referenced in the data attribute of a Gypsum rule.
Hopefully that made at least a small amount of sense. Let's look at the gy_binary rule next.
gy_binary = rule(
    implementation = _gy_binary_impl,
    executable = True,
    attrs = _gy_attrs + {
        "_codeswitch_cmd": attr.label(
            executable=True,
            allow_single_file=True,
            cfg="target",
            default=Label("//codeswitch:codeswitch_cmd")),
    },
    outputs = {
        "pkg": "%{package_name}-%{package_version}.csp",
    },
    fragments = ["cpp"],
)

def _gy_binary_impl(ctx):
    # Build package
    _compile_gy_package(ctx)
    providers = _prepare_gy_providers(ctx)

    # Generate runner script.
    args = [ctx.file._codeswitch_cmd.short_path]
    for pkg_dir in providers.transitive_pkg_dirs:
        args += ["--package-path", "'%s'" % pkg_dir]
    args += ["--package", "'%s'" % ctx.attr.package_name]
    command = "exec " + " ".join(args)
    ctx.file_action(
        output = ctx.outputs.executable,
        content = command,
        executable = True,
    )
    return providers
The rule definition is pretty much the same as gy_library, but there's an additional implicit dependency on the CodeSwitch interpreter, which is a cc_binary defined elsewhere. The executable argument indicates this rule will define an action for the file ctx.outputs.executable in addition to its other outputs.
The rule implementation compiles a Gypsum package and builds providers the same way that _gy_library_impl does. It also creates a shell script in ctx.outputs.executable that starts the interpreter with the appropriate packages and paths.
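The generated script ends up as a single exec line, roughly like this (the paths and package name are illustrative):

exec codeswitch/codeswitch_cmd --package-path 'std' --package 'hello'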
Gypsum rules in action
gy_library is used to build the standard library and several test cases for CodeSwitch. Here is the BUILD file for std.io:
load("//:build_defs.bzl", "gy_library") package(default_visibility = ["//visibility:public"]) gy_library( name = "io", package_name = "std.io", package_version = "0", srcs = glob(["src/*.gy"]), deps = ["//std"], native_lib = ":std.io-native", ) cc_library( name = "std.io-native", srcs = glob(["src/*.cpp"]), deps = ["//codeswitch:codeswitch"], )
gy_binary is used to build the programs in the examples/ directory. Here's the BUILD file from there:
load("//:build_defs.bzl", "gy_binary") [gy_binary( name = source_file[:source_file.rindex(".")], package_name = source_file[:-3].replace("-", ""), srcs = [source_file], deps = ["//std"] ) for source_file in glob(["*.gy"])]
Conclusion
Migrating Gypsum and CodeSwitch to Bazel was a fairly significant effort, but I'm happy with the result. Bazel is a really powerful system. I think once a little more functionality is exposed in Skylark, it will be ideal for cross-language, cross-platform projects.
It took me longer than I would have liked to come up to speed on Bazel and Skylark. Community documentation (blog posts, StackOverflow questions, etc.) is a little thin, which is to be expected since Bazel hasn't been out for very long and is still pre-1.0. Maybe these articles will help with that.
If you're interested in writing your own Bazel extensions, start with the extension examples on the main Bazel site. You may also want to look at some of the extensions for other languages, like rules_go or rules_rust. I'm one of the maintainers for the Go rules, so keep an eye on those.