Package loading in CodeSwitch

Published on 2015-07-27
Tagged: codeswitch

A few weeks ago, I announced that Gypsum and CodeSwitch now have support for packages. It's time to go into a bit more detail about how CodeSwitch deals with packages.

For those who haven't been following, Gypsum is my experimental programming language, and CodeSwitch is the virtual machine that runs it. Both can be found on GitHub.

Contents of packages

CodeSwitch manages code in chunks called packages. Each package is stored in a separate file. A package has a name, a version, and a list of other packages it depends on (dependencies). Each dependency has a name, a minimum version, and a maximum version (both versions are optional).

Packages contain definitions, which fall into the following categories:

Globals - variables and constants, accessible in every function.
Functions - described using bytecode. Can be called through the API or by other functions.
Classes - blueprints for building objects. Each class has a list of fields, methods, and constructors (which are just functions).
Type parameters - symbols that represent unknown types between an upper and lower bound. Used to implement "generics" in functions and classes.

This list will include interfaces in the future, once they are added to Gypsum.

Definitions within a package may have a PUBLIC flag associated with them. Public definitions can be used by other packages.

The package loader

The package loader loads, links, and initializes packages from files and stores them in the VM. The package loader's work starts when a client asks CodeSwitch to load a package by its file name through the API.

The package loader deserializes the package into memory and places the resulting Package object onto the loading stack. The package loader then starts loading dependencies in a loop. In each iteration of the loop, it pops a package off the loading stack. If the package has no dependencies or all its dependencies have already been loaded, it's linked and moved to the initialization queue. Otherwise, it's pushed back on the loading stack, and its next dependency is deserialized and pushed onto the loading stack as well.

Once the loading stack is empty, the package loader iterates over the initialization queue and executes each package's initialization function. Initialization functions are mostly just responsible for setting the initial values of global variables. After a package has been initialized, it is added to the VM's list of loaded packages.

A bit of pseudocode might make the loading process more clear (or less, depending on how readable you find my pseudocode).

def load(vm, package-file-name)
  // Check if the package was already loaded.
  if âˆƒ package âˆˆ vm.packages s.t. package.name == package-name
    return

  // Load the package into memory.
  let package = deserialize-package(package-file-name)

  // Push the package onto the loading stack.
  let loading-stack = Stack()
  let initialization-queue = Queue()
  loading-stack.push((package, 0))

  // Load and link all of the package's dependencies.
  while !loading-stack.is-empty
    let package, dependency-index = loading-stack.pop()
    if dependency-index == package.dependencies.length
      // All dependencies processed.
      link(package)
      initialization-queue.enqueue(package)
    else
      // Load the next dependency, if it hasn't been loaded already.
      loading-stack.push(package, dependency-index + 1)
      let dependency = package.dependencies[dependency-index]
      if âˆƒ package âˆˆ loading-stack
           s.t. package.name == dependency.name
        throw Error("circular package dependency")
      if âˆ„ package âˆˆ (initialization-queue âˆª vm.packages)
                   s.t. package.name == dependency.name)
        let dep-file-name = locate-package(dependency.name)
        let dep-package = deserialize-package(dep-file-name)
        loading-stack.push((dep-package, 0))

  // TODO: validation

  // Initialize each package.
  while !initialization-queue.is-empty
    let package = initialization-queue.dequeue()
    if package has initializer
      vm.interpret(package.initializer)
    vm.packages.add(package)

Note that CodeSwitch doesn't yet support circular dependencies between packages. I'm sure how important it is to allow circular dependencies, but this limitation will probably be removed in the future.

Linking

When a package is compiled, each external definition (in another package) that is used gets stubbed and stored inside the PackageDependency object for the package it's defined in. Remember how I mentioned each package has a list of dependencies with names and versions? Those dependencies are represented by the same PackageDependency objects.

Linking is the process of correlating the external definition stubs in PackageDependency objects with the actual definitions in other Package objects. The process is fairly simple for now. Before linking, we build an export map for each dependency. An export map is a hash table that maps definition names to public definitions. Then for each external definition in each dependency, we look up the definition in the dependency's export map.

This process has a couple problems I'd like to fix in the future. First, export maps are keyed by name only. This means that two public definitions cannot have the same name, even if they are different kinds of definitions (functions and classes, for example). This isn't a problem for Gypsum, but it may be for other languages. It also means that public functions cannot be overloaded. Second, linking is completely eager. A lot of work is done up front, even if most external definitions aren't used at run-time. The linking process should be lazy; an external definition should only be linked the first time it's used.

TODO: Validation

Package validation is notably absent from the loading process. CodeSwitch trusts that whatever code it reads obeys all the various undocumented invariants it expects without checking. This isn't by design; I just haven't written the validation code yet. Once the package format stops changing so quickly, I'll document it and write the validation logic in Gypsum (it would be a pain to write it all in C++).

Validation will need to happen before initialization, since no code in a package should be executed before being validated. It will need to happen after linking, since I'll need to check that external definitions actually have the expected types and that there are no inheritance cycles that cross package boundaries.

Linking needs to be eager in order for this to work. Since validation and eager linking are both big costs that will increase load time, so I plan to perform validation only on newly installed packages. API clients will probably also be able to skip validation if they trust the package for some other reason. For example, if we built a JavaScript engine using CodeSwitch, we would just glue a JavaScript compiler to CodeSwitch; packages would be generated and consumed by the same program, so we wouldn't want to waste time validating.