Life of a Go module

Published on 2021-03-26
Tagged: git go modules

Go's module system is designed to be decentralized. Although there are public mirrors like proxy.golang.org, there is no central module registry. An author can publish a new version of their module by creating a tag in the module's source repository.

$ git tag v1.2.3
$ git push --tags

A user can download and use that new version right away.

$ go get -d example.com/mod@v1.2.3

It's cool this is so automatic, but how exactly does it work? What does the go command download, and from where?

Configuring module downloads

When the go command needs a module that's not in its local cache, it can either download the module from the source repository (direct mode), or it can download the module from a proxy, also known as a mirror. You can control how modules are downloaded by setting GOPROXY, GOPRIVATE, and a few other environment variables. The default setting of GOPROXY is:

GOPROXY=https://proxy.golang.org,direct

This tells the go command to attempt to download modules first from proxy.golang.org, the public module mirror operated by Google. If a module isn't available there (indicated by a 404 or 410 HTTP status), the go command falls back to direct mode. That usually happens because the module is in a private repository that's not visible to proxy.golang.org.
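
To make the fallback rule concrete, here's a rough sketch of walking a comma-separated GOPROXY list. It's not the go command's real code: `download`, `fetchFunc`, and `errNotFound` are names I made up for illustration, and `errNotFound` stands in for a 404 or 410 response.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errNotFound is a hypothetical stand-in for a 404 or 410 HTTP status.
var errNotFound = errors.New("not found")

// fetchFunc is a hypothetical callback that downloads a module from one source.
type fetchFunc func(source string) error

// download tries each comma-separated GOPROXY entry in order. It moves on to
// the next entry only when the previous one reports "not found"; any other
// error stops the search.
func download(goproxy string, fetch fetchFunc) (string, error) {
	for _, source := range strings.Split(goproxy, ",") {
		source = strings.TrimSpace(source)
		if source == "" {
			continue
		}
		err := fetch(source)
		if err == nil {
			return source, nil
		}
		if !errors.Is(err, errNotFound) {
			return "", fmt.Errorf("%s: %w", source, err)
		}
	}
	return "", errNotFound
}

func main() {
	// Pretend the module is private, so only direct mode can find it.
	fetch := func(source string) error {
		if source == "direct" {
			return nil
		}
		return errNotFound
	}
	source, err := download("https://proxy.golang.org,direct", fetch)
	fmt.Println(source, err)
}
```

With the default setting, a module the mirror can't see ends up being fetched in direct mode.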

You can enable direct mode for specific modules by setting GOPRIVATE to a list of patterns matching prefixes of those modules (for example, GOPRIVATE=corp.example.com). You can enable direct mode for all modules by setting GOPROXY=direct.
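
Here's a minimal sketch of how a pattern list like GOPRIVATE can be matched against module path prefixes. It's modeled on my understanding of the matching rules, not copied from the go command; `matchPrefixPatterns` and the `github.com/acme/*` pattern are illustrative assumptions.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchPrefixPatterns reports whether any glob in the comma-separated list
// matches a leading prefix of target, compared one path element at a time.
func matchPrefixPatterns(globs, target string) bool {
	elems := strings.Split(target, "/")
	for _, glob := range strings.Split(globs, ",") {
		if glob == "" {
			continue
		}
		// A pattern with n elements is compared with the first n elements
		// of the module path.
		n := strings.Count(glob, "/") + 1
		if len(elems) < n {
			continue
		}
		prefix := strings.Join(elems[:n], "/")
		if ok, _ := path.Match(glob, prefix); ok {
			return true
		}
	}
	return false
}

func main() {
	private := "corp.example.com,github.com/acme/*"
	mods := []string{
		"corp.example.com/team/tool",
		"github.com/acme/secret",
		"golang.org/x/mod",
	}
	for _, mod := range mods {
		fmt.Printf("%s direct=%v\n", mod, matchPrefixPatterns(private, mod))
	}
}
```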

Downloading in direct mode

Let's look at how direct mode works before we get into proxy details. Direct mode is the basis for most proxy implementations. Module files have to come from somewhere, after all.

Finding the module repository

The go command needs to clone a repository into the module cache. Before it can do that, it needs to look up the repository's URL.

Many modules are hosted on GitHub, and their URLs are derived directly from module paths. For example, github.com/foo/bar becomes https://github.com/foo/bar.git. This rule is hard-coded into the go command, along with rules for a couple of other services.

For modules outside those services, there are two ways to find the repository URL. First, the URL may be encoded directly into the module path. A qualified path has an element that ends with .git, .hg, .svn, .bzr, or .fossil. The go command can derive the repository URL directly from one of these paths. For example, example.com/repo.git/mod would be hosted at https://example.com/repo.git or ssh://example.com/repo.git.
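
As a sketch of that first rule, the function below scans a module path for an element with one of the version control suffixes and returns the repository root. `repoRootFromQualifiedPath` is a hypothetical helper, and the real go command applies more validation than this.

```go
package main

import (
	"fmt"
	"strings"
)

// vcsSuffixes maps the qualifiers named in the text to their tools.
var vcsSuffixes = map[string]string{
	".git":    "git",
	".hg":     "hg",
	".svn":    "svn",
	".bzr":    "bzr",
	".fossil": "fossil",
}

// repoRootFromQualifiedPath returns the part of modPath up to and including
// the first element that ends with a version control suffix.
func repoRootFromQualifiedPath(modPath string) (root, vcs string, ok bool) {
	elems := strings.Split(modPath, "/")
	for i, elem := range elems {
		if i == 0 {
			continue // the first element is the host name
		}
		for suffix, tool := range vcsSuffixes {
			if len(elem) > len(suffix) && strings.HasSuffix(elem, suffix) {
				return strings.Join(elems[:i+1], "/"), tool, true
			}
		}
	}
	return "", "", false
}

func main() {
	root, vcs, ok := repoRootFromQualifiedPath("example.com/repo.git/mod")
	fmt.Println(root, vcs, ok)
	if ok {
		fmt.Println("clone from https://" + root)
	}
}
```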

Second, the go command can also look up the URL for a custom module path (also known as a vanity path) by sending an HTTP GET request to a URL derived from the module's path. You've likely seen this for modules at golang.org, gopkg.in, or k8s.io. The request has the query string ?go-get=1 to distinguish it from other queries. For example, for the module golang.org/x/net, the go command sends a request for https://golang.org/x/net?go-get=1. It looks for an HTML <meta> tag in the response with the attribute name="go-import".

$ curl -L https://golang.org/x/net?go-get=1 | grep go-import
<meta name="go-import" content="golang.org/x/net git https://go.googlesource.com/net">

The content string in this tag has three fields separated by spaces: the root path (the prefix of the module path corresponding to the repository root), the version control tool (git, hg, svn, bzr, fossil, or mod), and the repository URL.
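
Here's a small sketch of pulling those three fields out of a page. It uses encoding/xml in non-strict mode to tolerate HTML, which I believe is close to what the go command does, but treat the details as my assumption. `goImport` and `parseGoImports` are illustrative names.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// goImport holds the three fields of a go-import meta tag.
type goImport struct {
	RootPath, VCS, RepoURL string
}

// parseGoImports scans page for <meta name="go-import"> tags and splits
// each content attribute into its three space-separated fields.
func parseGoImports(page string) []goImport {
	d := xml.NewDecoder(strings.NewReader(page))
	d.Strict = false // HTML is usually not well-formed XML
	var imports []goImport
	for {
		tok, err := d.RawToken()
		if err != nil {
			break // end of input or a syntax error we can't recover from
		}
		start, ok := tok.(xml.StartElement)
		if !ok || !strings.EqualFold(start.Name.Local, "meta") {
			continue
		}
		var name, content string
		for _, attr := range start.Attr {
			switch strings.ToLower(attr.Name.Local) {
			case "name":
				name = attr.Value
			case "content":
				content = attr.Value
			}
		}
		if name != "go-import" {
			continue
		}
		if f := strings.Fields(content); len(f) == 3 {
			imports = append(imports, goImport{RootPath: f[0], VCS: f[1], RepoURL: f[2]})
		}
	}
	return imports
}

func main() {
	page := `<html><head>
<meta name="go-import" content="golang.org/x/net git https://go.googlesource.com/net">
</head><body></body></html>`
	for _, imp := range parseGoImports(page) {
		fmt.Printf("root=%s vcs=%s repo=%s\n", imp.RootPath, imp.VCS, imp.RepoURL)
	}
}
```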

Custom paths are a nice option since you can change where your module is hosted without renaming it. However, if you have a private module, and you can't easily stand up an HTTP server (for example, on a restricted corporate network), then a qualified path is probably your best option.

Extracting an archive from a repository

After the go command has located the repository, it makes a local clone of the repository in the module cache using the appropriate tool like git, which must be installed and configured. Configuration is especially important for user credentials since the go command invokes git non-interactively, and you won't have a chance to enter a password. The Go FAQ has some advice on this.

Once the go command has cloned the repository, it creates an archive for the requested version using a command like git archive. This archive may contain unnecessary files, especially if the module is in a subdirectory of the repository, or if there are other nested modules. To remedy this, the go command copies each of the module's files from the repository archive into a separate zip file.

After the module zip file is verified, the go command extracts it into the module cache. The module's packages can then be built.
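
The filtering step can be sketched as a pure function over file names. This is a simplification under my own assumptions: it only handles the subdirectory and nested module cases described above, and `moduleZipNames` and the example paths are hypothetical. The `module@version/` prefix matches the zip listing shown later in this post.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// moduleZipNames picks the files that belong to the module rooted at modDir
// ("" for the repository root) and returns their names inside the module zip.
// Files in nested modules (directories with their own go.mod) are skipped.
func moduleZipNames(repoFiles []string, modDir, modPath, version string) []string {
	prefix := ""
	if modDir != "" {
		prefix = modDir + "/"
	}
	// Find nested module directories below modDir.
	nested := map[string]bool{}
	for _, f := range repoFiles {
		if path.Base(f) != "go.mod" || !strings.HasPrefix(f, prefix) {
			continue
		}
		dir := path.Dir(f)
		if dir == "." {
			dir = ""
		}
		if dir != modDir {
			nested[dir+"/"] = true
		}
	}
	var names []string
files:
	for _, f := range repoFiles {
		if !strings.HasPrefix(f, prefix) {
			continue // outside the module's directory
		}
		for dir := range nested {
			if strings.HasPrefix(f, dir) {
				continue files // belongs to a nested module
			}
		}
		names = append(names, modPath+"@"+version+"/"+strings.TrimPrefix(f, prefix))
	}
	sort.Strings(names)
	return names
}

func main() {
	repo := []string{"README.md", "go.mod", "sub/a.go", "tools/go.mod", "tools/main.go"}
	for _, name := range moduleZipNames(repo, "", "example.com/repo", "v1.2.3") {
		fmt.Println(name)
	}
}
```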

If you're curious or need to debug this process, you can see the git commands run by go get and go mod download by passing in the -x flag. You can also read more about version control systems in the module reference documentation.

Downloading from a proxy

The go command can download modules from a proxy using an HTTP-based protocol. This is often much faster than downloading modules from a source repository.

The GOPROXY protocol was designed to be stateless and is simple enough to be implemented with a static file server. The path structure matches the directories in the module cache, so you can actually use a module cache as a proxy with a file:// URL.

Proxies support the following endpoints:

Path	Description
`$module/@v/list`	Returns a list of known versions of the given module in plain text, one per line. This list should not include pseudo-versions.
`$module/@v/$version.info`	Returns JSON-formatted metadata about a version or a branch or tag name that resolves to a version. The JSON data contains a canonical version, and an optional timestamp.
`$module/@v/$version.mod`	Returns the go.mod file for a specific version of the module. If the module doesn't have a go.mod file, this endpoint should return a file with a `module` directive and nothing else.
`$module/@v/$version.zip`	Returns the content of the module for a specific version.
`$module/@latest`	Returns JSON metadata for the version of a module that the `go` command should use as `@latest` if the `$module/@v/list` endpoint is empty or contains no suitable versions. The returned metadata is in the same format as `$module/@v/$version.info`. This endpoint is optional. Not all proxies implement it.
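
To show how those paths fit together, here's a sketch that builds the endpoint URLs for a module. One detail the table doesn't mention: the real protocol escapes uppercase letters in module paths and versions as an exclamation mark followed by the lowercase letter, so paths stay unambiguous on case-insensitive file systems. The helper names are mine, not an official API.

```go
package main

import (
	"fmt"
	"strings"
)

// escapePath replaces each uppercase ASCII letter with '!' followed by its
// lowercase form, as the proxy protocol does for module paths and versions.
func escapePath(s string) string {
	var b strings.Builder
	for _, r := range s {
		if 'A' <= r && r <= 'Z' {
			b.WriteByte('!')
			b.WriteRune(r + 'a' - 'A')
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// listURL returns the URL of the $module/@v/list endpoint.
func listURL(proxy, modPath string) string {
	return strings.TrimSuffix(proxy, "/") + "/" + escapePath(modPath) + "/@v/list"
}

// versionURL returns the URL of a $module/@v/$version.$ext endpoint,
// where ext is "info", "mod", or "zip".
func versionURL(proxy, modPath, version, ext string) string {
	return strings.TrimSuffix(proxy, "/") + "/" + escapePath(modPath) +
		"/@v/" + escapePath(version) + "." + ext
}

func main() {
	const proxy = "https://proxy.golang.org"
	fmt.Println(listURL(proxy, "golang.org/x/mod"))
	for _, ext := range []string{"info", "mod", "zip"} {
		fmt.Println(versionURL(proxy, "golang.org/x/mod", "v0.4.2", ext))
	}
	fmt.Println(listURL(proxy, "github.com/BurntSushi/toml"))
}
```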

Downloading a module from a proxy may be much faster than downloading the same module from its source repository for two reasons evident from this protocol. First, the go command doesn't need to download an entire repository or even an entire commit. The .zip endpoint provides a snapshot of one module at one version and nothing more. Second, unless a module's packages are actually needed for a build, the go command only needs to download the .mod file for version selection; it can skip downloading the .zip file.

Let's pretend we're the go command and walk through the process of fetching the latest version of a module with curl. You can also visit these URLs in your browser. Suppose we're running the command go get golang.org/x/mod@latest.

First, we need the list of versions.

$ curl -L https://proxy.golang.org/golang.org/x/mod/@v/list
v0.3.0
v0.4.0
v0.4.1
v0.1.0
v0.2.0
v0.4.2

I don't know why they're not sorted. Anyway, v0.4.2 is the highest version at the time of this writing.
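
Since the list isn't sorted, the client has to pick the highest version itself. Here's a deliberately simplified sketch that only understands plain vMAJOR.MINOR.PATCH versions and skips anything else (pre-releases, build metadata). For real comparisons, golang.org/x/mod/semver is the better choice. `parseVersion`, `less`, and `highest` are names I made up.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion parses "vMAJOR.MINOR.PATCH" into three integers.
func parseVersion(v string) (nums [3]int, ok bool) {
	if !strings.HasPrefix(v, "v") {
		return nums, false
	}
	parts := strings.Split(v[1:], ".")
	if len(parts) != 3 {
		return nums, false
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return nums, false
		}
		nums[i] = n
	}
	return nums, true
}

// less reports whether version a is lower than version b.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// highest returns the highest release version in a newline-separated list,
// or "" if the list has no version this sketch can parse.
func highest(list string) string {
	best, bestNums := "", [3]int{}
	for _, v := range strings.Fields(list) {
		nums, ok := parseVersion(v)
		if !ok {
			continue
		}
		if best == "" || less(bestNums, nums) {
			best, bestNums = v, nums
		}
	}
	return best
}

func main() {
	list := "v0.3.0\nv0.4.0\nv0.4.1\nv0.1.0\nv0.2.0\nv0.4.2\n"
	fmt.Println(highest(list)) // prints v0.4.2
}
```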

We'll fetch its metadata. Note that for a canonicalized version like v0.4.2, the metadata isn't that useful, and it's not strictly necessary for the go command to fetch it. It would be more important if we wanted to check what version a branch name corresponds to.

$ curl -L https://proxy.golang.org/golang.org/x/mod/@v/v0.4.2.info
{"Version":"v0.4.2","Time":"2021-03-09T22:22:12Z"}
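
Decoding that response takes only a small struct. This is a sketch: `versionInfo` and `parseInfo` are my own names, and the struct only covers the two fields shown above. If the optional timestamp is missing, `Time` is simply left at its zero value.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// versionInfo mirrors the JSON returned by the .info endpoint.
type versionInfo struct {
	Version string    // canonical version
	Time    time.Time // commit time; optional
}

// parseInfo decodes the body of a .info response.
func parseInfo(data []byte) (versionInfo, error) {
	var info versionInfo
	err := json.Unmarshal(data, &info)
	return info, err
}

func main() {
	body := []byte(`{"Version":"v0.4.2","Time":"2021-03-09T22:22:12Z"}`)
	info, err := parseInfo(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(info.Version, info.Time.Format("2006-01-02"))
}
```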

Next, we'll fetch the .mod file:

$ curl -L https://proxy.golang.org/golang.org/x/mod/@v/v0.4.2.mod
module golang.org/x/mod

go 1.12

require (
  golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550
  golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e
  golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898
)

And finally, the .zip file:

$ curl -L -O https://proxy.golang.org/golang.org/x/mod/@v/v0.4.2.zip
$ unzip -l v0.4.2.zip | head
Archive:  v0.4.2.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     1479  1980-00-00 00:00   golang.org/x/mod@v0.4.2/LICENSE
     1303  1980-00-00 00:00   golang.org/x/mod@v0.4.2/PATENTS
      660  1980-00-00 00:00   golang.org/x/mod@v0.4.2/README.md
       21  1980-00-00 00:00   golang.org/x/mod@v0.4.2/codereview.cfg
      214  1980-00-00 00:00   golang.org/x/mod@v0.4.2/go.mod
     1476  1980-00-00 00:00   golang.org/x/mod@v0.4.2/go.sum
     5224  1980-00-00 00:00   golang.org/x/mod@v0.4.2/gosumcheck/main.go

Implementing a proxy

At this point, you might be wondering where a proxy gets the modules it serves, perhaps so you can run your own proxy. There are a few different ways. You could build a static proxy that serves modules from a directory you populate manually with go mod download in direct mode.

export GOMODCACHE=/srv/modcache
mkdir -p $GOMODCACHE
export GOPROXY=direct
go mod download example.com/mod@v1.2.3
# serve files from /srv/modcache/cache/download

You could also build a proxy that serves files from a module cache and runs go mod download on each cache miss. You could scale that with multiple instances using shared storage.
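
The static variant really can be a plain file server. Here's a minimal sketch that serves a directory laid out like the cache's download directory. To keep it self-contained, it builds a throwaway directory for the hypothetical module example.com/mod and sends one in-memory request instead of listening on a port. `newStaticProxy`, `makeDemoCache`, and `get` are illustrative helpers.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
)

// newStaticProxy serves the files under downloadDir, which is expected to be
// laid out like $GOMODCACHE/cache/download.
func newStaticProxy(downloadDir string) http.Handler {
	return http.FileServer(http.Dir(downloadDir))
}

// makeDemoCache creates a temporary directory containing a single "list"
// file for the hypothetical module example.com/mod.
func makeDemoCache() (string, error) {
	dir, err := os.MkdirTemp("", "modcache-demo")
	if err != nil {
		return "", err
	}
	vdir := filepath.Join(dir, "example.com", "mod", "@v")
	if err := os.MkdirAll(vdir, 0o755); err != nil {
		return "", err
	}
	return dir, os.WriteFile(filepath.Join(vdir, "list"), []byte("v1.2.3\n"), 0o644)
}

// get sends one in-memory GET request to h and returns the status and body.
func get(h http.Handler, urlPath string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, urlPath, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	dir, err := makeDemoCache()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(dir)

	// A real server would call http.ListenAndServe(addr, newStaticProxy(dir)).
	code, body := get(newStaticProxy(dir), "/example.com/mod/@v/list")
	fmt.Printf("%d %q\n", code, body)
}
```

A proxy that fills itself on cache misses would wrap this handler and run go mod download before retrying the request.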

If you're interested in running a private proxy on your own network, check out The Athens Project.

Verifying downloaded modules

By default, the go command downloads publicly available modules from proxy.golang.org, a module mirror operated by Google. Anyone can operate a proxy though, which leads to an interesting security question: how can you verify that the modules you download from a proxy are genuine? Actually, the same question applies in direct mode: how do you know the repository you cloned hasn't been tampered with?

The go command uses two mechanisms to ensure downloaded files haven't changed since they were first downloaded from the source repository: go.sum files, and the global checksum database.

go.sum

Each module has a go.sum file stored next to its go.mod file. go.sum contains a list of hashes of .mod and .zip files for the module's dependencies. It looks like this:

golang.org/x/mod v0.4.1 h1:Kvvh58BN8Y9/lBi7hTekvtMpm07eUZ0ck5pRHpsMWrY=
golang.org/x/mod v0.4.1/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA=

Each line has three fields: a module path, a version, and a base64-encoded SHA-256 sum. If the version has a /go.mod suffix, the sum is for the .mod file; otherwise it's for the .zip file. Instead of hashing the .zip file itself, the go command hashes its files in a deterministic order. Consequently, the hash isn't sensitive to file order, compression, alignment, or metadata. Unfortunately, this violates the Cryptographic Doom Principle.
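
Here's a sketch of that hashing scheme as I understand it from golang.org/x/mod/sumdb/dirhash: hash each file, write one summary line per file in sorted name order, then hash the summary. Treat the exact line format as my assumption and use the dirhash package for anything real. `hash1` and the file contents are illustrative.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"sort"
)

// hash1 computes an "h1:" style hash over a set of files. The files map
// goes from names inside the zip (like "example.com/mod@v1.2.3/go.mod")
// to file contents.
func hash1(files map[string][]byte) string {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order, independent of zip layout

	summary := sha256.New()
	for _, name := range names {
		// One line per file: hex SHA-256 of the contents, two spaces, name.
		fmt.Fprintf(summary, "%x  %s\n", sha256.Sum256(files[name]), name)
	}
	return "h1:" + base64.StdEncoding.EncodeToString(summary.Sum(nil))
}

func main() {
	files := map[string][]byte{
		"example.com/mod@v1.2.3/go.mod": []byte("module example.com/mod\n"),
		"example.com/mod@v1.2.3/mod.go": []byte("package mod\n"),
	}
	fmt.Println(hash1(files))
}
```

Because only names and contents feed into the hash, repacking the zip with different compression or timestamps doesn't change the result.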

When the go command downloads a file, it hashes it and checks go.sum. If go.sum contains a different hash, the go command reports a security error. If go.sum does not contain a hash for the file, the go command trusts the file (perhaps after consulting the checksum database) and adds the hash.
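
Those three outcomes can be sketched like this. It's a simplification that keeps one hash per module version and ignores the checksum database. `parseGoSum`, `check`, and the `verdict` values are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// verdict is the result of checking a downloaded file against go.sum.
type verdict int

const (
	verified verdict = iota // go.sum has the same hash
	mismatch                // go.sum has a different hash: security error
	unknown                 // go.sum has no hash yet: trust and record it
)

// parseGoSum maps "path version" keys to hashes, skipping malformed lines.
func parseGoSum(data string) map[string]string {
	sums := map[string]string{}
	for _, line := range strings.Split(data, "\n") {
		if f := strings.Fields(line); len(f) == 3 {
			sums[f[0]+" "+f[1]] = f[2]
		}
	}
	return sums
}

// check compares the hash of a downloaded file with the go.sum entry.
func check(sums map[string]string, modPath, version, hash string) verdict {
	want, ok := sums[modPath+" "+version]
	switch {
	case !ok:
		return unknown
	case want == hash:
		return verified
	default:
		return mismatch
	}
}

func main() {
	sums := parseGoSum(`golang.org/x/mod v0.4.1 h1:Kvvh58BN8Y9/lBi7hTekvtMpm07eUZ0ck5pRHpsMWrY=
golang.org/x/mod v0.4.1/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA=
`)
	fmt.Println(check(sums, "golang.org/x/mod", "v0.4.1", "h1:Kvvh58BN8Y9/lBi7hTekvtMpm07eUZ0ck5pRHpsMWrY=") == verified)
	fmt.Println(check(sums, "golang.org/x/mod", "v0.4.1", "h1:tampered") == mismatch)
	fmt.Println(check(sums, "golang.org/x/mod", "v0.4.2", "h1:anything") == unknown)
}
```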

This ensures that if multiple people are working together on the same module, they'll be downloading the same set of dependencies. The go command reports an error if a malicious proxy serves different files, or if a repository is taken over and its version tags are changed.

Checksum database

go.sum doesn't completely address the threat. How can you verify a file is authentic the first time you download it, when go.sum doesn't have a hash?

To answer this, Google operates sum.golang.org, an auditable checksum database. The checksum database functions a little like a giant go.sum file for all versions of all publicly available modules. The go command consults this database when downloading files that don't have hashes in go.sum. (Modules matched by GOPRIVATE or GONOSUMDB won't be checked.)

If you're interested in learning more, check out Proposal: Secure the Public Go Module Ecosystem, which describes how the system works and discusses engineering tradeoffs.

Conclusion

Go modules are a lot easier to manage than GOPATH was, at least in my opinion. But that ease comes with a tradeoff: magic. There's a lot of hidden complexity that makes problems difficult to fix (or to explain) when something goes wrong.

I've been working on modules in the go command for a little over two years now. We've improved the user experience quite a bit in that time, and I've personally written a lot of documentation. I still think it's hard for people to understand what's going on, particularly when things don't "just work". We'll keep improving though, and I think we'll end up with a great experience while preserving that ease.

If you're interested in learning more about Google's module proxy and checksum database, I'd highly recommend Katie Hockman's GopherCon 2019 talk, Go Module Proxy: Life of a Query. Katie led the team that built these services, and she presents their design in a very accessible form.