Minibox: A miniature Linux container runner
I've been curious about how Linux containers work for a long time. I've played around with Docker, and it basically seems like magic. I decided to learn more about them by writing my own tiny container implementation.
Background
Containers in Linux are implemented using namespaces, which are a relatively new virtualization mechanism. I mean virtualization in the computer science sense here: isolating a process's view of the rest of the system from other processes. Some resources have always been virtualized in Linux: access to memory (virtual memory) and the CPU (preemptive scheduling). Most global system resources were not virtualized before namespaces came along: all processes had the same view of the file system, the network, user IDs, and IPC.
In 2002, mount namespaces were added. Each mount namespace has its own mount table, so processes in different mount namespaces have different views on which filesystems are mounted, and where. Additional namespaces were added after 2006: there are now namespaces for process IDs, network, interprocess communication, UTS (host and domain name), user IDs, and control groups.
Namespaces are the foundation of containers. A container is a collection of programs and resources that run in isolation from the rest of the processes on a system. Docker, at its lowest level, is a layer of packaging and configuration on top of the raw functionality built in the kernel.
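To make namespaces a little more concrete, here is a minimal sketch that lists the namespaces the current process belongs to. It relies on the symlinks under /proc/self/ns, whose targets look like mnt:[4026531840]; two processes are in the same namespace when the inode numbers match. The helper name parseNsLink is my own, not part of minibox.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseNsLink (a hypothetical helper) splits a namespace link target such as
// "mnt:[4026531840]" into the namespace type and its inode number.
func parseNsLink(target string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("bad inode in %q: %w", target, err)
	}
	return kind, inode, nil
}

func main() {
	// Each entry in /proc/self/ns is a symlink naming one namespace.
	links, _ := filepath.Glob("/proc/self/ns/*")
	for _, link := range links {
		target, err := os.Readlink(link)
		if err != nil {
			continue
		}
		kind, inode, err := parseNsLink(target)
		if err != nil {
			continue
		}
		fmt.Printf("%-20s inode %d\n", kind, inode)
	}
}
```

Running this inside and outside a container should show different inode numbers for whichever namespaces the container runner created.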
Minibox
In order to understand containers better, I built a tiny, crappy version of Docker. You can find it at github.com/jayconrod/minibox.
My goal was to package a statically linked program and some related files in an ext2 disk image, mount that at the file system root inside a mount namespace, then execute the program. All configuration is done on the command line. I only used a mount namespace; processes in this container are not isolated from the network or anything else.
I implemented everything in Go. Go provides easy access to system calls via wrappers in golang.org/x/sys/unix. There were a few things that didn't quite work, so I dropped down into C in a couple places. I avoided using C as much as possible though since string manipulation and memory management are an absolute pain.
Demo
Before we dive into the code, I'll show how to use this thing. You'll need to have Go installed, and you'll need root access on your system.
First, download the project.
$ go get -d -u github.com/jayconrod/minibox
$ cd $GOPATH/src/github.com/jayconrod/minibox
Build the program that will run in the container. This can be any program. I wrote something simple that prints out some environmental information, lists files in its directory, then exits.
$ go build list-files.go
Run the bash script to create a disk image. This script fills a 32MB file with zeroes, formats it with mkfs, mounts it at /mnt, copies list-files there, creates a few other files in there (just so list-files has something to look at), then unmounts the image. The script needs to be run with sudo because it uses mount.
$ sudo ./build-image.bash
Next, build minibox, the container runner.
$ go build minibox.go
Run it like this:
$ sudo ./minibox \
    -image mini.ext2 \
    -fstype ext2 \
    -dir /mnt \
    -entry /list-files \
    -uid 1000 \
    -gid 1000
You should see this:
invoked as: /list-files
uid: 1000
gid: 1000
environment:
files:
.
bar
baz
foo
list-files
lost+found
Implementation
Ok, let's dive into the implementation of our container runner. This program sets up the container, executes the program inside it, then tears down anything that needs to be torn down. The container runner must run as root, since nearly all of the system calls it uses are privileged.
Step 1: Create new namespaces for the current process with unshare. CLONE_NEWNS creates a new mount namespace; CLONE_FS stops sharing the root directory and the current directory with other processes. If you want to create a new namespace and a new process at the same time, you can pass CLONE_NEWNS to clone instead. Don't pass CLONE_FS there, though: for clone it has the opposite meaning (the child shares that state with its parent), and the kernel rejects the combination with CLONE_NEWNS.
if err := unix.Unshare(CLONE_FS | CLONE_NEWNS); err != nil {
	return 1, errors.Wrap(err, "unshare")
}
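One Go-specific wrinkle: unshare changes the namespace of the calling thread only, and the Go scheduler is free to move a goroutine between OS threads. Here is a minimal sketch of how one might pin the goroutine first and then confirm that the namespace actually changed. It uses the standard syscall package instead of golang.org/x/sys/unix to stay self-contained, assumes /proc/thread-self exists (Linux 3.17 or later), and the helper names are mine. The unshare call needs root to succeed.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"syscall"
)

// threadMountNS returns the mount namespace of the calling thread,
// for example "mnt:[4026531840]".
func threadMountNS() (string, error) {
	return os.Readlink("/proc/thread-self/ns/mnt")
}

// enterNewMountNS pins the goroutine to its OS thread and moves that
// thread into a fresh mount namespace.
func enterNewMountNS() error {
	runtime.LockOSThread()
	if err := syscall.Unshare(syscall.CLONE_NEWNS); err != nil {
		runtime.UnlockOSThread()
		return fmt.Errorf("unshare: %w", err)
	}
	// On success the thread stays locked on purpose: it now differs from
	// the other threads and should not be reused by other goroutines.
	return nil
}

func main() {
	before, _ := threadMountNS()
	if err := enterNewMountNS(); err != nil {
		fmt.Println("could not create a mount namespace (are you root?):", err)
		return
	}
	after, _ := threadMountNS()
	fmt.Println("before:", before)
	fmt.Println("after: ", after)
}
```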
Step 2: Mount the disk image. This was more complicated than I thought it would be. When you mount a disk image on the command line, you can pass the disk image directly to the mount program, and it knows what to do. The mount system call is not as smart though; it can only mount device files. So we need to configure a loop device and mount that. I'll break this into substeps. I translated this from the example on the loop(4) man page.
Step 2a: open /dev/loop-control and find a free loop device. Linux has a number of loop devices in /dev. Some of them may be in use, so you can use /dev/loop-control to find which one to use.
If you aren't familiar with Linux device files, you can open, read, and write them as if they were normal files. These operations are handled by device drivers in the kernel. Those drivers can send signals to the hardware to do something like playing sound or sending packets on the network in response to reads and writes. Of course, you can't do everything with read and write. For small miscellaneous operations, there's ioctl, which we use below. ioctl takes an open file descriptor, a request number (which has some device-specific meaning), and more untyped arguments interpreted by that request. Most requests simply get or set a number, so Go provides small wrappers for ioctl such as IoctlGetInt and IoctlSetInt. LOOP_CTL_GET_FREE is a little different: it hands back the device number as the return value of ioctl itself rather than through a pointer argument, so the code uses IoctlRetInt for it.
loopctlFd, err := unix.Open("/dev/loop-control", syscall.O_RDWR, 0)
if err != nil {
	return 1, errors.Wrap(err, "open /dev/loop-control")
}
defer closeFn("/dev/loop-control", loopctlFd)
// The free device number is the result of the ioctl call itself.
devNum, err := unix.IoctlRetInt(loopctlFd, LOOP_CTL_GET_FREE)
if err != nil {
	return 1, errors.Wrap(err, "ioctl LOOP_CTL_GET_FREE")
}
Step 2b: open the disk image and bind its file descriptor to /dev/loopN, where N is the number we got from /dev/loop-control. The binding is done with another ioctl, LOOP_SET_FD. Once bound, /dev/loopN will act like a disk with the disk image as its backing store. We'll need to do LOOP_CLR_FD when we clean up later.
loopDevName := fmt.Sprintf("/dev/loop%d", devNum)
loopFd, err := unix.Open(loopDevName, syscall.O_RDWR, 0)
if err != nil {
	return 1, errors.Wrapf(err, "open %s", loopDevName)
}
defer closeFn(loopDevName, loopFd)
imageFd, err := unix.Open(image, syscall.O_RDWR, 0)
if err != nil {
	return 1, errors.Wrapf(err, "open %s", image)
}
defer closeFn(image, imageFd)
if err := unix.IoctlSetInt(loopFd, LOOP_SET_FD, imageFd); err != nil {
	return 1, errors.Wrap(err, "ioctl LOOP_SET_FD")
}
defer func() {
	_, clearErr := unix.IoctlGetInt(loopFd, LOOP_CLR_FD)
	if clearErr != nil && err == nil {
		err = clearErr
	}
}()
Step 2c: mount the loop device like a normal disk.
if err := unix.Mount(loopDevName, dir, fstype, 0, ""); err != nil {
	return 1, errors.Wrap(err, "mount")
}
Step 3: Create a child process with fork, and wait for it to complete with wait4. fork creates a new process by duplicating the calling process. The child process will eventually execute the program inside the container with execve, but we need to make a few other system calls first.
Unfortunately, Go doesn't provide a standalone version of fork; it only has ForkExec, which glues fork and execve together. As far as I understand, this is because fork only duplicates the calling thread, and the Go runtime has some background threads (garbage collector?) that it needs to keep running. When I asked my coworkers about this, they said don't worry about it, just use RawSyscall, YOLO. RawSyscall it is then. It seems to work well enough for this demo, but I wouldn't use this in production code.
// RawSyscall returns the result as a uintptr, which is never negative,
// so failure has to be detected through errno.
pid, _, errno := unix.RawSyscall(uintptr(C.SYS_fork), 0, 0, 0)
if errno != 0 {
	return 1, errors.Wrap(errno, "fork")
}
After the fork, the parent process waits for the child process to exit or crash with wait4. wait4 can also return when the child process is stopped or resumed (if the child is being traced, or if the WUNTRACED or WCONTINUED options are set). We don't care about that, so we call it in a loop and check why it returned.
for {
	var status unix.WaitStatus
	if _, err = unix.Wait4(int(pid), &status, 0, nil); err != nil {
		return 1, errors.Wrap(err, "wait4")
	}
	if status.Signaled() {
		return 1, errors.Errorf("process terminated by signal %v", status.Signal())
	}
	if status.Exited() {
		return status.ExitStatus(), nil
	}
	if status.Stopped() || status.Continued() {
		continue
	}
	return 1, errors.Errorf("unknown return from wait: %x", status)
}
The remaining steps occur inside the child process.
Step 4: Make the mounted container image the new file system root.
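Step 4a: Create a directory called .old_root inside the container image.
oldRootDir := filepath.Join(dir, ".old_root")
// The directory may be left over from an earlier run; any other error is fatal.
if err := os.Mkdir(oldRootDir, 0700); err != nil && !os.IsExist(err) {
	log.Fatal(errors.Wrap(err, "mkdir"))
}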
Step 4b: Call pivot_root. This makes the container image the new root of the file system and moves the old root to .old_root. Also, change the current directory to /. pivot_root is vaguely specified and may leave the current directory in an indeterminate state, so it's best to set it explicitly.
if err := unix.PivotRoot(dir, oldRootDir); err != nil {
	log.Fatal(errors.Wrap(err, "pivot_root"))
}
if err := os.Chdir("/"); err != nil {
	log.Fatal(errors.Wrap(err, "chdir"))
}
Step 4c: unmount the old root file system and remove the .old_root directory. At this point, the old file system should no longer be visible.
if err := unix.Unmount("/.old_root", MNT_DETACH); err != nil {
	log.Fatal(errors.Wrap(err, "unmount"))
}
if err := os.Remove("/.old_root"); err != nil {
	log.Fatal(errors.Wrap(err, "remove"))
}
Step 5: Drop privileges with setgid and setuid. setgid needs to be called first because both calls require root privilege. These system calls are both in golang.org/x/sys/unix, but when I wrote this, their implementations just returned an "operation not supported" error instead of doing something useful. RawSyscall once again, I guess.
// As with fork, failure is reported through errno; the uintptr result is never negative.
if _, _, errno := unix.RawSyscall(uintptr(C.SYS_setgid), uintptr(gid), 0, 0); errno != 0 {
	log.Fatal(errors.Wrap(errno, "setgid"))
}
if _, _, errno := unix.RawSyscall(uintptr(C.SYS_setuid), uintptr(uid), 0, 0); errno != 0 {
	log.Fatal(errors.Wrap(errno, "setuid"))
}
Step 6: Execute the program inside the container. execve is the system call we want to use. This executes a program in the current process (the child process). Most of the process's state (virtual memory) is dropped and replaced with the new program. Some state is preserved: file descriptors are left open, so the child process can still read and write stdin and stdout. Command line arguments and environment variables are passed in explicitly through the execve call.
Besides ForkExec, Go wraps execve as syscall.Exec (and unix.Exec). I didn't use it here: passing string arguments through RawSyscall didn't sound like fun to me, so I ended up writing my own wrapper in cgo. I joined the arguments into a single string with a NUL byte after each argument.
Here's the Go side of things:
cEntry := C.CString(entry)
cArgc := C.int(len(flag.Args()))
cArgstr := C.CString(strings.Join(flag.Args(), "\x00") + "\x00")
C.execWrapper(cEntry, cArgc, cArgstr)
And the C side:
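#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void execWrapper(char* path, int argc, char *argstr) {
  char** argv = malloc((argc+2) * sizeof(char*));
  argv[0] = path;
  // argstr holds argc strings back to back, each ending in a NUL byte.
  for (int i = 0; i < argc; i++) {
    argv[i+1] = argstr;
    argstr += strlen(argstr) + 1;
  }
  argv[argc+1] = NULL;
  execve(path, argv, NULL);
  // execve only returns on failure. Report the error before free can clobber errno.
  perror("execve");
  free(argv);
  exit(1);
}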
Conclusion
This was a fun demo to write. I learned quite a bit about how containers are implemented, and I got to play with system calls for the first time in a while.
Once again, you can find the full implementation at github.com/jayconrod/minibox. If you decide to hack on this, here are a few things to keep in mind:
- This implementation provides basically no security. If the container image has a setuid root program, it's easy to regain root and escape the container. You can harden this a little by mounting with the MS_NOSUID option and creating a user namespace. Personally, I'm not confident I could create a secure implementation without a lot more experience; there are a lot of subtleties here.
- Anything that runs inside the container needs to be statically linked (unless you want to install the dynamic loader and a bunch of .so files in the container image, too). Pure Go programs are great for this, but once you mix in some cgo, it gets more difficult.
- This reminded me of building Linux From Scratch in a chroot jail ~15 years ago. It looks like some people are trying this out with Docker now.
- There's a lot of potential for flexible configuration. I just used command line options, since that was simplest, but it's easy to imagine a manifest file (Dockerfile) and multiple disk images as layers.
- If you want a simpler command line tool to try out, unshare is a small wrapper around the unshare system call. It ships with most Linux distributions.
Happy hacking!