An update on Gypsum and CodeSwitch
Observant readers will notice I haven't written anything about Gypsum or CodeSwitch in a while. There hasn't been any activity in the repository either. I didn't really intend to put the project on hold for so long, but I've been focused on work for the last several months with conferences, big migrations, lots of features to implement very quickly, and lots of bugs to fix.
More about that another time. Work has reached a manageable pace, and I'm ready to start tinkering on side projects again. It's time for a change in direction though: I plan to focus more on CodeSwitch and less on Gypsum (probably a lot less).
Thoughts on Gypsum
Every new programming language needs a niche: something it does better than other languages. I had a hard time finding that with Gypsum. When I started the project, I just wanted to experiment with some language ideas without trying to find a unifying theme. I ended up with a mix of features that weren't really useful together. I think I have a somewhat better grasp on this now. What I want to create is a data oriented language.
- The language should provide a rich set of built-in types (lists, dictionaries, tuples, structs, functions, etc.). These should be built into the language itself instead of the standard library.
- New types (classes, traits) should be easy to define with minimal redundancy. In most cases, you should not need to write constructors, getters, setters, etc.
- Types should be expressive without getting in the way. It should be possible to say what you mean with types and create meaningful restrictions. Features like type parameter variance, named types, and mutability are positive examples in this category.
- The language should provide control over memory allocation and layout. It should be clear when memory is being allocated, and it should be allocated on the stack when possible. Pointer chasing between objects should be minimized.
- The language should provide flexible controls for building and accessing collections of data. For example, Python comprehensions, Scala for-loops, and pattern matching are incredibly useful here.
Gypsum had a few data oriented features, and I had designs for others. I was quite happy with pattern matching. I liked how classes and traits turned out, especially with arrayelements, although the syntax was a little awkward. I had plans for templated type parameterization. That would have allowed non-reference types to be used as type arguments, which would reduce pointer chasing.
There were many things I was unhappy about though. When I stopped working on Gypsum, I was in the middle of a major syntax overhaul. I originally loved Python-style white space, but I became frustrated with the difficulty of supporting multi-line lambdas and other constructs I think are interesting. I stopped liking the ambiguity of Scala-style property expressions, where parentheses can be omitted from function calls without arguments. It makes the code look cleaner but less readable, since it's unclear whether a function is being called or a field is being accessed.
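The property ambiguity is easy to reproduce in other languages. The sketch below uses Python's `@property` as a rough analog, not Scala syntax; the `Buffer` class and its members are hypothetical. At the use site, a stored field and a computed value look identical.

```python
class Buffer:
    def __init__(self, items):
        self.items = list(items)
        # A plain field: reading it is a simple attribute lookup.
        self.capacity = 16

    @property
    def total(self):
        # Reads like a field at the use site, but walks the whole list
        # every time it is accessed.
        return sum(self.items)


buf = Buffer([1, 2, 3])

# Nothing on these two lines tells the reader which one runs code.
field_value = buf.capacity
computed_value = buf.total
```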
My biggest gripe was with the day-to-day problems of writing a largish program in Python 2. I really missed type checking, and I wasted a lot of time debugging stupid problems the compiler should have told me about. My initial intent was to write a bootstrap compiler as quickly as possible, then rewrite it in Gypsum. Since this would just be a quick prototype, Python seemed like a good choice among the languages I knew at the time. The rewrite never happened though. I always wanted to implement more language features, since they would make it easier to write the new compiler. But each feature got harder to add as the language got more complex, and I ended up redesigning parts of the compiler repeatedly. I expect it will be easier next time around, as I have a better understanding of how it should look at the end.
The future of CodeSwitch
There are still a lot of things I want to do with CodeSwitch. It has a more well-defined niche than Gypsum did: CodeSwitch should be a fast, embeddable, lightweight virtual machine that supports a wide variety of languages.
The principal challenge of building such a machine is supporting all the types and data structures from those languages in a way that lets code written in different languages interoperate. Varying semantics is also a challenge, but types are the bigger issue.
In terms of next steps, I'm going to write a proper design document for the core of CodeSwitch. It needs to account for memory management, concurrency, package management, JIT optimization, and tracing / debugging at least. Until now, CodeSwitch has grown organically according to Gypsum's needs. It's difficult to refactor a system like this, since a small change affects everything. It's important to start with a good design.
With a design in hand, following the rule of three, I want to rebuild CodeSwitch to support three very different languages. If I can execute code from three languages that have relatively little in common with each other and get them to interoperate, I'll be confident my design is sound. Here are the languages I'm considering now:
- C#: Probably the most obvious choice. Being a multi-language VM, CodeSwitch is conceptually similar to the CLR, so it makes sense to start with CIL bytecode. This covers a lot of types and features and frees me from having to write a compiler.
- Go: A relatively simple systems programming language. Structs, interfaces, and pointers are quite different from classes, so this will add some diversity to the type system. This will also get me to think about unsafe code and C/C++ interoperation.
- Scheme: A Lisp-like language with some unusual semantics (call/cc). Easy to parse and compile, but supporting the semantics in the VM may be challenging. Unlike the other two, Scheme is dynamically typed.
Before proceeding with any of this though, I need to do more research into the CLR. I've done very little work on Windows, and I've never written anything substantial with C#. There are a lot of language and VM lessons to learn there.