Writing a bootstrapping compiler and the progressive translation of Derw - July 2022
This post is a little later than usual, but there’s a good reason for it! I’ve been diving deep into translating Derw’s compiler, currently written in TypeScript, into Derw. At this point roughly 44% of Derw’s compiler is written in Derw.
This is a big deviation from alternative web-focused ML languages like Elm and PureScript, which are written in Haskell, and Reason, which is written in OCaml. The process of rewriting is pretty rewarding, so I’ll share with you how exactly bootstrapping works.
Bootstrapping
To paraphrase Wikipedia: compiler bootstrapping is when the compiler for a language is written in that language itself. Typically it involves writing a compiler in another language initially. Once the compiler has reached a point where it can compile all the needed code, the compiler is rewritten in the target language.
Why
A number of languages bootstrap themselves, but for me the most memorable is probably Go. Go’s compiler was originally written in C, until a point was reached where it was possible to machine-translate the C code into Go, and then use the Go compiler from the previous release to build it. I’ve heard Go developers talk about how this approach has helped with bug fixing, performance, and simplicity. It makes sense that a large, complicated project like a compiler will help reveal bugs: there are a lot of steps involved in a compiler, including complicated algorithms, I/O, and string manipulation. Compilers need to be fast for good developer experience, so a compiler written in the target language will drive performance-related refactors. It should be as easy as possible to contribute to the compiler, and that’s easiest for Derw developers if Derw is written in Derw, since they already know Derw. That’s a lot of Derw.
More info on the Go process can be found here and here.
How
As mentioned, the first step was to have a compiler for Derw written in some other language: I chose TypeScript mainly because Derw’s generated code would be mostly TypeScript, and having interop between the two would be beneficial to real-world usage of Derw.
The compiler is composed of several parts: a lexer, a parser, code generation, and CLI tooling. Pure functions, like those for code generation, parsing, and lexing, are the easiest to move over. These map directly to pure Derw functions, and often have similar performance characteristics thanks to the compiled Derw output being performant. Impure functions, such as those that involve writing and reading from files, require a bit more work. Derw supports async/await through do-notation, but additions were needed to support a direct translation of impure TS code into impure Derw code, such as wrapping generated code in async/await wrappers and allowing top-level control expressions (if and case..of).
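To make the async/await point more concrete, here is a hand-written TypeScript sketch of the shape an impure function takes when every step is awaited in order, which is the behaviour do-notation is described as providing. It is not output from the Derw compiler: readFileFake and compileFile are hypothetical names, and the fake read only stands in for real file I/O.

```typescript
// Hypothetical stand-in for an impure file read; not part of Derw or Node.
function readFileFake(path: string): Promise<string> {
    return Promise.resolve(`contents of ${path}`);
}

// Hand-written approximation of an impure function: it is marked async,
// and each step is awaited before the next one runs.
async function compileFile(path: string): Promise<string> {
    const source = await readFileFake(path);
    // Placeholder "compilation" step so the sketch has something to await.
    const output = await Promise.resolve(source.toUpperCase());
    return output;
}

// Usage: the caller receives a Promise and decides how to consume it.
compileFile("src/Main.derw").then((output) => console.log(output));
```

The point of the sketch is only that the impure TypeScript and its translation can keep the same step-by-step structure, which is what makes a direct translation practical.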
Moving one file across at a time was quite easy: in Derw, types, functions and constants can be imported from TypeScript files. Likewise, from TypeScript, it is possible to import types, functions and constants from compiled Derw files. This meant that the workflow roughly went as follows:
Take a file, for example src/generators/ts.ts
Create a file matching the Derw naming convention but with the suffix _derw to avoid naming collisions (for example src/generators/Ts_derw.derw)
Copy imports from the .ts file into the .derw file
Open the .ts and .derw file side by side, and rewrite the TypeScript into Derw
Add exports (for example generateTypeScript)
Run Derw compile from a released version of the compiler
In the original .ts file, import the export from the Derw file, renaming the function within the .ts file to avoid collision (for example generateTypeScriptOld)
Run tests
Once all tests pass, remove the original .ts file and rename the Derw file without the _derw suffix (e.g. src/generators/Ts.derw)
Commit the new Derw file, the deletion of the original .ts file, and the compiled Derw output file
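The rename-and-compare part of this workflow can be sketched in TypeScript. This is a simplified illustration under stated assumptions: the Const type is a hypothetical, cut-down AST node, and generateTypeScript is written inline as a stand-in for the function that would really be imported from the compiled Derw file, so that the sketch runs on its own.

```typescript
// Hypothetical, simplified AST node; the real compiler's types are richer.
type Const = { kind: "Const"; name: string; value: string };

// The original TypeScript implementation, renamed to avoid a collision
// with the function now provided by the Derw file.
function generateTypeScriptOld(node: Const): string {
    return "const " + node.name + " = " + node.value + ";";
}

// Stand-in for the export of the compiled Derw file. In the real workflow
// this would be an import rather than a local definition.
function generateTypeScript(node: Const): string {
    return ["const", node.name, "=", node.value].join(" ") + ";";
}

// Check that the rewrite produces the same output as the original
// for every sample input.
function outputsMatch(nodes: Const[]): boolean {
    return nodes.every(
        (node) => generateTypeScriptOld(node) === generateTypeScript(node)
    );
}

const samples: Const[] = [
    { kind: "Const", name: "x", value: "1" },
    { kind: "Const", name: "greeting", value: '"hello"' },
];

console.log(outputsMatch(samples));
```

Keeping the old function around under a different name means the existing tests exercise the new implementation, while the old one remains available for comparison until the .ts file is deleted.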
Derw has a large test suite which tests various language features for their compiled output in JavaScript, TypeScript and Derw, which makes ensuring that rewrites are consistent with the original quite easy. Likewise, being able to leverage the types that exist already reduces the overhead involved in rewriting files.
Committing the compiled Derw file helps ensure that if the compiler crashes for some reason, it’s still possible to just run the compiled version of the file to get the compiler working again. It’s a little messy since now you end up with both src/generators/Ts.derw and src/generators/Ts.ts, so at some point I may want to move those to their own build folder.
Current status & what’s next
Currently all generators have been moved to Derw, along with block parsing and name collisions. That leaves the parser itself, the lexer, type collisions, and tests. Everything apart from the tests will be translated to Derw. The tests are excluded because there are a lot of them, so if I figure out a way to nicely machine-translate them in the future I probably will, but otherwise Derw does not gain much by rewriting the tests.
Once everything planned has been translated to Derw, it will likely be the version 1.0.0 release. There are still some aspects I would like to improve, like better type inference and error messages, but they can come later.
So far I’m quite happy with the progress made, and it’s a great feeling when the Derw code compiles and works as intended. This is the first of my languages to be bootstrapped, and for all those aspiring compiler developers out there, I can highly recommend it.
Derw Changelog
do..return notation marks a function as async, and all items within the do..return block are awaited
The Promise type has been added to the global scope
ts-core uses Derw-compatible data structures
Keywords used as functions (e.g. promise.then) compile correctly
Template generated files have an extra newline appended to the end
Folders can be passed into compile via --files
Generated case..of code has type assertions to ensure that generated TypeScript is type-safe
The language server now shows Derw function and const type signatures in the editor, and shows Derw type definitions
if..else is supported as top level expression in do..return blocks
Additional parens in function calls are preserved when generating Derw output
Operators preserve parens
A script to keep track of translation progress has been added
Operators inside object literals are correctly parsed
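As an illustration of the progress-tracking item in the changelog, here is one way such a measurement could work. This is a guess at the approach, not the actual script: it assumes progress is measured by line count, and SourceFile and translationProgress are hypothetical names. Since compiled output sits next to its source (for example src/generators/Ts.ts beside src/generators/Ts.derw), the sketch skips any .ts file that has a .derw sibling so translated code is not counted twice.

```typescript
// Hypothetical record of one source file and its line count.
type SourceFile = { path: string; lines: number };

// Returns the percentage of hand-written lines that live in .derw files.
function translationProgress(files: SourceFile[]): number {
    // Remember which paths (minus extension) have a Derw source file.
    const hasDerwSource: Record<string, boolean> = {};
    for (const file of files) {
        if (file.path.slice(-5) === ".derw") {
            hasDerwSource[file.path.slice(0, -5)] = true;
        }
    }

    let derwLines = 0;
    let totalLines = 0;
    for (const file of files) {
        if (file.path.slice(-5) === ".derw") {
            derwLines += file.lines;
            totalLines += file.lines;
        } else if (file.path.slice(-3) === ".ts") {
            // A .ts file with a .derw sibling is compiled output: skip it.
            if (hasDerwSource[file.path.slice(0, -3)]) continue;
            totalLines += file.lines;
        }
    }

    // Multiply before dividing to avoid floating-point noise on round numbers.
    return totalLines === 0 ? 0 : (derwLines * 100) / totalLines;
}

// Made-up line counts, purely for demonstration.
const exampleFiles: SourceFile[] = [
    { path: "src/generators/Ts.derw", lines: 250 },
    { path: "src/generators/Ts.ts", lines: 400 }, // compiled output, ignored
    { path: "src/parser.ts", lines: 750 },
];

console.log(translationProgress(exampleFiles) + "% translated");
```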