The "I'm making a new programming language" checklist

Oct 10, 2023

You're making a new programming language. Perhaps it's inspired by your love of Python but you wish it could be faster. Maybe you work with Ruby but wish it had types. Or possibly, you just want to write a new language with a clear board of inspiration. It can be overwhelming to figure out exactly how to start. This blog post is intended as a checklist for a budding language creator, based on my experience from writing Derw, Gwe, and some other toy languages. I typically go for a no-library approach, avoiding things like parser combinators, as a personal choice.

Make some examples

If you've got an idea of what your language might look like, it's a lot easier to build tooling around it. Is it going to have lots of C-style braces? Or significant whitespace? These choices define your language - what it will look like, how the tooling will work, how it is to program in. There are bigger decisions that influence the language too. In particular, what the type system looks like will influence how complicated your type checking will be.

Start small, with a hello world that fits your language. Then build more examples - how should functions look? How do types look? Does your language have loop constructs? The more complicated the example, the less likely you are to run into edge cases where a parser for hello_world won't work for upload_files.

Your compile target will also influence how the language looks. A garbage collected language has implications for how complicated your generator needs to be, so perhaps you take a shortcut by targeting an intermediate language (transpiling), which will allow your language to have different structures.

Choose a language for the compiler

Any language can work for your compiler. I personally only use languages with sum types, so that the ASTs involved in writing a compiler can be represented at the type level. You would usually want to choose a language that produces fast binaries, so that end users of your language will have a good developer experience (DX). Distribution is important too - how easy is it to produce binaries for Windows? What about macOS on ARM? There are some languages that have libraries that are very beneficial for parsing and generating your output. Using these libraries can drastically reduce the amount of time you spend writing your compiler, so it's worth investigating them to see if they fit what you're after.

All that being said though, I would recommend choosing a language that you are either comfortable with, or a language that you have an interest in. You're going to be spending a lot of time with this language, debugging complicated bugs in complex structures. A language where you're familiar with the debugging process is going to reduce the amount of frustration you face as you're trying to figure out things like how to type check something.

Eventually you'll want to consider dog-fooding, the art of writing your compiler in your own language (also known as self-hosting). This has a number of benefits, like giving your language a big example project people can refer to, or forcing you to work on speeding up the compiler. But it also has downsides: it'll be slow, you won't be able to use libraries without reimplementing them, and your language needs to be stable enough to have a compiler written in it. It can be a very rewarding project if executed well.

Write a tokenizer

One of the typical elements of a compiler is a tokenizer (also called a lexer) - something that takes your raw source code and converts it into a list of tokens. These typically come down to keywords, strings, identifiers, parentheses, that type of basic symbol in your language. This list of tokens will then be passed on to the parser, which will take each token and attempt to construct expressions from them. A tokenizer typically goes character by character, with some state to represent things like whether you're inside a string or not - or whether a series of characters is an identifier or a number. Writing tests will help make sure that it's working correctly, which is very important for later on, so that you can be sure the tokens are being correctly collected before being passed to the more complicated parts of your code. A good test suite involves full example code in your programming language, so that you're not only testing tokens in isolation.
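A character-by-character tokenizer with a small amount of state could be sketched like this. The token kinds, the KEYWORDS and SYMBOLS sets, and the lack of escape sequences in strings are all simplifying assumptions for a made-up language, not rules taken from Derw.

```python
from dataclasses import dataclass

KEYWORDS = {"if", "else", "return"}  # hypothetical keywords
SYMBOLS = set("(){}[]=+-*/,:")       # hypothetical single-character symbols


@dataclass(frozen=True)
class Token:
    kind: str   # "keyword", "identifier", "number", "string" or "symbol"
    text: str
    index: int  # offset of the token's first character in the source


def tokenize(source: str) -> list[Token]:
    tokens: list[Token] = []
    state = "default"  # otherwise "string", "number" or "word"
    start = 0

    # The extra newline makes sure a number or word at the very end is flushed.
    for index, char in enumerate(source + "\n"):
        if state == "string":
            if char == '"':
                tokens.append(Token("string", source[start + 1:index], start))
                state = "default"
            continue
        if state == "number":
            if char.isdigit():
                continue
            tokens.append(Token("number", source[start:index], start))
            state = "default"
        elif state == "word":
            if char.isalnum() or char == "_":
                continue
            text = source[start:index]
            kind = "keyword" if text in KEYWORDS else "identifier"
            tokens.append(Token(kind, text, start))
            state = "default"

        # Back in the default state: decide what this character starts.
        if char == '"':
            state, start = "string", index
        elif char.isdigit():
            state, start = "number", index
        elif char.isalpha() or char == "_":
            state, start = "word", index
        elif char in SYMBOLS:
            tokens.append(Token("symbol", char, index))
        elif not char.isspace():
            raise SyntaxError(f"unexpected character {char!r} at index {index}")

    if state == "string":
        raise SyntaxError(f"string starting at index {start} has no closing quote")
    return tokens
```

Calling tokenize on a whole example file and comparing the result against an expected list of tokens is an easy way to build up the test suite described above.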

Thinking ahead to error messages, you will probably want to keep track of each token's index within the file - so that precise errors can be given to the user. In the tokenizer you'd not usually have many errors, other than things like when a string doesn't have a closing quote, but passing the index on to the parser allows the parser to remain unaware of the overall tokens yet return useful error messages.
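Users usually want a line and column rather than a raw index, so one option is to store only the index on each token and convert it when an error is reported. The sketch below assumes 0-based indexes into the source string and "\n" line endings; the function names and the message layout are illustrative only.

```python
def line_and_column(source: str, index: int) -> tuple[int, int]:
    """Convert a 0-based character index into a 1-based (line, column) pair."""
    line = source.count("\n", 0, index) + 1
    last_newline = source.rfind("\n", 0, index)  # -1 when on the first line
    column = index - last_newline
    return line, column


def format_error(source: str, index: int, message: str) -> str:
    """Build an error message that points at the offending character."""
    line, column = line_and_column(source, index)
    source_line = source.split("\n")[line - 1]
    pointer = " " * (column - 1) + "^"
    return f"{line}:{column}: {message}\n{source_line}\n{pointer}"
```

For example, format_error('x = 1\ny = "oops', 10, "string has no closing quote") reports position 2:5 and prints a caret under the opening quote.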

Write a parser

Next you're going to want to take those tokens, and construct an AST which represents your language. I usually divide this into two groups - blocks, and expressions. Blocks are your top level code constructs - functions, constants, imports, type definitions. Functions are then made up of expressions, such as function calls, literals, equality. In programming generally, an expression is something that results in a value. Assignment therefore wouldn't be part of the expressions group - but it might make sense in your parser to group it together with expressions, depending on your language. Parsers should typically be broken up into levels - talking in types, I use something like this:
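As an illustration of those levels, here is a small Python sketch rather than Derw's actual parser. The Ok/Err Result shape, the Token, Block and Expression types, the toy `name = value` grammar and the whitespace-splitting stand-in tokenizer are all assumptions made to keep the example short. What matters is the layering: a root parse function, a block parser and an expression parser, each returning a Result.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")


@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T


@dataclass(frozen=True)
class Err:
    error: str


Result = Union[Ok[T], Err]


@dataclass(frozen=True)
class Token:
    text: str
    index: int


@dataclass(frozen=True)
class Expression:
    kind: str  # "number" or "identifier"
    text: str


@dataclass(frozen=True)
class Block:
    name: str
    value: Expression


def tokenize_lines(source: str) -> list[list[Token]]:
    """Stand-in tokenizer: one token list per non-empty line, split on spaces."""
    lines, offset = [], 0
    for line in source.split("\n"):
        tokens, position = [], 0
        for word in line.split():
            position = line.index(word, position)
            tokens.append(Token(word, offset + position))
            position += len(word)
        if tokens:
            lines.append(tokens)
        offset += len(line) + 1
    return lines


def parse_expression(token: Token) -> Result[Expression]:
    if token.text.isdigit():
        return Ok(Expression("number", token.text))
    if token.text.isidentifier():
        return Ok(Expression("identifier", token.text))
    return Err(f"index {token.index}: expected a value, got {token.text!r}")


def parse_block(tokens: list[Token]) -> Result[Block]:
    if len(tokens) != 3 or tokens[1].text != "=":
        return Err(f"index {tokens[0].index}: expected `name = value`")
    value = parse_expression(tokens[2])
    if isinstance(value, Err):
        return value  # pass the error upstream unchanged
    return Ok(Block(tokens[0].text, value.value))


def parse(source: str) -> Result[list[Block]]:
    blocks = []
    for tokens in tokenize_lines(source):
        block = parse_block(tokens)
        if isinstance(block, Err):
            return block
        blocks.append(block.value)
    return Ok(blocks)
```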

Note that every step of the parsing chain involves a Result. If you're not familiar with the Result type, you can think of it as a type that represents either an error or a success. In the cases above, the error is returned as a string, whereas the success is returned as the construct they were parsing. Result typically has a bunch of helper functions defined - such as map and mapError, which allow succinct code to handle passing errors upstream to the root parse function. Here you'd use the index from the tokens to provide an error message as mentioned above. Later on, you'll want to provide an API that lets IDE extensions get a list of errors along with the index of each one, so keep that in mind.
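For readers who haven't used map and mapError, a minimal sketch of the two helpers follows. Python has no built-in Result, so the Ok and Err classes here are hand-rolled assumptions, and parse_number and parse_constant are hypothetical parsers used only to show the chaining.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar, Union

T = TypeVar("T")
U = TypeVar("U")


@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T

    def map(self, fn: Callable[[T], U]) -> "Ok[U]":
        return Ok(fn(self.value))  # transform the success value

    def map_error(self, fn: Callable[[str], str]) -> "Ok[T]":
        return self                # nothing to do for a success


@dataclass(frozen=True)
class Err:
    error: str

    def map(self, fn: Callable) -> "Err":
        return self                # errors pass through untouched

    def map_error(self, fn: Callable[[str], str]) -> "Err":
        return Err(fn(self.error))  # add context on the way up


Result = Union[Ok[T], Err]


def parse_number(text: str, index: int) -> Result[int]:
    if text.isdigit():
        return Ok(int(text))
    return Err(f"index {index}: expected a number, got {text!r}")


def parse_constant(name: str, text: str, index: int) -> Result[tuple[str, int]]:
    return (
        parse_number(text, index)
        .map(lambda number: (name, number))
        .map_error(lambda error: f"in constant {name!r}: {error}")
    )
```

The caller never has to branch on success or failure: the success path is wrapped by map, and the failure path picks up extra context through map_error as it travels towards the root parse function.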

Again a test suite that involves full code examples is very important - you want to be sure that any changes you make to the parser still parse old valid examples. Don't sit on testing. The sooner you have a good test suite, the sooner you can make large changes with confidence.

Writing generators

There are a couple of directions you could go at this point, but my favourite thing to get started with is generators. These are the parts of the compiler that take your parsed AST and produce some output. This could be binary, assembly, or some other language. Derw, for example, compiles to several other languages - with the main focus being on TypeScript and JavaScript, since Derw leverages those ecosystems. But if you're writing a systems language intended to be used like C, then you might want to go straight to assembly. Languages will typically build upon existing toolchains, like LLVM and GCC, in order to avoid writing a lot of it yourself. Some terminology is useful here: the two main parts of a compiler are the frontend and the backend. The frontend is everything discussed so far, excluding generators. The backend is the generator. You could therefore use LLVM as a backend, a way to generate native binaries from the frontend of your compiler.
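A generator that targets another language can start out as little more than a recursive function that turns each AST node into a string. The sketch below emits JavaScript from a tiny, hypothetical AST (Number, String, Call and Const are made-up node types). Using json.dumps to produce the string literal is a shortcut assumption that relies on JSON strings being valid JavaScript string literals.

```python
import json
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Number:
    value: int


@dataclass(frozen=True)
class String:
    value: str


@dataclass(frozen=True)
class Call:
    name: str
    args: tuple["Expression", ...]


Expression = Union[Number, String, Call]


@dataclass(frozen=True)
class Const:
    name: str
    value: Expression


def generate_expression(expr: Expression) -> str:
    if isinstance(expr, Number):
        return str(expr.value)
    if isinstance(expr, String):
        return json.dumps(expr.value)  # quotes and escapes the string
    if isinstance(expr, Call):
        args = ", ".join(generate_expression(arg) for arg in expr.args)
        return f"{expr.name}({args})"
    raise TypeError(f"cannot generate code for {expr!r}")


def generate_block(block: Const) -> str:
    return f"const {block.name} = {generate_expression(block.value)};"


def generate(program: list[Const]) -> str:
    return "\n".join(generate_block(block) for block in program)
```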

A great generator to get started with is your own language. That is to say, a compiler that takes Derw code and generates Derw. This can be heavily tested to ensure that your internal representation produces the same output as what you feed in, but there's also the super nice benefit of a self-producing generator being all you need to support code formatting.
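The round trip can be tested directly: parse some source, generate it back out, and check that nothing was lost. The toy language below, where every non-empty line is `name = value`, is an assumption chosen to keep the sketch tiny. The same shape of test applies to a real parser and generator.

```python
def parse(source: str) -> list[tuple[str, str]]:
    """Toy parser: every non-empty line must look like `name = value`."""
    program = []
    for line in source.splitlines():
        if line.strip():
            name, value = line.split("=", 1)  # raises ValueError without "="
            program.append((name.strip(), value.strip()))
    return program


def generate(program: list[tuple[str, str]]) -> str:
    """Generator that targets the same toy language it was parsed from."""
    return "".join(f"{name} = {value}\n" for name, value in program)


def format_source(source: str) -> str:
    """A formatter is just parse followed by generate."""
    return generate(parse(source))
```

Two properties are worth asserting in tests: parsing the formatted output gives the same AST as parsing the original, and formatting already-formatted code changes nothing.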

CLI

A good CLI is fast and helpful. It should do the obvious thing where possible. What counts as the obvious thing is open to interpretation - I would recommend playing around with compilers in various languages to get a feel for how they work. I like to use flags with helpful error messages that specify how the program should be compiled and where it should be output (e.g. formatting, build directories, stdout). A sane default is that calling your compiler with no args inside a project directory should compile that project. When you're working on your first bits of code in your own language, you'll want to run the compiler on each change - so I suggest implementing some kind of filesystem watching early on, one that runs the compiler on each file change.
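A rough sketch of such a CLI using Python's standard argparse module follows. The program name mylang, the .mylang extension, the flag names and the default build directory are all hypothetical. The watcher is a deliberately simple polling approach based on modification times rather than a proper filesystem notification API.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="mylang", description="Compile mylang source files."
    )
    parser.add_argument("files", nargs="*", type=Path,
                        help="files to compile (default: the whole project)")
    parser.add_argument("--output", type=Path, default=Path("build"),
                        help="directory to write generated files to")
    parser.add_argument("--format", action="store_true",
                        help="format the source files instead of compiling")
    parser.add_argument("--stdout", action="store_true",
                        help="print generated code instead of writing files")
    parser.add_argument("--watch", action="store_true",
                        help="recompile whenever a source file changes")
    return parser


def files_to_compile(args: argparse.Namespace) -> list[Path]:
    """With no files given, fall back to every source file in the project."""
    return args.files or sorted(Path(".").rglob("*.mylang"))


def changed_files(paths: list[Path], seen: dict[Path, float]) -> list[Path]:
    """Return the paths whose modification time differs from the last call."""
    changed = []
    for path in paths:
        modified = path.stat().st_mtime
        if seen.get(path) != modified:
            seen[path] = modified
            changed.append(path)
    return changed
```

A watch loop would then call changed_files every second or so and recompile whatever it returns.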

A compiler will typically follow all the imports from an entry point file or directory, and compile those files too. This means that you need to write your I/O code to keep in mind that you'll be dealing with multiple files in most cases. Produce clean messages that inform users about each step - what files are being read, what their compile success was, and where they're being written to. This will be helpful in debugging - later on, you may want to put this logging behind a --verbose flag, but early on having a visible log of what's happening can help debug the CLI.
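Following imports from an entry point is a small worklist algorithm. In the sketch below the `import "path"` syntax is made up, import paths are assumed to be already resolved, and reading a file is passed in as a function so that the example doesn't depend on the filesystem.

```python
import re
from typing import Callable

# Hypothetical import syntax: a line that starts with `import "some/path"`.
IMPORT_PATTERN = re.compile(r'^import\s+"([^"]+)"', re.MULTILINE)


def collect_sources(
    entry: str,
    read_source: Callable[[str], str],
    verbose: bool = False,
) -> dict[str, str]:
    """Read the entry file and everything it imports, each file only once."""
    sources: dict[str, str] = {}
    pending = [entry]
    while pending:
        path = pending.pop()
        if path in sources:
            continue  # already read; this also protects against import cycles
        if verbose:
            print(f"Reading {path}")
        sources[path] = read_source(path)
        pending.extend(IMPORT_PATTERN.findall(sources[path]))
    return sources
```

In a real CLI, read_source could be something like lambda path: Path(path).read_text(), and the verbose output is the kind of log that might later move behind a --verbose flag.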

If you're using a library for argument parsing, there are generally only two difficult considerations: should you use threads for running the compiler? And how should all the paths be handled? Threads can make sense, depending on your language, to load all the files and write the output concurrently. Paths are just generally weird. Joining together paths to create output directories should be done carefully so that you aren't writing to files in the wrong place.
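One way to be careful with output paths is to compute each output location relative to a known source root, and refuse anything that falls outside of it. This is a sketch using pathlib. The directory layout, the .mylang extension and the function name are assumptions.

```python
from pathlib import Path


def output_path(
    source_root: Path, output_root: Path, source_file: Path, extension: str
) -> Path:
    """Mirror source_file's location under source_root into output_root.

    Raises ValueError if source_file is not inside source_root, so a stray
    "../" can never cause a write outside the output directory.
    """
    relative = source_file.resolve().relative_to(source_root.resolve())
    return (output_root / relative).with_suffix(extension)
```

With these assumptions, src/utils/list.mylang compiled to JavaScript lands at build/utils/list.js, while src/../secret.mylang is rejected.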

Type checking

Type checking is going to depend a lot on your language. Is it dynamically typed? Does it have type inference? These things will influence how complicated your type system is. For languages which are strongly typed with type inference, I like to define an infer function for each of the expressions - passing in the expected type for cases where it's not clear what to infer (for example, an empty list). How you represent types is up to you - but I recommend at least splitting them into generic and fixed types. If you have typeclasses or traits, you'll need some way to represent them too. You'll want to load all imported files into memory - so that you can get the type for a function in another file. Producing useful error messages is important here. Explain how you inferred a particular type and how the user can correct their code to get the right type.
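The shape of an infer function with an expected type might look like the sketch below. The FixedType and GenericType classes, the three literal expressions and the error wording are all hypothetical, and a single infer function dispatches on the expression kind here instead of one function per expression.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class FixedType:
    name: str  # e.g. "number" or "string"


@dataclass(frozen=True)
class GenericType:
    name: str    # e.g. "List"
    args: tuple  # the types it is parameterised over


Type = Union[FixedType, GenericType]


@dataclass(frozen=True)
class NumberLiteral:
    value: int


@dataclass(frozen=True)
class StringLiteral:
    value: str


@dataclass(frozen=True)
class ListLiteral:
    items: tuple


Expr = Union[NumberLiteral, StringLiteral, ListLiteral]


def infer(expr: Expr, expected: Optional[Type] = None) -> Type:
    if isinstance(expr, NumberLiteral):
        return FixedType("number")
    if isinstance(expr, StringLiteral):
        return FixedType("string")
    if isinstance(expr, ListLiteral):
        expects_list = isinstance(expected, GenericType) and expected.name == "List"
        item_expected = expected.args[0] if expects_list else None
        if not expr.items:
            # Nothing to look at, so the expected type is the only source of truth.
            if item_expected is None:
                raise TypeError(
                    "cannot infer the item type of an empty list; "
                    "add a type annotation"
                )
            return GenericType("List", (item_expected,))
        item_types = [infer(item, item_expected) for item in expr.items]
        for found in item_types[1:]:
            if found != item_types[0]:
                raise TypeError(
                    f"list items must share one type: the first item is "
                    f"{item_types[0]} but a later item is {found}"
                )
        return GenericType("List", (item_types[0],))
    raise TypeError(f"cannot infer a type for {expr!r}")
```

This sketch only uses the expected type to fill gaps. A fuller checker would also compare the inferred type against the expected one and report a mismatch.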

Editor tooling

A nice touch for anyone using your language will be editor tooling. Syntax highlighting goes a long way to make someone more comfortable with your language. You'll also want to provide compiler error messages, and perform auto-formatting of code. The modern way of achieving these is through the Language Server Protocol (LSP). It's quite big, so I suggest finding an existing LSP implementation and copying the parts that you want to implement. This way your tooling will be available to multiple editors without needing to write much editor-specific code.
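To give a feel for what the protocol looks like on the wire, the sketch below turns a compiler error (an index plus a message) into an LSP diagnostic and frames it as a textDocument/publishDiagnostics notification. The message shape and the Content-Length framing follow the LSP specification. The function names, the "mylang" source label and the single-line range are assumptions.

```python
import json


def to_diagnostic(source: str, index: int, length: int, message: str) -> dict:
    """Convert a 0-based source index into an LSP diagnostic.

    LSP positions use 0-based lines and characters. Characters are counted in
    UTF-16 code units by default, which this sketch ignores (it is only exact
    for text without characters outside the Basic Multilingual Plane).
    """
    line = source.count("\n", 0, index)
    character = index - (source.rfind("\n", 0, index) + 1)
    return {
        "range": {
            "start": {"line": line, "character": character},
            "end": {"line": line, "character": character + length},
        },
        "severity": 1,  # 1 means Error in the LSP specification
        "source": "mylang",
        "message": message,
    }


def publish_diagnostics(uri: str, diagnostics: list[dict]) -> bytes:
    """Frame a publishDiagnostics notification as sent from server to editor."""
    body = json.dumps({
        "jsonrpc": "2.0",
        "method": "textDocument/publishDiagnostics",
        "params": {"uri": uri, "diagnostics": diagnostics},
    }).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body
```

This is where keeping token indexes and a list of errors in the parser pays off: the language server only has to translate them.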

You may also want to tell GitHub to treat your files as a similar existing language, so that it will render them with close enough syntax highlighting. This can be done by setting associations in your .gitattributes file.

Language documentation

Once you've created your language, you'll want to document it. A great way to do this for free is with GitBook. Let users know how to install your language and get editor tooling. Let them know where to report bugs, or ask questions. Show examples of how to do common things, and explain how syntax in your language works. If your language is transpiling to something like JavaScript, a nice touch can be to show how different code compiles, so that users can understand the implications of their code. GitBook works just like Git - you make changes, then commit them. Readers are also able to open pull requests to fix your documentation or add some clarification, saving you some work.

A community

Users will no doubt have questions about how things work. It's up to you to choose where they ask those questions. Traditionally that would be on GitHub issues or mailing lists, but for more interactive support you could consider creating something like a Discord/Slack/Zulip server. Keep in mind that chat applications can be hard to Google, so if you want there to be a place people land on when they Google "Error parsing function, token undefined", you might want to open a GitHub issue for that particular problem after you've helped the person on chat. That way the next person can Google the problem and find the solution without talking to you. Which chat network is best is a big debate, but I've found that the network effect comes into play - people are more willing to use a chat application if they're already using it for work or other communities.

To get users into your language, you'll want to talk about it. You'll want to talk about it a lot. Speak at local conventions, write blog posts, answer posts on forums. Don't spam it though. Your mentions should be context aware. Does your language solve a problem? Write about it. Is there something novel about your language? Give a talk at a local meetup. Building this audience is difficult and takes a long time. Finding some super users early on can help, as they spread the word and create examples for others to work from.

Packages

As your community grows, you'll want to enable them to write packages and libraries that can be shared with other people. This is a huge topic, with no easy answers. Early on it's probably enough to use Git to fetch packages within a project - later, having a dedicated package server will help your users find the packages they're after. Security and safety are vitally important to package hosting, so this is a problem where you'll want to spend a lot of time carefully considering the options.

Once you have some packaging solution, you'll want to think about your standard library. Some core structures like lists and optionals will want to be as close to the user as possible, whereas you may want an HTTP server in a separate package. To identify what belongs where, think about how you intend your language to be used. If it's meant as a client side rendering platform, having a virtual DOM or HTML library be part of the standard library makes sense. If it's a strongly typed language without exceptions or null, then having optionals in the standard library makes sense. These days a testing library or framework really ought to be part of your standard library - it's a huge undertaking to write a fully capable test framework from scratch if you include things like mocking, but start with the basics and add things as you need them.

Finally

Best of luck in your language creating adventures! It's a fun exercise that gets you experienced in a wide range of technology and algorithms. There are a lot of languages out there, so don't expect anything big to come from yours - just enjoy the process of developing it.
