MiniML: Writing a Functional Compiler from Scratch - Introduction

This series of blog posts are intented to teach you how to compile a ML-like language into machine executable code. The focus will be compilation rather than optimization. I’d be more than happy if they helped you in any way.

This post introduces MiniML, and covers routine lexing and parsing.

Motivation

There have been lots of resources (tutorials, blog posts, books etc) on how to write a compiler for an imperative language, like C or Java. As TA of compiler course I’ve written more than 3 compilers for C/Java-like languages. For imperative languages I recommend the good old “Tiger Book” (Modern Compiler Implementation in {C,Java}) and Monkey language books. But that’s not the case for functional languages, especially when you like to implement the compiler in a imperative language. Actually implementing an interpreter would be very simple in a functional language, as I remember ~300 sloc with some monad transformer would do.

So somehow I took the hard way to implement an compiler for ML in python, not C++ because python’s imperative enough and easy to write relatively. It’s helpful really as it helps you to understand what happens behind all the algebraic equations. The existing resources however did not help much

Existing work

One is Write you a Haskell. However it was started in 2015 and the last update seems to remain in 2015, and it covers only parsing and some introductory typing.

Another is Compiling a Functional Language. I discovered only recently (after I started this series) and I think it’s great.

The good old “SLPJ book” (The Implementation of Functional Programming Languages) seems a comprehensive guide, yet it’s mostly theory. You cannot write an imperative compiler just with it.

Actually I have only read until chapter 5 or 6 of the SLPJ book so correct me if wrong.

The output produced is a program on the abstract SECD machine, which you can interpret it with this interpreter or transpile them into C and then compile into native code.


The boiler plate: lexing and parsing

This part is quite irrelevant to MiniML, so I’ll just introduce it very briefly.

The goal of lexing and parsing is to convert MiniML source code into an abstract syntax tree. The compiler first sees MiniML source code by something like src = open('xxx', 'r').read(), which means type(src) is str. That’s bad because it does not have the structural information (e.g. which subexpressions constitute an expression …).

MiniML language is specified by context free grammar and then use the ANTLR parser generator to convert src into a syntax tree, usually called the concrete syntax tree to avoid mixing with abstract syntax tree. Concrete syntax tree usually contains distraction from the grammar that is irrelevant to MiniML program structure, and are hard to manipulate as I tried to experiment with ‘AST rewriting’ in MiniML.

The code between angles shows the process from lexing to constructing ast.

# main.py

def main(argv):
    try:
        global args
        args = doParseArgs(argv)
# >>>
        inputs = FileStream(args.infile)
        tokens = doLex(inputs)
        cst = doParse(tokens)
        ast = doConstructAST(cst)
# <<<
        ast = doNamer(ast)
        ast = doTyper(ast)
        ast = doPatMat(ast)
        ast = doDeBrujin(ast)
        secd = doSECD(ast)
        return 0
    except MiniMLError as e:
        if args.backtrace:
            raise e
        print(e, file=sys.stderr)
        return 1

I defined the generic AST functionality, especially the visitor mechanisms in ast.py and astnode.py.


Main References