Compiling Interpreted Languages

I think a few years back I'd have been confused by the concept of compiling JavaScript. I would have assumed that compiled and interpreted languages were fundamentally different beasts. Reading SICP taught me that, on a deep level, a program in any language can be implemented correctly via compilation or interpretation. I started to think of programs as descriptions of desired behaviour, given a particular language definition.

Given that programs are descriptions, we can fulfil them however we like! We can either use the program text as runtime input to an existing host-native program that interprets it, or compile it into an equivalent program in another language.[^1]

[^1]: Still further, we could mix both, interpreting and compiling - as interpreters with just-in-time (JIT) compilers do.

Let's demonstrate this with a simple language that has two operations, VAR and PRINT, and strings as its only data type. Here's a program in that language:

VAR name "Melody"
VAR name2 name
VAR name "Ada"
PRINT "Hello" name name2

Running this program through a correct implementation of our spec should output "Hello Ada Melody".

As with most languages, parts of this specification rely on operations provided as primitives by the host language or execution environment. Our PRINT operation cannot be defined within the language itself; we'll require support from our hosting interpreter or compiler, which in turn will use primitive operations provided by the target machine.

Here's a spec for the operations:

VAR binds an identifier to a value, where a value is either a string literal or another identifier, which evaluates to its current binding. Re-binding an existing identifier is allowed; referencing an identifier that has never been bound is an error. PRINT evaluates each of its values and outputs them on one line, separated by single spaces.

Here's a rough grammar definition:

program = operation+
operation = var | print
var = VAR identifier value
print = PRINT value+

value = string | identifier
identifier = \w+
string = " [^"]+ "

We can implement this spec with either an interpreter or a compiler. Our interpreter will take in the program text at runtime, analyse it, and call methods in our host environment to fulfil the spec. Our compiler will take the program text and output an equivalent program in another language that, when run, will perform the behaviour described by the program.

Parsing

First we'll want to transform our program's text into a data structure we can analyse - a syntax tree. This is called parsing - a necessary process for any interpreter or compiler. Our language is simple, and this means our parser is too:

function parse(text) {
  // skip blank lines (e.g. a trailing newline) so each remaining
  // line is exactly one operation
  return text.split('\n').filter(line => line.trim()).map(operationParse)
}

function operationParse(text) {
  const [_, type, expression] = /(VAR|PRINT) (.+)$/.exec(text)
  return {type, operands: operandParse(expression)}
}

function operandParse(text) {
  return text.match(/"[^"]+"|\w+/g).map(tokenParse)
}

function tokenParse(token) {
  return token.startsWith('"')
    ? {type: 'string', text: token.slice(1, -1)}
    : {type: 'identifier', id: token}
}

This'll give us the following tree for the above program:

[{"type":"VAR","operands":[{"type":"identifier","id":"name"},{"type":"string","text":"Melody"}]},{"type":"VAR","operands":[{"type":"identifier","id":"name2"},{"type":"identifier","id":"name"}]},{"type":"VAR","operands":[{"type":"identifier","id":"name"},{"type":"string","text":"Ada"}]},{"type":"PRINT","operands":[{"type":"string","text":"Hello"},{"type":"identifier","id":"name"},{"type":"identifier","id":"name2"}]}]

Interpreter

Next let's write an interpreter for it. We're leaning as heavily as we can on our host environment - JavaScript - to implement our spec. For instance, we can use a plain JS object to implement variable assignment and re-assignment, since its behaviour matches our spec, and console.log happens to implement the space separation we require too!

function interpretProgram(ast) {
  const env = {};
  ast.forEach(n => interpretOperation(env, n))
}

function interpretOperation(env, node) {
  switch (node.type) {
    case 'VAR': return assign(env, node);
    case 'PRINT':
      return console.log(...node.operands.map(n => interpretNode(env, n)))
    default: throw Error(`unknown node ${node.type}`)
  }
}

function interpretNode(env, node) {
  switch (node.type) {
    case 'identifier': return lookup(env, node.id);
    case 'string': return node.text
    default: throw Error(`unknown node ${node.type}`)
  }
}

function assign(env, {operands: [name, value, ...err]}) {
  if (err.length > 0)
    throw Error('Too many arguments to operation VAR')
  if (name.type !== 'identifier')
    throw Error('Name in VAR must be an identifier')
  return env[name.id] = interpretNode(env, value)
}

function lookup(env, id) {
  if (!(id in env)) throw Error(`Undeclared variable ${id}`)
  return env[id]
}

We can run our program as follows, where sourceCode holds the program text from above:

interpretProgram(parse(sourceCode))  // logs: Hello Ada Melody

Compiler

We can see the symmetries between compilation and interpretation by translating our interpreter directly into a compiler. Rather than performing operations immediately, we'll now output equivalent program text in our target language - whether that's machine code, or a higher-level language like C or Go.

The choice of target language will decide how much we need to implement ourselves. For instance, targeting C lets us rely on C's existing function-call mechanism, which handles short-lived data on the stack. Conversely, targeting LLVM IR, assembly, or machine code would require us to do far, far more.

We're going to target Go, again leaning on what it provides as much as we can. Implementing our VAR operation is easy: since our language only supports strings as values, our environment can simply be a map[string]string - a map with string keys and values. Its semantics match those of our spec, so we may as well keep things simple.

function compileProgram(ast) {
  const env = {} // our compiler is too simple to need a compile-time environment, but let's keep the function signatures equivalent
  return `package main

  import "fmt"

  func main() {
      env := make(map[string]string)
      ${ast.map(n => compileOperation(env, n)).join('\n')}
  }`
}

function compileOperation(env, node) {
  switch (node.type) {
    case 'VAR': return compileAssign(env, node);
    case 'PRINT':
      return `fmt.Println(${node.operands.map(n => compileNode(env, n)).join(', ')})`
    default: throw Error(`unknown node ${node.type}`)
  }
}

function compileNode(env, node) {
  switch (node.type) {
    case 'identifier': return compileLookup(env, node.id);
    case 'string': return `"${node.text}"`
    default: throw Error(`unknown node ${node.type}`)
  }
}

function compileAssign(env, {operands: [name, value, ...err]}) {
  if (err.length > 0) 
    throw Error('Too many arguments to operation VAR')
  if (name.type !== 'identifier')
    throw Error('Name in VAR must be an identifier')
  return `env["${name.id}"] = ${compileNode(env, value)}`
}

function compileLookup(_env, id) {
  return `env["${id}"]`
}

This would compile our program above into:

package main

import "fmt"

func main() {
  env := make(map[string]string)
  env["name"] = "Melody"
  env["name2"] = env["name"]
  env["name"] = "Ada"
  fmt.Println("Hello", env["name"], env["name2"])
}

Running this on the Go playground gives us the expected output!
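
Outside the playground, we could glue this together with a small driver that writes the generated source to disk and hands it to the Go toolchain. Here's a minimal sketch - the file name and the use of execSync are illustrative choices, not part of the compiler:

const fs = require('fs')
const { execSync } = require('child_process')

// compile the source program to Go and write it out
fs.writeFileSync('main.go', compileProgram(parse(sourceCode)))

// run the generated program (assumes the Go toolchain is installed)
console.log(execSync('go run main.go').toString())  // Hello Ada Melody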

Runtime vs compile time

Languages do differ in how much work must happen at runtime, but that doesn't prevent us from compiling large swathes of them.

For an example of some behaviour that mandates runtime support, consider:

const fs = require('fs')

const name = fs.readFileSync('./property', { encoding: 'utf8' })
const fn = global[name]
console.log(global[fn()])

You can see here that the function to be called (fn) depends on input from outside the program, and the property we finally read from global depends on that function's return value. How much of such a dynamic program could we meaningfully compile?

One immediate benefit of compiling our example program would be avoiding re-parsing it on every run. The text of the program doesn't change at runtime, after all.
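
Even without a full compiler, we can bank that saving by parsing once and caching the syntax tree. A sketch, with an illustrative cache file name:

const fs = require('fs')

// first run: parse and cache the syntax tree
fs.writeFileSync('program.ast.json', JSON.stringify(parse(sourceCode)))

// later runs: load the cached tree and skip parsing entirely
const ast = JSON.parse(fs.readFileSync('program.ast.json', 'utf8'))
interpretProgram(ast)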

Beyond that, we can optimise the program itself, trading more work at compile time for less work at runtime. These can be the same kinds of manipulation we could perform manually. For instance, in the example below we can see - even without the variables being declared const - that multiple does not vary across iterations:

let a = 10
let b = 20
let sum = 0;
for(const x of xs) {
  let multiple = a + b;
  sum += x * multiple
}

So we (or a compiler acting on our behalf) could rewrite it into this equivalent but more efficient code:

let sum = 0;
for(const x of xs) {
  sum += x * 30
}
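
Here's a toy sketch of the compile-time evaluation involved, folding additions of compile-time constants in a hypothetical expression AST (the node shapes are illustrative, not part of our VAR/PRINT language):

// nodes: {type: 'num', value} | {type: 'add', left, right} | {type: 'var', name}
function fold(node) {
  if (node.type !== 'add') return node
  const left = fold(node.left)
  const right = fold(node.right)
  // both operands are known at compile time, so do the addition now
  if (left.type === 'num' && right.type === 'num')
    return {type: 'num', value: left.value + right.value}
  return {type: 'add', left, right}
}

fold({
  type: 'add',
  left: {type: 'num', value: 10},
  right: {type: 'num', value: 20},
})  // → {type: 'num', value: 30}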

Beyond that, many optimisations work below the level we as language users have access to, implementing the same behaviour or semantics more efficiently. For instance, we can make variable access in JavaScript involve fewer branches by compiling in the exact lexical scope a particular variable is written to or read from. Consider:

function A() {
   let count = 0
   return function B() {
      count += 1
      return count
   }
}

We can tell at compile time that the count variable exists in A's scope. A faithful runtime implementation of variable lookup in JavaScript - which could form part of a JS interpreter or of a runtime library written in JS - could look like this:

function lookup(env, property) {
  while(env) {
    if(env.has(property)) {
       return env.get(property)
    }
    env = env.parent
  }
  return undefined
}
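
To see it at work, we can model each lexical scope as a Map with a link to its parent scope - an assumed representation, chosen to match the lookup function above:

// A's scope holds count; B's scope is empty but chains up to A
const envA = new Map([['count', 0]])
envA.parent = null
const envB = new Map()
envB.parent = envA

lookup(envB, 'count')  // misses in envB, walks up to envA → 0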

However, since we know count exists in A's scope, we can skip the search entirely: rather than calling our lookup function at runtime, we compile the program to access the variable in the correct scope deterministically:

envB.parent.get('count')

Interestingly, this is precisely the type of optimisation just-in-time compilers (JITs) perform - another reminder that dynamic languages can profit greatly from compilation. There's nothing theoretically stopping us from doing lots of work at runtime in compiled implementations, or lots of compilation in interpreted ones. These decisions are trade-offs, and where we sit between the two extremes determines how many of the pros and cons of each approach we take on.

In my next post, I'll outline the high-level architectural approach I'll take in writing my JS to C compiler.