cj: Making a minimal, complete JIT

Seven years ago, I had an idea for a JIT compiler in C for x86-64 with a completely autogenerated backend, built from machine-readable information about the instructions. Sounds cool, right?

I immediately built the hand-written parts: a byte buffer I could turn into an executable function, plus all the infrastructure needed for two instructions, NOP and RET, the minimum I needed to prove it worked. Then I found a JS library that contains the instruction information I needed, covering all of the x86 ISA (thanks, asmjit!). It would work with Node.js; no problem, I could just write an emitter script!

Then I abandoned the project. I got a bit of writer’s block, priorities shifted, and I ignored the idea for a few years, until last month. I was moving repositories around and the project resurfaced. I realized my writer’s block was gone and I was actually quite eager to attack it again. So I did.

Over about two weeks, I reworked the project to semi-autogenerate the x86 ISA (the code is still full of warts and special cases, because I had trouble finding the right abstraction at first and just kept going). I then found mra_tools, which provides a machine-readable spec for ARM64, so I also added a backend generator for it (full disclosure: I had an LLM write the specification extractor for me, because I couldn’t figure out the project layout quickly and figured it would be easy enough; I also let it write code comments for me, and they’re probably hilariously wrong). The JS code is admittedly brutally messy, but hopefully one doesn’t have to touch it all that much anymore (famous last words)!
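To make "autogenerated backend" a little more concrete, here is a minimal sketch of the idea for the two instructions mentioned earlier, NOP and RET. Everything in it (`insn_desc`, `insn_table`, `code_buf`, `emit`, `emit_nop`, `emit_ret`) is a hypothetical name made up for illustration and not cj's actual code; real x86 entries also need operand and ModRM handling, which is where the warts and special cases come from.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; this is not cj's real code. */

/* One row of machine-readable instruction data: a mnemonic and its fixed
   encoding. Real x86 entries also need operand and ModRM information. */
typedef struct {
  const char *mnemonic;
  uint8_t bytes[4];
  size_t len;
} insn_desc;

/* The kind of table a generator script could print from an ISA database. */
static const insn_desc insn_table[] = {
  {"nop", {0x90}, 1},  /* x86-64 NOP */
  {"ret", {0xC3}, 1},  /* x86-64 near RET */
};

/* Hand-written part: a fixed-size byte buffer that collects the code. */
typedef struct {
  uint8_t data[64];
  size_t len;
} code_buf;

/* Appends the encoding of `mnemonic`; returns 0 on success, -1 if the
   mnemonic is unknown or the buffer is full. */
static int emit(code_buf *buf, const char *mnemonic) {
  for (size_t i = 0; i < sizeof insn_table / sizeof insn_table[0]; i++) {
    const insn_desc *d = &insn_table[i];
    if (strcmp(d->mnemonic, mnemonic) != 0) continue;
    if (buf->len + d->len > sizeof buf->data) return -1;
    memcpy(buf->data + buf->len, d->bytes, d->len);
    buf->len += d->len;
    return 0;
  }
  return -1;
}

/* Wrappers like these are what a generator would print, one per instruction. */
static int emit_nop(code_buf *buf) { return emit(buf, "nop"); }
static int emit_ret(code_buf *buf) { return emit(buf, "ret"); }
```

Calling `emit_nop` and then `emit_ret` on a zero-initialized `code_buf` leaves the two bytes `0x90 0xC3` in the buffer. The point of the split is that only `code_buf` and `emit` are written by hand, while the table and the wrappers can be regenerated whenever the instruction data changes.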

Anyway, what we ended up with looks somewhat like this:

// x86-64 version of a triangular number sum
#include <stdio.h>
#include "ctx.h"
#include "op.h"

int main(void) {
  cj_ctx* cj = create_cj_ctx();
  cj_operand rax = cj_make_register("rax");
  cj_operand rdi = cj_make_register("rdi");  // n (argument)
  cj_operand rcx = cj_make_register("rcx");  // loop counter
  cj_operand zero = cj_make_constant(0);
  cj_operand one = cj_make_constant(1);

  cj_mov(cj, rax, zero);         // sum = 0
  cj_mov(cj, rcx, one);          // i = 1
  cj_label loop = cj_create_label(cj);
  cj_label done = cj_create_label(cj);
  cj_mark_label(cj, loop);
  cj_cmp(cj, rcx, rdi);          // if (i > n) break;
  cj_jg(cj, done);
  cj_add(cj, rax, rcx);          // sum += i;
  cj_add(cj, rcx, one);          // ++i;
  cj_jmp(cj, loop);              // loop;
  cj_mark_label(cj, done);
  cj_ret(cj);                    // result already in rax

  // 64-bit types, so the argument fills all of rdi and the result all of rax
  typedef long (*tri_fn)(long);
  tri_fn fn = (tri_fn)create_cj_fn(cj);
  printf("tri(10) = %ld\n", fn(10));  // prints 55

  destroy_cj_fn(cj, (cj_fn)fn);
  destroy_cj_ctx(cj);
  return 0;
}

Already quite good! We can basically hand-roll assembly at run time, turn it into a function, and jump in. Here are some example programs. Under the hood, we write the encoded instructions to a buffer, mark it executable using mprotect, and jump into it.
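The write, protect, jump sequence can be sketched without cj at all. The example below is a minimal illustration, assuming Linux on x86-64 or AArch64 and a GCC or Clang compiler; `run_answer`, `const_fn`, and `code` are hypothetical names, and the machine code is a hard-coded function that returns 42 rather than anything cj generated.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*const_fn)(void);

#if defined(__x86_64__)
/* mov eax, 42 ; ret */
static const uint8_t code[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3};
#elif defined(__aarch64__)
/* mov w0, #42 ; ret */
static const uint8_t code[] = {0x40, 0x05, 0x80, 0x52,
                               0xC0, 0x03, 0x5F, 0xD6};
#else
#error "this sketch only carries encodings for x86-64 and AArch64"
#endif

/* Hypothetical helper, not cj's API: copies the bytes into a fresh mapping,
   flips it from writable to executable, calls it, and unmaps it again.
   Returns -1 if the mapping could not be set up. */
static int run_answer(void) {
  size_t len = sizeof code;
  void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) return -1;

  /* 1. Write the machine code into the buffer. */
  memcpy(mem, code, len);

  /* 2. Mark the buffer executable (and no longer writable). */
  if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
    munmap(mem, len);
    return -1;
  }
  /* Instruction-cache flush: required on AArch64, harmless on x86-64.
     This is a GCC/Clang builtin. */
  __builtin___clear_cache((char *)mem, (char *)mem + len);

  /* 3. Jump into it by calling the buffer as a function. */
  const_fn fn = (const_fn)mem;
  int result = fn();

  munmap(mem, len);
  return result;
}
```

Calling `run_answer()` from `main` should yield 42 on the platforms listed. Keeping the mapping writable first and executable afterwards, rather than both at once, is a common choice because some hardened systems refuse pages that are writable and executable at the same time; whether cj does exactly this is not something this sketch claims.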

There are some limitations, however. For one, it’s not portable yet. The backend is detected automatically, but the register names and most of the instructions still differ between backends, just as they do when writing assembly by hand. We also might have to write our own function prologues, depending on the architecture.
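To make the register-name problem concrete, here is a small sketch of the kind of per-architecture lookup a caller would need today. The names `arch_id`, `reg_names`, `detect_arch`, and `regs_for` are hypothetical and not part of cj; the register choices follow the System V AMD64 and AAPCS64 calling conventions, so Windows on x86-64 is deliberately left out.

```c
#include <stddef.h>

typedef enum { ARCH_X86_64, ARCH_AARCH64, ARCH_UNKNOWN } arch_id;

/* Register names a small leaf function typically needs. */
typedef struct {
  const char *arg0;     /* first integer argument */
  const char *ret;      /* integer return value */
  const char *scratch;  /* caller-saved temporary */
} reg_names;

/* Compile-time detection of the architecture we are building for. */
static arch_id detect_arch(void) {
#if defined(__x86_64__)
  return ARCH_X86_64;
#elif defined(__aarch64__)
  return ARCH_AARCH64;
#else
  return ARCH_UNKNOWN;
#endif
}

/* Maps an architecture to concrete register names. */
static reg_names regs_for(arch_id arch) {
  switch (arch) {
    case ARCH_X86_64:   /* System V AMD64 */
      return (reg_names){"rdi", "rax", "rcx"};
    case ARCH_AARCH64:  /* AAPCS64: x0 is both first argument and result */
      return (reg_names){"x0", "x0", "x9"};
    default:
      return (reg_names){NULL, NULL, NULL};
  }
}

/* With cj, the names would then be passed on, for example:
     reg_names r = regs_for(detect_arch());
     cj_operand n = cj_make_register(r.arg0);
   This only solves naming; the instructions themselves still differ. */
```

Even with such a table, the instruction calls would still have to be written once per backend, which is the gap a higher-level layer has to close.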

So I decided to at least add a few very simple “high-level” abstractions, to show how the JIT compiler could be built out to allow for backend-independent code:

#include <stdio.h>
#include "builder.h"

typedef int (*sum_fn)(int);

int main(void) {
  cj_ctx* cj = create_cj_ctx();
  cj_builder_frame frame;
  cj_builder_fn_prologue(cj, 0, &frame);

  cj_operand n = cj_builder_arg_int(cj, 0);
  cj_operand sum = cj_builder_scratch_reg(0);
  cj_operand i = cj_builder_scratch_reg(1);
  cj_operand one = cj_make_constant(1);

  cj_builder_assign(cj, sum, cj_builder_zero_operand());

  cj_builder_for_loop loop = cj_builder_for_begin(cj, i, one, n, one, CJ_COND_GE);
  cj_builder_add_assign(cj, sum, i);
  cj_builder_for_end(cj, &loop);

  cj_builder_return_value(cj, &frame, sum);

  sum_fn fn = (sum_fn)create_cj_fn(cj);
  printf("tri(10) = %d\n", fn ? fn(10) : -1);

  destroy_cj_fn(cj, (cj_fn)fn);
  destroy_cj_ctx(cj);
  return 0;
}

As you can see, this code operates at a much higher level of abstraction! We have for-loops, and calling conventions and register names no longer concern us. That said, it’s still quite low-level: you own the full lifecycle, and you’ll probably have to wrangle assembly for anything non-trivial.
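Whatever a for-loop helper does internally, it has to end up producing the same label, compare, and jump skeleton as the hand-written x86-64 listing. As a rough mental model, here is that skeleton in plain C with `goto`, annotated with the instruction each line corresponds to. The function name `tri_lowered` is made up, and the `i > n` exit test is taken from the hand-written version; how the builder maps its `CJ_COND_GE` argument onto the exit test is not shown in this post, so this is not a claim about what `cj_builder_for_begin` emits.

```c
/* Plain-C mirror of the hand-written loop: one label at the top, one at the
   exit, a compare-and-branch out, and an unconditional jump back. */
static long tri_lowered(long n) {
  long sum = 0;            /* mov rax, 0 */
  long i = 1;              /* mov rcx, 1 */
loop:
  if (i > n) goto done;    /* cmp rcx, rdi ; jg done */
  sum += i;                /* add rax, rcx */
  i += 1;                  /* add rcx, 1 */
  goto loop;               /* jmp loop */
done:
  return sum;              /* ret, result in rax */
}
```

For an input of 10 this adds 1 through 10 and returns 55, matching the first listing. A begin/end pair of builder calls is then essentially bookkeeping: the "begin" call creates both labels and emits the compare and exit branch, and the "end" call emits the increment, the jump back, and the exit label.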

For now, I do not intend to move much further into building this out. The concept has been more than proven, and I can already feel the all-encompassing obsession consume me. Instead, I leave it as an artifact to play with as-is.

I’m quite proud of it. It took me seven years, but in the end I accomplished much of what I set out to do: understand the x86 ISA (and now ARM too!) and build a JIT in the process! Even if it’s just a toy, it’s a powerful toy, and it required a lot of fiddling to get it (hopefully somewhat) right.

So, what do I intend to do with it? To be honest, I’m not quite sure yet. I’ve budgeted myself a time-boxed prototype of a simple Forth, to see how language implementation in this framework would feel. From there, we will see where things take us. If I do end up finishing the Forth, you can look forward to another post!