Friday, December 28, 2012

Syntax inference in canji

canji is a tool that aims to generate a tool-chain for virtual machines or hardware emulators given a compact description of said machine. The tool-chain contains a interpreter, jit, assembler, programmers' reference manual, etc. In this post we explore the syntax inference engine and how it is used to to generate the assembler of the tool-chain.

First some background information about the way the virtual machine is described. A language developed especially for the purpose is used in conjunction with a very relaxed grammar for the instruction syntax. This is an example of a move instruction similar to the move instruction of the x64 architecture:
    move base + scale * idx + offset, src
        mem[r[base] + scale * r[idx] + offset] = r[src];
where the first line describes the instruction's syntax in an abstract way and the second line is the concrete implementation of the instruction (mem and r are machine-global variables).

First, the syntax inference engine infers that scale and offset are use directly in integer arithmetic, thus these two operands are immediates and should be written as such. For instance #17 or #42.

Second, the inference engine derives that base, idx, and src, are used to index the register file r, thus should be written as for instance r0, r3, or r15. That is, the name of the register file (r) followed by the index of the register.

Third, the inference engine derives that the expression r[base] + scale * r[idx] + offset is used to index memory and thus should be written according to the memory access syntax, which is for instance [r0 + 2 * r1 + 2] (the Intel assembly flavor).

To sum up, these three items are used to derive a syntax for the move instruction that such that the following example is valid:
    move [r0 + 8 * r1 + 16], r8

Currently the syntax inference engine distinguish register files from memories by simply looking at their sizes. Even though this is in the general case not 100% accurate, it is accurate enough in the majority of (reasonable) cases. So it is likely that this will remain.

The syntax inference engine uses another inference engine to complete its work -- the optional operands inference engine. This engine find which operands that trivially can be made optional because they have a natural no-op value. An example of such operand is offset for the move instruction above, which can be made optional by encoding the absence of offset as if offset is 0. Thus, the following are valid move instructions:
    move [r0], r8
    move [r0 + r1], r8
    move [r0 + 16], r8

As you might realize, the inference engine make an assumption to be able to accept these three different version of the instruction. The assumption is that the + and * tokens in the move syntax description:
    move base + scale * idx + offset, src
"belongs to" the operand that directly follows them. This means that the operand is optional the + and * tokens that precedes it are also optional. For instance, the version of move without the offset operand have the following syntax:
    move base + scale * idx, src
It's a bit hard to explain in word, but pretty straight forward when you see examples like this.

So far, the canji syntax inference engine don't handle instruction flags. It is likely that this will simply be dealt with by looking at the type of the operand -- if an operand is of type u1 or s1 (meaning unsigned one bit integer and signed one bit integer, respectively).

A different approach to deal with the syntax inference is to remove the per instruction description of syntax and have a global syntax description instead. For instance, such global description could say which order source and destination operands should be, etc.

However, many assemblers handle several syntaxes, e.g., the AT&T and Intel assembler syntaxes. How should this be dealt with by canji? Well, currently there is no work on dealing with this, but it is technically not hard to do.

Monday, December 10, 2012

Generating tool-chains

In my previous post I outlined a way of generating interpreters and JIT compilers from a simple description of the state of the target (virtual) machine and the semantics of the instructions. I also linked to a prototype called jiggawatt where I implemented these ideas.

Since then I continued to explore what can be generated from such small description of the target machine:
  • Interpreter that doubles as a simple inference engine,
  • Optimizing JIT compiler,
  • Assembler and disassembler,titool
  • Instruction binary encoding, and
  • Documentation including textual description of each instruction.
Yes, generating textual description of each instruction. Not only that, the instructions are ordered such that similar instructions are grouped together in the documentation. In addition I've experimented with generating examples for each instruction showing how the instruction can be used, however, I didn't managed to get this entirely correct. Despite that, the end result is quite promising and I think there's potential for more (you find an example of the generated documentation here).

Unfortunately, since jiggawatt main purpose was to explore what could be done -- not to be a long-living project -- it has grown into a huge ball of untested code. But out of the ideas and experience a new project has been born: canji (yet, it is intentionally misspelled).


canji's ambitions is much greater than those of jiggawatt. With canji I aim to generate a whole tool-chain for a (virtual) machine: interpreter/simulator (with JIT), assembler/disassembler, a static optimizing compiler, ELF, debugger, and documentation with textual description and example code. I also aim to have a test-case generator that generates tests that verifies the interpreter and static compiler. Simpler things such as Emacs modes for the assembler is also possible and hopefully much more.

The concrete goal of canji is to generate an entire tool-chain, but the less obvious goal is to explore what implicit information that lies hidden in a machine description. For instance, to generate an assembler with operand type-checking (e.g., to check that immediate and register isn't mixed up) in jiggawatt I let the assembler generator assume that if an operand is used to index an array, that operand must be prefixed by the name of that array. For example, an instruction load that take two operands, an immediate and a destination register, must be written as follows:
    load #10, r0
assuming the array describing the register file is called r.

Assumptions like this may or may not be a good idea -- as I said canji is a project that explores what can be generated from a very tiny description of a (virtual) machine.