Friday, December 28, 2012

Syntax inference in canji

canji is a tool that aims to generate a tool-chain for virtual machines or hardware emulators given a compact description of said machine. The tool-chain contains a interpreter, jit, assembler, programmers' reference manual, etc. In this post we explore the syntax inference engine and how it is used to to generate the assembler of the tool-chain.

First some background information about the way the virtual machine is described. A language developed especially for the purpose is used in conjunction with a very relaxed grammar for the instruction syntax. This is an example of a move instruction similar to the move instruction of the x64 architecture:
    move base + scale * idx + offset, src
        mem[r[base] + scale * r[idx] + offset] = r[src];
where the first line describes the instruction's syntax in an abstract way and the second line is the concrete implementation of the instruction (mem and r are machine-global variables).

First, the syntax inference engine infers that scale and offset are use directly in integer arithmetic, thus these two operands are immediates and should be written as such. For instance #17 or #42.

Second, the inference engine derives that base, idx, and src, are used to index the register file r, thus should be written as for instance r0, r3, or r15. That is, the name of the register file (r) followed by the index of the register.

Third, the inference engine derives that the expression r[base] + scale * r[idx] + offset is used to index memory and thus should be written according to the memory access syntax, which is for instance [r0 + 2 * r1 + 2] (the Intel assembly flavor).

To sum up, these three items are used to derive a syntax for the move instruction that such that the following example is valid:
    move [r0 + 8 * r1 + 16], r8

Currently the syntax inference engine distinguish register files from memories by simply looking at their sizes. Even though this is in the general case not 100% accurate, it is accurate enough in the majority of (reasonable) cases. So it is likely that this will remain.

The syntax inference engine uses another inference engine to complete its work -- the optional operands inference engine. This engine find which operands that trivially can be made optional because they have a natural no-op value. An example of such operand is offset for the move instruction above, which can be made optional by encoding the absence of offset as if offset is 0. Thus, the following are valid move instructions:
    move [r0], r8
    move [r0 + r1], r8
    move [r0 + 16], r8

As you might realize, the inference engine make an assumption to be able to accept these three different version of the instruction. The assumption is that the + and * tokens in the move syntax description:
    move base + scale * idx + offset, src
"belongs to" the operand that directly follows them. This means that the operand is optional the + and * tokens that precedes it are also optional. For instance, the version of move without the offset operand have the following syntax:
    move base + scale * idx, src
It's a bit hard to explain in word, but pretty straight forward when you see examples like this.

So far, the canji syntax inference engine don't handle instruction flags. It is likely that this will simply be dealt with by looking at the type of the operand -- if an operand is of type u1 or s1 (meaning unsigned one bit integer and signed one bit integer, respectively).

A different approach to deal with the syntax inference is to remove the per instruction description of syntax and have a global syntax description instead. For instance, such global description could say which order source and destination operands should be, etc.

However, many assemblers handle several syntaxes, e.g., the AT&T and Intel assembler syntaxes. How should this be dealt with by canji? Well, currently there is no work on dealing with this, but it is technically not hard to do.

No comments: