Thursday, January 17, 2013

Python type checking -- or why you should learn Prolog

A while ago I wrote about the programming language Prolog. Even longer ago I wrote about a python type-checker (and compiler) I was working on at the time. That type-checker and compiler never reached a state where it could compile any python code; however, it did type-check quite complicated code.

Today I started working on a new type-checker for python that is based on Prolog. Prolog is a logic programming language with several features that make it an ideal language to implement type-checking in. Here's an example of a piece of code that it successfully type-checks:
    A = 1
    B = 1.0
    C = "c"

    def plus_C(value):
        return C + value

    def func0(a):
        global A, B, C
        A = "hello"
        result = plus_C(A)
        C = 10
        result = plus_C(B) + a
        return result


The way it works is pretty straightforward. The type-checker simply disassembles the relevant functions (using the dis module) and outputs Prolog code where each instruction is replaced with a Prolog predicate, as follows:
    py_plus_C(Module, V_value, ReturnValue) :-
        load_attr(Module, 'C', T0),
        binary_add(T0, V_value, T1),
        return_value(T1, ReturnValue).

    py_func0(Module, V_a, ReturnValue) :-
        store_attr(Module, 'A', py_str),
        load_attr(Module, 'plus_C', T0),
        load_attr(Module, 'A', T1),
        call_function(Module, T0, [T1], T2),
        store_attr(Module, 'C', py_int),
        load_attr(Module, 'plus_C', T3),
        load_attr(Module, 'B', T4),
        call_function(Module, T3, [T4], T5),
        binary_add(T5, V_a, T6),
        return_value(T6, ReturnValue).

Here, the type-checker infers the return type of func0 to be float, and the type of the argument a to be either int or float. Note that the definition of Module is omitted because it's mostly just an internal representation of the python module and not terribly interesting.
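To give a feel for the disassembly step, here is a minimal sketch of how bytecode can be turned into Prolog goals. The predicate names follow the output above, but the code itself is my illustration of the idea, not the actual type-checker; it only handles the handful of opcodes the example needs, and symbolically tracks the value stack:

```python
import dis

def to_prolog(func):
    """Translate a function's bytecode into a list of Prolog goals."""
    goals, stack, temp = [], [], 0
    for ins in dis.get_instructions(func):
        if ins.opname == "LOAD_GLOBAL":
            goals.append(f"load_attr(Module, '{ins.argval}', T{temp})")
            stack.append(f"T{temp}")
            temp += 1
        elif ins.opname == "LOAD_FAST":
            # function arguments and locals become Prolog variables
            stack.append(f"V_{ins.argval}")
        elif ins.opname in ("BINARY_ADD", "BINARY_OP"):  # opname differs across python versions
            rhs, lhs = stack.pop(), stack.pop()
            goals.append(f"binary_add({lhs}, {rhs}, T{temp})")
            stack.append(f"T{temp}")
            temp += 1
        elif ins.opname == "RETURN_VALUE":
            goals.append(f"return_value({stack.pop()}, ReturnValue)")
    return goals

C = "c"
def plus_C(value):
    return C + value

for goal in to_prolog(plus_C):
    print(goal)
```

Running this on plus_C reproduces the body of the py_plus_C clause above.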

In addition to dealing with mutable state (mutation really complicates type-checking), it handles parametrized types (e.g., list), as the following example illustrates:
    def func1(lla, lb):
        return lla[0] + lb


where it infers that the types of lla, lb, and the return value are one of:
  • list(float), float, float
  • list(int), float, float
  • list(int), int, int
  • list(str), str, str
  • list(list(X)), list(X), list(X)
  • dict(int, list(X)), list(X), list(X)
and several more alternatives. That is, the type-checker not only makes sure that a piece of code is correct with regard to types, it also infers which types are possible.

Now to the cool part.


The entire thing is 236 lines of code.


Ok, granted, it doesn't handle ifs, fors, etc., but still. The type-checker can be found here. Unfortunately, I currently don't have time to continue working on it, but the code is there and proves the idea.

Wednesday, January 16, 2013

My open source and my closed work

I usually don't write about things that actually happen to me. Instead I focus on describing tools, cool ideas, or just telling a joke. In this post, though, I'll tell an ongoing story about bringing protest into the company where I work.

protest is a neat little unit testing framework for C++ that I started working on around September last year. It's been open source under the Boost Software License since its very beginning. I started working on it for the same simple reason that has caused many hackers to start... well, hacking: scratching an itch.

The particular itch in this case is the terrible state of C++ unit testing. I've tried several testing frameworks over the years, but all of them make me say yuck, do they really want me to write that? or should I really do that manually? or, even worse, what... I can't do that?

I've already written about protest here several times, so I won't do that again. What I will do, however, is describe the process of introducing protest at my work. It started in November when I presented protest to my colleagues. They were positive and saw it as a good candidate for replacing UnitTest++, which we're currently using.

I'm working at a company that is very protective of its source code and information -- for good reasons. What I am worried about is that if we started using protest without explicit acceptance and knowledge from some manager(s), I might run into problems if the source is found on the internet by the "security police", since it has my name on it (my user name on Gitorious is my real name, just as here on my blog). If they found it under my name on the internet, they could (falsely) draw the conclusion that I brought the code outside of the company.

So, to make sure this wouldn't happen, I contacted a manager and explained the situation. Unfortunately, he contacted a person specializing in law, who looked into the matter in more detail. The response I got was we can't accept this, CompanyName might lose the right to use protest if this-and-that, which wasn't true at all, of course.

I got a bit put off by this, but I finally got back to the issue this week. My response went along the following lines:

Regardless of whether you acknowledge and accept the license under which protest is published, you should understand that any open-source software can be used by any employee at CompanyName at any time. I know for a fact that we/CompanyName use open-source licensed software; indeed, we rely on it daily.

I'm not sure if this was a good idea or not.

Friday, January 4, 2013

Documentation generation in canji

canji is a tool-chain generator that I've written about before on this blog and on gitorious. The current focus lies on inferring various attributes of instructions. Recently I've spent some time on generating a descriptive string for each instruction, which will be used for the generated instruction set manual. This post describes the approach for doing so.

Instructions are described in a C-like language, so to generate a descriptive string for an instruction, the code implementing it has to be understood on a high level. This may sound like a dauntingly complex task, but since the purpose of canji is to build virtual machines, we have a good idea of what kind of code will be processed.

For instance, the following needs to be understood:
  • reading and writing registers,
  • reading and writing memory,
  • addition, multiplication, shifting, etc.,
  • incrementing and decrementing,
  • pushing and popping,
  • branching (assigning the program counter),
  • conditional jumps, moves, etc.,
  • calling functions,
  • returning from functions,
  • and more.
In theory, there are arbitrarily many ways of expressing any of the above. In practice, though, there aren't particularly many ways of implementing the pop instruction, or the add instruction, or most of the others.

So the obvious approach to generating a descriptive string is to simply implement a bunch of real-world instructions and make the description generator produce an appropriate description for each.

This kind of work is time-consuming but simple, and in fact not so different from what an optimization pass of a compiler does. When a piece of code is optimized, it is matched against a bunch of patterns, and if any of the patterns matches, the code is rewritten according to the rules of the matched pattern.

When the description is generated for an instruction, its code is matched against a set of patterns, and when a pattern matches, a description is generated.

That's the simplest approach for doing this and the approach taken so far. It's likely that this will be extended in the future to have several layers of patterns that deal with different attributes of the code, e.g., conditionals.
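As a rough illustration of the pattern-matching idea, here is a toy sketch in python. The patterns and description templates are my own simplification -- canji matches on the syntax tree of its C-like language, not on raw strings as done here:

```python
import re

# Each pattern captures the operand names and pairs them with a
# description template; {0}, {1} refer to the captured groups.
PATTERNS = [
    (re.compile(r"r\[(\w+)\] = mem\[r\[(\w+)\]\];"),
     "Loads the value at memory address r{1} into r{0}."),
    (re.compile(r"mem\[sp\] = r\[(\w+)\]; sp = sp \+ 1;"),
     "Stores r{0} at memory address sp and increments sp."),
]

def describe(body):
    """Return a description for an instruction body, if a pattern matches."""
    for pattern, template in PATTERNS:
        m = pattern.match(body)
        if m:
            return template.format(*m.groups())
    return "No description available."

print(describe("r[dst] = mem[r[src]];"))
```

Adding support for a new kind of instruction is then just a matter of adding another (pattern, template) pair.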

Here are some examples of descriptions generated by canji and the code they're generated from:
  • Loads the value at memory address rsrc into rdst.
    load src, dst
        r[dst] = mem[r[src]];
  • Stores rsrc at memory address sp and increments sp.
    push src
        mem[sp] = r[src];
        sp = sp + 1;
  • Branches to addr if status.z == 1.
    jeq addr
        if (status.z == 1)
           pc = addr;
        else
           pc = pc + 1;
Obviously, this is not perfect, but given the low-level code it's generated from, it's pretty OK. Also, it's worth noticing that the names of instructions, variables, operands, etc., aren't used to infer any information.

The next step is to analyse the registers of the virtual machine, infer what purpose they fill, and generate a description.

Friday, December 28, 2012

Syntax inference in canji

canji is a tool that aims to generate a tool-chain for virtual machines or hardware emulators given a compact description of said machine. The tool-chain contains an interpreter, JIT, assembler, programmers' reference manual, etc. In this post we explore the syntax inference engine and how it is used to generate the assembler of the tool-chain.

First, some background on the way the virtual machine is described. A language developed especially for this purpose is used, in conjunction with a very relaxed grammar for the instruction syntax. This is an example of a move instruction similar to the move instruction of the x64 architecture:
    move base + scale * idx + offset, src
        mem[r[base] + scale * r[idx] + offset] = r[src];
where the first line describes the instruction's syntax in an abstract way and the second line is the concrete implementation of the instruction (mem and r are machine-global variables).

First, the syntax inference engine infers that scale and offset are used directly in integer arithmetic; thus these two operands are immediates and should be written as such, for instance #17 or #42.

Second, the inference engine derives that base, idx, and src are used to index the register file r, and thus should be written as, for instance, r0, r3, or r15. That is, the name of the register file (r) followed by the index of the register.

Third, the inference engine derives that the expression r[base] + scale * r[idx] + offset is used to index memory and thus should be written according to the memory access syntax, which is for instance [r0 + 2 * r1 + 2] (the Intel assembly flavor).

To sum up, these three items are used to derive a syntax for the move instruction such that the following example is valid:
    move [r0 + 8 * r1 + 16], r8

Currently the syntax inference engine distinguishes register files from memories simply by looking at their sizes. Even though this is not 100% accurate in the general case, it is accurate enough in the majority of (reasonable) cases, so it is likely that this will remain.
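The size heuristic can be sketched in a few lines of python. The threshold value here is made up for illustration -- canji's actual cut-off may well differ:

```python
def classify(storages, threshold=1024):
    """Classify each indexable storage as a register file or a memory,
    using the heuristic that register files are small and memories large."""
    return {name: ("register file" if size <= threshold else "memory")
            for name, size in storages.items()}

# r has 16 entries, mem has 2^32: the sizes alone tell them apart.
print(classify({"r": 16, "mem": 1 << 32}))
```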

The syntax inference engine uses another inference engine to complete its work -- the optional-operands inference engine. This engine finds which operands can trivially be made optional because they have a natural no-op value. An example of such an operand is offset in the move instruction above, which can be made optional by encoding the absence of offset as if offset were 0. Thus, the following are valid move instructions:
    move [r0], r8
    move [r0 + r1], r8
    move [r0 + 16], r8

As you might realize, the inference engine makes an assumption to be able to accept these three different versions of the instruction. The assumption is that the + and * tokens in the move syntax description:
    move base + scale * idx + offset, src
"belongs to" the operand that directly follows them. This means that the operand is optional the + and * tokens that precedes it are also optional. For instance, the version of move without the offset operand have the following syntax:
    move base + scale * idx, src
It's a bit hard to explain in words, but pretty straightforward when you see examples like this.
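Concretely, the accepted forms can be enumerated by dropping each optional operand together with the token it "owns". A small sketch of that enumeration (the operand names come from the move example above; the code is mine, not canji's):

```python
def memory_forms():
    """Enumerate the accepted memory-operand syntaxes for move, where
    'scale * idx' and 'offset' are optional and own the preceding '+'."""
    forms = []
    for with_scale in (True, False):
        for with_offset in (True, False):
            parts = ["base"]
            if with_scale:
                parts.append("scale * idx")  # '*' is owned by idx's group
            if with_offset:
                parts.append("offset")
            # joining with ' + ' drops the '+' along with a dropped operand
            forms.append("[" + " + ".join(parts) + "]")
    return forms

for form in memory_forms():
    print("move " + form + ", src")
```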

So far, the canji syntax inference engine doesn't handle instruction flags. It is likely that this will simply be dealt with by looking at the type of the operand -- if an operand is of type u1 or s1 (meaning unsigned one-bit integer and signed one-bit integer, respectively), it is probably a flag.

A different approach to syntax inference is to remove the per-instruction syntax description and have a global syntax description instead. For instance, such a global description could say in which order source and destination operands should appear, etc.

However, many assemblers handle several syntaxes, e.g., the AT&T and Intel assembler syntaxes. How should canji deal with this? Well, currently there is no work on this, but it is technically not hard to do.

Monday, December 10, 2012

Generating tool-chains

In my previous post I outlined a way of generating interpreters and JIT compilers from a simple description of the state of the target (virtual) machine and the semantics of the instructions. I also linked to a prototype called jiggawatt where I implemented these ideas.

Since then I have continued to explore what can be generated from such a small description of the target machine:
  • Interpreter that doubles as a simple inference engine,
  • Optimizing JIT compiler,
  • Assembler and disassembler,
  • Instruction binary encoding, and
  • Documentation including textual description of each instruction.
Yes, generating a textual description of each instruction. Not only that, the instructions are ordered such that similar instructions are grouped together in the documentation. In addition, I've experimented with generating examples for each instruction showing how the instruction can be used; however, I didn't manage to get this entirely correct. Despite that, the end result is quite promising and I think there's potential for more (you'll find an example of the generated documentation here).

Unfortunately, since jiggawatt's main purpose was to explore what could be done -- not to be a long-living project -- it has grown into a huge ball of untested code. But out of those ideas and that experience a new project has been born: canji (yes, it is intentionally misspelled).


canji's ambitions are much greater than those of jiggawatt. With canji I aim to generate a whole tool-chain for a (virtual) machine: interpreter/simulator (with JIT), assembler/disassembler, a static optimizing compiler, ELF support, debugger, and documentation with textual descriptions and example code. I also aim to have a test-case generator that generates tests verifying the interpreter and static compiler. Simpler things such as Emacs modes for the assembler are also possible, and hopefully much more.

The concrete goal of canji is to generate an entire tool-chain, but the less obvious goal is to explore what implicit information lies hidden in a machine description. For instance, to generate an assembler with operand type-checking (e.g., to check that immediates and registers aren't mixed up) in jiggawatt, I let the assembler generator assume that if an operand is used to index an array, that operand must be prefixed by the name of that array. For example, an instruction load that takes two operands, an immediate and a destination register, must be written as follows:
    load #10, r0
assuming the array describing the register file is called r.
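A sketch of what such operand type-checking boils down to. The prefix rules mirror the load example above (# for immediates, r for registers), but the code is illustrative and not jiggawatt's:

```python
import re

def check_operands(kinds, operands):
    """Check that each operand is written with the prefix its kind requires:
    immediates as #N, registers as rN."""
    for kind, op in zip(kinds, operands):
        if kind == "imm" and not re.fullmatch(r"#-?\d+", op):
            raise SyntaxError(f"expected immediate, got {op!r}")
        if kind == "reg" and not re.fullmatch(r"r\d+", op):
            raise SyntaxError(f"expected register, got {op!r}")

check_operands(["imm", "reg"], ["#10", "r0"])  # load #10, r0 -- accepted
```

Swapping the operands (load r0, #10) would raise a SyntaxError, which is exactly the kind of mix-up the generated assembler catches.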

Assumptions like this may or may not be a good idea -- as I said, canji is a project that explores what can be generated from a very tiny description of a (virtual) machine.

Sunday, November 25, 2012

Generating interpreters and JIT compilers

Based on a description of each register in a (virtual) machine and each instruction, shouldn't it be possible to generate an interpreter for such a machine? And if it's possible to generate an interpreter, shouldn't it be possible to generate a JIT compiler as well? Additionally, if an interpreter can be generated, shouldn't it be possible to generate an "inference engine" that infers various static properties of a program, e.g., that c = a + b is 1 if a = 1 and b = 0?

Of course it is possible -- the question, however, is how such a description should look. How convenient is it to write such a description? Let's start with the description of the machine's registers.

Assuming a simple machine with non-overlapping registers (i.e., no al/ah/ax/eax/rax mess), the register description is fairly straightforward, for instance:
    rf: int64[64]
    zflag: bit
which says that there is a register file called rf consisting of 64 64-bit registers and a flag, zflag, indicating whether or not the result of the last arithmetic operation was zero.

Although the above is a description of the machine's registers, it's a state-less description, in the sense that it is a purely declarative description of structure -- there is no mutation described here. Thus, this description is denoted the structural semantics.

On top of the structure of the machine are the instructions, which mutate the registers. These are described by the operational semantics. The following describes the load instruction, which writes an immediate value to the register file:
    def load(imm: int64, dst:int6):
        rf[dst] = imm
where imm and dst are the operands to the instruction -- an immediate and the (index of the) destination register, respectively. Furthermore, rf is the register file as described by the structural semantics above.

To give some credibility to my first claim -- that it is possible to generate an interpreter from this description -- note that a semantic description formulated as above can trivially be used as an interpreter (a simple text transform is enough), as the following python program indicates:
    rf = [0] * 64
    zflag = 0

    def load(imm, dst):
        def execute():
            rf[dst] = imm
        return execute

    program = [load(1000, 0), load(200, 1)]
    for instr in program:
        instr()

    print(rf[0], rf[1])
Furthermore, translating the description into a more efficient C implementation is simply a matter of more trivialities and a teaspoon of relatively straightforward type inference. But such details are not important for the argument I'm making.
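To sketch what that text transform might look like: the snippet below emits a C function from the load description. The State struct and the calling convention are assumptions of mine for illustration, not jiggawatt's actual output:

```python
def gen_c(name, params, body):
    """Generate a C function for an instruction from its description."""
    args = ", ".join(f"int64_t {p}" for p in params)
    return (f"void {name}(struct State* s, {args})\n"
            f"{{\n"
            f"    {body}\n"
            f"}}\n")

# The load instruction: rf[dst] = imm, with rf living in the machine state.
print(gen_c("load", ["imm", "dst"], "s->rf[dst] = imm;"))
```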

Now, what about generating a JIT compiler for the load instruction? Let's start by looking at the end result, that is, how such a JIT compiler can be implemented. Here's a sample implementation of the load instruction in C++:
    class load {
        int64_t imm;
        unsigned dst;
        public:
            load(int64_t imm, unsigned dst) : imm(imm), dst(dst) { }
            void emit(Xbyak::CodeGenerator& cg) {
                cg.mov(cg.r8, imm);
                cg.mov(cg.r9, dst);
                cg.mov(cg.ptr[cg.rdi + offsetof(State, rf) + cg.r9 * 8],
                       cg.r8);
            }
    };
where we use the great Xbyak framework for translating x64 assembly into binary-encoded instructions. This class can be used to emit the load instruction at runtime, that is, we have a working JIT compiler -- all generated from the four-line description above!

I'm guessing that by now you say show me the code. Sure, here is a proof-of-concept implementation. It is called jiggawatt and it can, as of the writing of this post, generate interpreters and JIT compilers that support load instructions, add and sub instructions, non-conditional jumps, and conditional jumps. So far it does not generate an optimizing JIT compiler, however that is (one of) the end goal(s).

But so what? Why would you ever need an interpreter and JIT compiler generator? Well, a back-end for DSLs as well as full-fledged programming languages is one application that comes to mind. Emulators of physical processors or other kinds of hardware are another. A third application is prototyping instruction sets and virtual machines.

Sunday, November 4, 2012

Stringify and sequence checks in Protest

Protest is the C++ unit test framework that I've been working on for a month or two and that I've written about here before. Protest improves on other frameworks by having a single über-powerful check/assert and handling fixtures really well. But since yesterday it has become even better, as this post will describe. Let's start with a simple yet important feature of any testing framework -- printing objects.

Stringify

All C++ unit test frameworks I've used suffer from the same illness -- the sorry-I-can't-stringify-anything flu. The prime example of this is std::vector, which has operator== overloaded, but no operator<<. This implies that std::vector can't be used as an argument to, for instance, CHECK_EQUAL in UnitTest++, because that macro requires its arguments to have operator<< implemented.

Protest solves this with two important features: 1) it can stringify std::*, and 2) an object without operator<< is simply not stringified. One issue remains, though: what if operator<< is implemented but the object needs to be printed differently when printed from a test? Well, of course, Protest to the rescue!

Protest doesn't hijack operator<<; it does, however, use it by default to print objects. This means that an object can be printed differently from tests than in production code. This is not yet documented on the wiki, but soon it will be. For the time being this example has to suffice (note, however, that this code has to come before #include <protest.hh>):
struct Foo { };
void stringify(std::ostream& os, Foo const&) {
  os << "Foo";
}
#include <protest.hh>

Sequence checks

A key feature of Protest is that it only has one check/assert macro, while other frameworks either have five, ten, or even twenty; or they follow the Hamcrest route and force you to write asserts such as
ASSERT_THAT(v, Each(AllOf(Gt(10), Lt(20))));
which honestly isn't particularly easy to read, nor write. Furthermore the many-checks approach and the hamcrest-approach both fail in more complicated situations. Of course, Protest tries to remedy this, and the solution is sequence checks.

Sequence checks are checks that use one or more sequence variables, which are essentially equivalent to for loops. The following Protest check is equivalent to the assert in the example above, where I is a sequence variable:
I(0, 3); // check elements 0, 1, and 2.
check(v[I] > 10);
check(v[I] < 20);
which obviously is more lines of code, but roughly the same number of characters. The neat thing with sequence checks is that they handle everything from simple array comparisons to checking relationships between functions, e.g.,
I(0, 20, 2);
check(0 == least_significant_bit(I)); // Even numbers
check(1 == least_significant_bit(I + 1)); // Odd numbers

Sequence checks improve the quality of your tests by enabling you to express invariants of the code you're testing without increasing the amount of test code needed.

Conclusion

Protest's ability to stringify all objects (with or without operator<<) avoids an annoying mis-feature of many other test frameworks. This feature does not, however, change how you write tests. As a step towards writing tests that express the logic of the production code, Protest provides sequence checks.