Thursday, January 17, 2013

Python type checking -- or why you should learn Prolog

A while ago I wrote about the programming language Prolog. Even longer ago I wrote about a python type-checker (and compiler) I was then working on. That type-checker and compiler never reached a state where it could compile any python code, however, it did type check quite complicated code.

Today I started working on a new type-checker for python that is based on Prolog. Prolog is a logic programming language with several features that makes it an ideal language to implement type-checking in. Here's an example of a piece of code that it successfully type-checks:
    A = 1
    B = 1.0
    C = "c"

    def plus_C(value):
        return C + value

    def func0(a):
        global A, B, C
        A = "hello"
        result = plus_C(A)
        C = 10
        result = plus_C(B) + a
        return result

The way it works is pretty straight-forward. The type-checker simply disassembles the relevant functions (using the dis module), and outputs Prolog code where each instruction is replaced with a Prolog predicate, as follows:
    py_plus_C(Module, V_value, ReturnValue) :-
        load_attr(Module, 'C', T0),
        binary_add(T0, V_value, T1),
        return_value(T1, ReturnValue).

    py_func0(Module, V_a, ReturnValue) :-
        store_attr(Module, 'A', py_str),
        load_attr(Module, 'plus_C', T0),
        load_attr(Module, 'A', T1),
call_function(Module, T0, [T1], T2),
        store_attr(Module, 'C', py_int),
load_attr(Module, 'plus_C', T3),
load_attr(Module, 'B', T4),
        call_function(Module, T3, [T4], T5),
    binary_add(T5, V_a, T6),
return_value(T6, ReturnValue).

Here, the type-checker infers the return type of func to be float, and the type of the argument a to be either int or float. Note that the definition of Module is omitted because it's mostly just an internal representation of the python module and not terribly interesting.

In addition to dealing with mutable state (mutation really complicates type-checking), it handles parametrized types (e.g., list) as the following example illustrates:
    def func1(lla, lb):
        return lla[0] + lb

where it infers that lla, lb, and the return type is:
  • list(float), float, float
  • list(int), float, float
  • list(int), int, int
  • list(str), str, str
  • list(list(X)), list(X), list(X)
  • dict(int, list(X)), list(X), list(X)
and several more alternatives. That is, the type-checker not only makes sure that a piece of code is correct with regards to types, it also infers which types are possible.

Now to the cool part.

The entire thing is 236 lines of code.

Ok, granted it doesn't handle ifs, fors, etc, but still. The type-checker can be found here. Unfortunately, I currently don't have time to continue working on it, but the code is there and proofs the idea.

Wednesday, January 16, 2013

My open source and my closed work

I usually don't write about things that actually happen to me. Instead I focus on describing tools, cool ideas, or just telling a joke. In this post, though, I'll tell an ongoing story about bringing protest into the company where I work.

protest is a neat little unit testing framework for C++ that I started working on around September last year. It's been open-source under the Boost Software Licence since its very beginning. The reason I started working on it was for one simple reason that has caused many hackers to start... well, hack... scratching an itch.

The particular itch in this case is the terrible state of C++ unit testing. I've tried several testing frameworks over the year, but all make me say yuck, do they really want me to write that? or should I really do that manually? or even worse what... I can't do that?

I've already written about protest here several times, so I won't do that again. What I will do however, is describing the process of using protest at my work. It started in November when I presented protest to my colleagues. They were positive and saw it as a good candidate for replacing UnitTest++ that we're currently using.

I'm working at a company that is very protective of it's source code and information -- for good reasons. What I am worried about is that if we started using protest without explicit acceptance and knowledge from some manager(s), I might run into problems if the source is found on the internet by the "security police" since it has my name on it (my user name on Gitorious is my real name, just as here on my blog). If they found it under my name on the internet, they can (falsely) draw the conclusion that I brought the code outside of the company.

So, to make sure this wouldn't happen I contacted a manager and explained the situation. Unfortunately, he contacted a person who specializes in law that looked into the matter in more detail. The response I got was we can't accept this, CompanyName might lose the the right to use protest if this-and-that, which wasn't true at all of course. 

I got a bit put off by this, but I finally got back to the issue this week. My response went along the following lines:

Regardless if you acknowledge and accept the license under which protest is published, you should understand that any open-source software can be used by any employee at CompanyName at any time. I know for a fact that we/CompanyName is using open-source licenced software, indeed, we rely on it daily.

I'm not sure if this was I good idea or not.

Friday, January 4, 2013

Documentation generation in canji

canji is a tool-chain generator that I've been writing about before on this blog and on gitorious. The current focus lies on inferring various attributes of instructions. Recently I've spent some time on generating a descriptive string for each instruction, which will be used for the generated instruction set manual. This blog describes the approach for doing so.

Instructions are described in a C-like language, so to generate a descriptive string for an instruction the code implementing it has to be understood on a high level. This may sound like a dauntingly complex task, but since the purpose of canji is to build virtual machines, we have a good idea of what kind of code that will be processed.

For instance, the following is needed to understood:
  • reading and writing registers,
  • reading and writing memory,
  • addition, multiplication, shifting, etc,
  • incrementing and decrementing
  • pushing and popping,
  • branching (assigning the  program counter),
  • conditional jumps, moves, etc,
  • call function,
  • return from function,
  • and more.
In theory, there are arbitrary many ways of expressing any of the above. In practice, though, there aren't particularly many ways of implementing the pop instruction, or the add instruction, or any/most of the other.

So the obvious approach to generate a descriptive string is to simply implement a bunch of real-world instructions and make the description generator generate an appropriate description.

This kind of work is time consuming but simple, and in fact not so much different to what an optimization pass of a compiler does. When a piece of code is optimized it is matched against a bunch of patterns and if any of the patterns matches, the code is rewritten according to the rules of the matched pattern.

When the description is generated for an instruction, its code is matched against a set of patterns, and when a matching pattern is matches a description is generated.

That's the simplest approach for doing this and the approach taken so far. It's likely that this will be extended in the future and have several layers of patterns that deals with different attributes of the code, e.g., conditional, etc.

Here's some example of descriptions generated by canji and the code its generated from:
  • Loads the value at memory address rsrc into rdst.  
    load src, dst
        r[dst] = mem[r[src]];
  • Stores rsrc at memory address sp and increments sp.
    push src
        mem[sp] = r[src];
        sp = sp + 1;
  • Branches to addr if  status.z == 1.
    jeq addr
        if (status.z == 1)
           pc = pc + 1;
           pc = addr;
Obviously, this is not perfect, but provided the low-level code it's generated from it pretty ok. Also, it's worth noticing that names of instructions, variables, operands, etc, aren't used used to infer any information.

The next step is to analyse the registers of the virtual memory infer what purpose they fill and generate a description.