Thursday, October 24, 2013

asmf -- a portable, low-level, jit assembler

So I've been busy lately with all kinds of big and small things, related and not related to programming. Recently I've been working on a jit assembler that I call asmf. You can find it here.

Why do I call it asmf, that not such a nice sounding name now is it? Well, it's an assembler (therefore the asm-part) and it's take inspiration from printf (therefore that f-part). Now you must be thinking "really? printf? are you seriously parsing strings to emit binary code?". Well yes, and no.

As you surely like code as much as I do, let's take an example before continuing:
    // Use the 'r' wildcard (meaning rax, rbx, etc) in an asmf emit statement.  
    void clear_reg(unsigned char*& dstbuf, unsigned dst) {  
        asmf("mov %r, 0", dst, dstbuf);  
I'm sure you can see the relation to printf? This code is passed to the asmf preprocessor, which outputs the following code:
    // mov %r, 0  
    void __asmf_mov__r__0(unsigned long op0, unsigned char*& bufp) {  
        unsigned char* buf = bufp;  
        buf[0] = 0x48 | ((op0 & 0x08) >> 3);  
        buf[1] = 0xc7 | (op0 & 0x07);  
        buf[2] = 0xc0;  
        buf[3] = 0x00;  
        buf[4] = 0x00;  
        buf[5] = 0x00;  
        buf[6] = 0x00;  
        bufp += 7;  
    void clear_reg(char* codebuf, unsigned dst) {  
        __asmf_mov__r__0(dst, dstbuf);  
That is, the call to the asmf function is replaced to a call to a generated function, and the string literal is dropped. It's a quite simple preprocessor in this regard.

But how does it come up with the newly generated function? This is the core of asmf and this is why I even bother publishing yet another jit assembler -- there are at least 4-5 tools/libraries out there already that does the above. But asmf is different to all those tools, and if you run sloccount on it you'll understand that it is different. In less than 500 lines of code you have a jit assembler for x64 -- and ARM, and Sparc, and your OpenRISC, etc. (Well, in theory at least, as I haven't tried this yet).

How? By being lazy. I knew that I would never be capable of writing a full jit assembler, and I knew that I would never have the patience to reverse-engineer the output of the assembler for all instructions. That why I wrote asmf to do this for me. Yes, asmf is not only a jit assembler -- it's a jit assembler generator and an instruction encoding reverse-engineer:er. This is why I say that asmf should work on every/most platform.

The major part of asmf is implemented, there are some thing related to usability (error messages, more command line switches, documentation) etc. The testing is unfortunately very x64 centered right now, and that has to fixed when ported to new platforms.

Saturday, September 14, 2013

Pythlog, assignments, and equations

Pythlog is a language I'm working on from time to time. It's a dialect of Python that incorporates constraint logic programming. So far it supports integers, list, tuples, strings, and user defined classes. There are a lot things still missing, but it already capable of some pretty cool stuff:
    def fac(n):  
        assert n >= 0  
        if n == 0:  
            return 1  
            return fac(n - 1) *  n

    print(fac(7)) # Prints 5040
    fac(w) = 5040
    print(w) # Prints 7
In this example we define the factorial function the a pretty straight-forward recursive manner. Then we call it to calculate the factorial of 7. The second to last line might appear a bit unorthodox though. What's going on here?

As said, Pythlog is a logic programming language. Such languages have a few features that set the aside from traditional imperative languages. On such feature is so call free variable. The code above is equivalent to:
    w = free
    assert fac(w) == 5040
where w is a free variable which is passed to fac. By asserting that the return value must be equal to 5040, the constraint satisfaction framework kicks in and solves w to 7.

I recently introduced the shorthand syntax fac(w) = 5040 for this. I'm not fully happy with it yet, because there are some non-obvious behaviors. I'm pretty sure there will be some changes in this area. For now though, it make the language at least look nicer, soon I hope it will also feel nicer.

Tuesday, May 28, 2013

Pretty ugly hacks

Recently I've (re-)discovered the more hacky kind of programming. It started after that I implemented my first bloom-filter. Somehow I ended up reading about all kinds of bit twiddling hacks, and finally I found myself in the realm of xor double linked lists.

Doubled linked list is a data structure where each node in the list has two pointers; one to the next node, and one to the previous node. The size of each node is thus 3 pointers. An xor linked list has the same logical function, but a compress representation -- it only requires 2/3 of the size of a normal linked list. This is achieved by observing that the only way to reach a node in a double linked list is either from the back or from the front of the list. Thus, it's enough to store the xor-sum of the pointer to the next node and the previous node, since either of those pointers is known. Anyway, the linked Wikipedia article explains it much better.

Well, the neighboring country to xor-lists is Tagged Pointer-land. In Tagger Pointer it's ok to do all kinds of crazy things to pointers as long as you know what you're doing. For instance you can exploit that dynamically allocated memory on most systems are aligned to 16-bits, 32-bits, 64-bit or even 128-bits borders. What can this be used for? Distinguishing between native integer values and arbitrary-precision integers for instance, or for putting small data structures (e.g., short strings) in the "pointer" rather than having to dynamically allocating a few bytes of memory.

It's actually pretty useful stuff, although -- as the title says -- pretty ugly. That is, the pretty kind of ugly.

Sunday, May 26, 2013

Features are social processes

For a while now I've been thinking about how human languages evolve compared to how computer languages are designed and how that relates to the features of the respective languages. In this post I will ramble at bit about how the meaning of software related terms is defined. I'll also discuss hard and soft features in programming languages and how the community surrounding the language affects, and is affected by, those features.

This is mostly a bunch of ideas and observations that I'm trying to put into words to 1) make me understand them better, and 2) make sure I don't forget them. If you expect a scientific survey, then I'm sorry to disappoint you. Maybe, though, you'll find food for your own thought and ideas.

What's a language?

As I see it, languages (be it the human or programming kind) are mutual agreement between the communicating parts of what a statement means. If everyone have a different opinion of what the following statement means, then it effectively doesn't have any meaning at all since we can't use it to communicate with anyone:
You need wood from a forest to create a fire.
or in computer-speak:
Fire fire = new Fire(forest.getWood());
On the other hand, when we agree upon what this phrase mean, then we can do all kinds of things: discuss its correctness, use it in another context, abstract it to cover more and general cases, write a compiler for it, etc.

For example, the common breed of C compiler accepts a lot of code most C programmers won't. It's the users of the C language that defines the subset of allowed-by-the-compiler-C that is acceptable for us to use. In other words, the C standard can say all it wants; its the C users who in the end defines the (practical) C language. It a bit like if physics would say "Our universe have 11 dimensions. Go use all of them!", but all known users of the universe are three-dimensional beings, thus, the accepted subset of the universe is three-dimensional. Sorry to break it to you, physics, but that's how it is.

Language features are all about what the community around the language make of the language. In a way, language features are just as much a technical aspect of the language as a social aspect of it. Example: is a language feature that's not accepted by the community really a feature, or is it just useless language complexity? More extreme example: a language without users but with extremely powerful features; effectively, does that language have any features at all?

I would also say that anyone aiming to develop a successful programming language (without the backing of a *caugh* Sun huge *caugh* Microsoft corporation) needs to have equally good eye for the technical aspect as well as the social aspect. (S)he needs to understand the social processes involved for getting a community of users who agree (hopefully with the language designer) on how the language should be used. (I think python is good example of such community, by the way).

What about software?

Developing software is also a social process. For example, you get requirements from your customer, you discuss the requirements in order to understand them, and you implement them. Implementing requirement are also a social process: you design the code by discussing it with your colleagues. And what words do you use for doing that?

You use words like object, generic, sort, inheritance, stack, tree, operation, method, message, reuse, client, algorithm, allocation, port, framework, mapping, service, channel, process, decoupled, assumption, resource, provider, input, interface... I could go on forever, but the point is that none of these words really mean anything if we humans don't agree on what they mean. The computer, framework, or programming language has no opinion on what "decouple the client's mapping algorithm from the port allocation" means, but programmers do. It's important it means that same to all programmers involved.

Soft and hard

How does this relate to programming language features? I think there two different kinds of features: hard features that was (deliberately) designed into the language, and soft features that are concepts and idioms that have evolved from using the language.

Hard features are concrete. You can make a list of hard features by reading the language specification. Soft features, on the other hand, are not. They are embedded in the community and to enumerate them you need to vibe with it for a while. Hard features are taught in classes and in books; soft features are learned by hacking, hacking, living and breathing, and some more hacking.

Example: C++ templates. Originally intended to provide better type-safety when writing generic code, like std::vector. The C++ community has then discovered that templates can be used for much, much more (like Boost.Spirit). There are a lot of template code written to implement various kinds of abstract features, e.g., compile-time if, compile-time strings, domain specific languages, etc. The hard feature is "write type-safe generic code". The soft features are "compile-time if", "embedded DSL", and even "it's hard, but you can evaluate everything at compile-time".

The D language took these soft features of C++ templates (e.g., compile-time if, embedded DSL) and integrated them into the core language. Thus, enabled more programmers to use them, because of easier syntax, error messages, compiler support, documentation, etc.

So when a C++ programmer talks about enabling or disabling some piece of code (s)he needs to think about some abstract concept like the enable-if template, while a D programming just thinks "static if". In fact, I don't think the D programmer even thinks "static if" because it's so natural to them, just as the more common "dynamic if" is so natural to all of us. The D programmer probably thinks in entirely other abstract concepts because his/her mind is free from the horrible details of C++ templates meta-programming.

You may argue that our mind is very good at abstracting and that this isn't a problem in practice, but I don't think that's true at all. Example: very few C++ programmer have every done something like an template computing the sinus of an angle, so when they're told to optimize a piece of code doing trigonometry what they'll do is to use some kind of table look-up. A D programmer, on the other hand, will simply slap together a sinus function that can be evaluated statically by the compiler because compile-time evaluation is nothing magic to her/him. (In fact, in D 2.0 match function will automatically be evaluated at compile-time if their arguments are constants).

What I'm saying here is that compile-time evaluation is a (hard) language feature of D but not of C++ (where it is an soft feature). Sure, you can in theory do compile-time evaluation of everything you need in C++ (templates are Turing complete), but not in practice because it's so hard that you actively avoid it. Thus, in most programmers conscious mind, C++ does not have compile-time evaluation. Similarly, you can do object-oriented programming in C, but you probably don't because it's hard and littered with horrible details. Thus, C is in most programmers mind not a object-oriented language. It's possible to do structured programing in Commodore BASIC, but most of us don't think of it like that. Got my point yet? Good.

Ok, so what?

By now you are probably thinking "that's all very interesting, but how is this useful?". Well, I did say that this post was a rambling, didn't I? :)

Seriously though, I don't really think any of this will make you a better software developer, but I think it could be useful if you are developing an tool and there's a community of users around it. Be sure to note what kinds of words the users uses, what features they ignore, what idioms they invent, etc. Integrate the words and idioms into the tool and it's documentation to make the experience for a new user of the application more consistent.

Friday, May 24, 2013

Small features distinguishes Protest

Protest is a unit testing framework for C++ that make testing fun, fast, and simple (see introduction). Recently, after using it on one of my own project, I've added a few more features to it.

First, I improved the compilation time for compiling a .cc file containing a test cases. This was embarrassingly bad earlier, but now I think its pretty good. The issue was that there were a lot functions defined in the header. I did a bit of preprocessing magic to avoid unnecessary function definition (in favor of function declarations) and that lower the compilation times dramatically.

Second, I made the check macro a bit more intelligent. Protest's check macro was already quite smart: it's capable of dealing with matchers, and splitting the left hand side and the right hand side in a comparison (e.g., foo != bar), which means that there is no need for several check macros as is common for C++ unit test frameworks.
Anyway, while working on some bit twiddling heavy code my tests tended to look something like this:
    test("twiddling #2") {
        check(f(10) == 0x1f);
        check(f(11) == 0x2f);
and when a test failed the following was printed: f(10) == 0x1f (30 == 31) failed. [suite/twiddling #2][].
which is all pretty nice, isn't it? Well, not really.

I found myself having to convert the decimal print-outs to hexadecimal before the error made sense to me -- bit twiddling is not easy in base-10... It didn't take long until I wished that my favorite unit test framework did this automatically for me. How nice then that I had the author of it so conveniently close. Sitting on the very same chair even!

Fast forward one hour and Protest now gives the following print-out for the example above f(10) == 0x1f (0x1e == 0x1f) failed. [suite/twiddling #2][].
that is, Protest recognizes that I used hexadecimal literals in the check and changes the base of the integers in the print-out appropriately.

Pretty convenient.

Sunday, May 5, 2013

pythlog -- python on constraint-logic-programming steroids

I've programmed in Python for about 5 years and I've programmed in Prolog for about half a year. I know my way around python-land reasonably well, but I'm only beginning to learn Prolog. What I have learnt in the last half year, though, is how hard it is to be taken serious when you tell another developer we should really use Prolog to solve this problem. Prolog is simply a too far-off world if you're used to python-land.

A lot about learning a new tool or language is how easy it is to leverage existing knowledge. And Prolog does not fair well in this regard. No side effect? No classes? No for statement? What's this funky if? Where's my list comprehension? Recursion -- for real?? You have to be kidding me! These are all responses of a hypothetical, but likely, programmer that learns Prolog.

Even though I'm advocating learning Prolog, there are some thing that just seem to fundamentally be designed to work against me, and I keep thinking if Prolog only was a tiny bit forgiving and accepted me for the programmer I am... and my tiny brain. In addition, Prolog lacks a lot of syntactic sugar for common operation, such as string concatenation.

A snake around a piece of wood. Get it?
As an experiment of how many of Prolog good properties can be moved into python-land, I've started working on a python-to-prolog compiler. The intention is primarily very egoistic: I want a language that makes my life as a programmer better. The way I see it marrying Python with Prolog (or the other way around?) can potentially bring a very powerful language. I'm calling this marriage pythlog and here's the techy elevator pitch:

pythlog is a language that melds the power of constraint logic programming with the ease of development of Python.

What does this mean? It means that not only can a lot of traditional Python code be compiled and run, such as

    def mersenne_prime(n):
        return 2 ** n - 1
    def format_string(a, b):
        return "[" + a + b + 2 * a + "]"
 but also search/equation solving such as:

    n = free # solve this variables
    assert mersenne_prime(n) == 618970019642690137449562111
    a = free
    b = free
    assert format_string(a, b) == "[abc, abcabc]"
where n, a, and b are free variables that will be bound to values by solving the above equations. In this example n is solved to 89 and a to "abc"b to ", ".
Before you get too excited, please note that state of pythlog is proof-of-concept: there is basic integer arithmetic, string, and list support. There is also some support for user defined classes.

What can be expected to be supported of the Python language? Well, so far my prototyping indicates that at least integers, strings, lists, sets, dicts, and user-defined classes can be supported. I've done a lot of prototyping to find out what can be done and what can't be done, and I think there is a lot of potential here. In fact, by using existing tools pythlog will support compilation to native code (using the gprolog compiler), which is cool in itself.

Monday, April 1, 2013

Five years of hacking and searching

Five years ago I write my first post on this blog. Looking back I realize that writing here has been like a hacking diary of some sort. I can see how my interests has shifted by looking on the topics of my posts.

The topic for this 110th post is Five years of hacking and searching, because it's not only been five years of hacking, fiddling, and playing around with ideas -- it has also been five years for searching.

Searching for what? Many things: patterns, tools, design, languages, etc. Ever since I took my first stumbling steps in the programming world by copying code from my Commodore Basic User's Manual, I've felt that there must be better ways of doing this -- this activity that we call hacking, programming, or developing, depending on if we're on couch or at the office.

At the time of User's Manual-copying I was a eight or nine year old kid with an exceptionally average brain capacity, and neither could I understand what the code I was typing did, nor could I imagine how the world of computing would develop -- or that I would make my living out of it for that matter. But that feeling of that things could be done better was there. Why did I have to manually save line numbers in my Basic program (you know, 10, 20, 30, etc)? Why didn't the computer do that?

After some time I found the renumber command which actually did this automatically. Cool, a command that help me writing my Basic program! This made me very excited: renumber was a program that operated on another program (source)! Imagine how awesome that though is for a little kid who just are beginning to program and learning what can be done with a computer.

A few years later my father brought a PC home from work. I was very interested in this machine, but what could I do with it? Well, not much it seemed -- there was no Basic! I don't recall how I found it out -- I guess hours and hours looking around on the hard drive -- but I finally found qbasic.exe. Oh, how I loved playing around in this environment. It was like nothing I had ever seen before. It was an editor. You ran the program in the editor. You could step programs line-by-line in the editor. The help of the editor and the language as integrated into the this environment. The QBasic language even had graphical modes! This was amazing for a computer enthusiast who's feet didn't reach the floor when he was sitting in his father's chair programming his computer through this blue-looking editor. Exploring what the computer memory contained through PEEK was fun, as was seeing it randomly crashing when I POKEd it.

Move forward a couple of year to the mid 1990's. Somehow my family got internet over 28.8 modem and somehow I realize that I can use internet to find solutions to my programming questions. I find support for mouse in QBasic, and crazy cool graphical tricks that I never understood. But I got some pretty cool stuff working with it. I remember writing a GUI application in QBasic with mouse and windowing support. I remember particularly fondly how I by accident stumble upon recursion while doing this. I remember very clearly how I looked at the screen popping one window, and then another when I pressed a GUI button, and when I closed the top window, the one below was still working. A totally awesome feeling of achievement that I never have been able to reach again.

Late 1990 and I'm introduced to Pascal in school. Pascal was cool because it actually felt like a real programming language. Why did it feel like a real programming language? Because there was a compiler, and you had to import modules, etc, to get things working. Pascal has some good things, but it never reached up to QBasic as a play-around-and-learn-as-you-go language. I did manage to write a platforming game that was inspired by a swedish comedy show, but for one reason or another writing in Pascal was more like fencing with the compiler than writing programs.

The next languages I learnt was Scheme at the university, which was the most eye-opening programming experience of my life. Somehow Scheme just made sense to me -- power wrapped in simplicity. Later I learnt C++, which was the complete opposite of learning Scheme -- C++ is a language that "just made sense" to me. C++ has a simple and beautiful core in it's C legacy (C is tiny beautiful language if you approach it from the assembly/machine level), but wrapped around this core are classes, templates, exceptions, etc. The result? Complexity wrapped around simplicity.

Later I learnt Java and Python at work. Java is C++ made simple, but it also C++ without the power. When I first learnt Java I was stunned by how fast I got from idea to working code. But Java isn't elegant. You can't design elegant application in Java, because the powers for doing so has been lost in the process of dumbing down C++. What about Python? It's a very nice language with tonnes and tonnes of nice libraries, but it's not a language that makes me a better programmer. It doesn't force me to come up with elegant solutions, on the contrary, Python encourages hacky solutions. Do I dislike Python? No. Java? No. C++? No.

Then half a year ago I learned Prolog. I don't think Prolog is a practical language, nor is it a language where the simple obvious solution to a problem is the solution that fits the language. However, Prolog makes me a better programmer and it's makes me (want to be) smarter. No other language I've played around with has done this -- except (Q)Basic when I was 11.

Why am I telling you all this? Well, as this five-years-old-celebration post it's partly about my life, but there is also a deeper point. A good language isn't a language with the most flexible type-system, the deepest meta-programming possibilities, the most coherent syntax, or the most well-defined semantics. A good language is a language the makes you a better programmer and in turn make you write better programs. Obviously, this implies that a good language isn't necessarily a practical language, but a good language is a good companion on your search for better solutions.

As I told you, my biggest personal achievement as a hobbyist programmer was when I was kid writing a GUI program from scratch. Would that be something I would bother doing today? Probably not -- there are several GUI toolkits available and writing one from scratch is a massive undertaking. Why did I do it? First, there was no GUI toolkit available for QBasic (the best of my knowledge); and second, I was already playing around with lines, boxes, and 16 colors on a 640x480 pixel screen. It was only natural to make something GUI-like out my 16-color boxes. The programming environment let me transition from text-based programs to simple graphical programs, and then move on to writing GUI toolkits. The environment helped me becoming a better programmer.

I'm wrapping up this post by mentioning something about my latest project. I wish to make a good (according to the definition above) and practical language. I'm doing this through the following equation Python + Prolog = Pythlog. Python is the language that I write most of my hobby hacks in, and Prolog is the language I write most of my interesting hobby hacks in. Python is a friendly language for solving problems, and Prolog is a friendly language for solving complicated symbolic problems. Pythlog is the sweet spot between these two languages. It allows Python-like programs to perform complicated symbolic operations in very concise ways. It allows equational reasoning about the code, and even running functions "in reverse" (what argument should I provide f such that it returns 17?).

Pythlog isn't particularly far developed, but there is a prototype compiler that pass some tests that show of Pythlog's strengths.

Wednesday, February 27, 2013

The cost of #include in Protest

Protest is a unit test framework for C++ unit test framework that distinguishes itself by having a single all-covering check/assert and deals with fixtures in an extremely slick way. I recommend that you check it out.

A few month ago I implemented stringification in Protest and I also wrote functions that stringifies all the classes in STL, std::vector, etc. This feature added some value to Protest but it also caused the compilation times to increase dramatically. The test suite of Protest ran in 2.5 minutes before this feature, and almost 4 minutes afterwards. What happened?

Well, what happened was #include-ing every possible header of STL. So how to fix it? Don't include every header of STL? Exactly.

But then the code won't compile as the used classes, e.g., std::vector, isn't declared. To get around this issue I did something really ugly... By surrounding the relevant code with #if/#endif, the stringification function for std::vector is only compiled if the vector header has been included by the client code. This is done by checking if the include guard used by vector is defined.

When this was implemented the test suite of Protest again finished in around 2.5 minutes.

Of course, this has to be done for each supported compiler, but that's a small price to pay considering ~40% less time spent on waiting for your test suite to compile. Actually, I haven't ported this to MSVC yet.

On a more philosophical note, how this feature was added without cost says a lot about Protest -- it questions the status quo, complexity, and lack of features in other C++ testing frameworks. Protest tries to improve.

Thursday, January 17, 2013

Python type checking -- or why you should learn Prolog

A while ago I wrote about the programming language Prolog. Even longer ago I wrote about a python type-checker (and compiler) I was then working on. That type-checker and compiler never reached a state where it could compile any python code, however, it did type check quite complicated code.

Today I started working on a new type-checker for python that is based on Prolog. Prolog is a logic programming language with several features that makes it an ideal language to implement type-checking in. Here's an example of a piece of code that it successfully type-checks:
    A = 1
    B = 1.0
    C = "c"

    def plus_C(value):
        return C + value

    def func0(a):
        global A, B, C
        A = "hello"
        result = plus_C(A)
        C = 10
        result = plus_C(B) + a
        return result

The way it works is pretty straight-forward. The type-checker simply disassembles the relevant functions (using the dis module), and outputs Prolog code where each instruction is replaced with a Prolog predicate, as follows:
    py_plus_C(Module, V_value, ReturnValue) :-
        load_attr(Module, 'C', T0),
        binary_add(T0, V_value, T1),
        return_value(T1, ReturnValue).

    py_func0(Module, V_a, ReturnValue) :-
        store_attr(Module, 'A', py_str),
        load_attr(Module, 'plus_C', T0),
        load_attr(Module, 'A', T1),
call_function(Module, T0, [T1], T2),
        store_attr(Module, 'C', py_int),
load_attr(Module, 'plus_C', T3),
load_attr(Module, 'B', T4),
        call_function(Module, T3, [T4], T5),
    binary_add(T5, V_a, T6),
return_value(T6, ReturnValue).

Here, the type-checker infers the return type of func to be float, and the type of the argument a to be either int or float. Note that the definition of Module is omitted because it's mostly just an internal representation of the python module and not terribly interesting.

In addition to dealing with mutable state (mutation really complicates type-checking), it handles parametrized types (e.g., list) as the following example illustrates:
    def func1(lla, lb):
        return lla[0] + lb

where it infers that lla, lb, and the return type is:
  • list(float), float, float
  • list(int), float, float
  • list(int), int, int
  • list(str), str, str
  • list(list(X)), list(X), list(X)
  • dict(int, list(X)), list(X), list(X)
and several more alternatives. That is, the type-checker not only makes sure that a piece of code is correct with regards to types, it also infers which types are possible.

Now to the cool part.

The entire thing is 236 lines of code.

Ok, granted it doesn't handle ifs, fors, etc, but still. The type-checker can be found here. Unfortunately, I currently don't have time to continue working on it, but the code is there and proofs the idea.

Wednesday, January 16, 2013

My open source and my closed work

I usually don't write about things that actually happen to me. Instead I focus on describing tools, cool ideas, or just telling a joke. In this post, though, I'll tell an ongoing story about bringing protest into the company where I work.

protest is a neat little unit testing framework for C++ that I started working on around September last year. It's been open-source under the Boost Software Licence since its very beginning. The reason I started working on it was for one simple reason that has caused many hackers to start... well, hack... scratching an itch.

The particular itch in this case is the terrible state of C++ unit testing. I've tried several testing frameworks over the year, but all make me say yuck, do they really want me to write that? or should I really do that manually? or even worse what... I can't do that?

I've already written about protest here several times, so I won't do that again. What I will do however, is describing the process of using protest at my work. It started in November when I presented protest to my colleagues. They were positive and saw it as a good candidate for replacing UnitTest++ that we're currently using.

I'm working at a company that is very protective of it's source code and information -- for good reasons. What I am worried about is that if we started using protest without explicit acceptance and knowledge from some manager(s), I might run into problems if the source is found on the internet by the "security police" since it has my name on it (my user name on Gitorious is my real name, just as here on my blog). If they found it under my name on the internet, they can (falsely) draw the conclusion that I brought the code outside of the company.

So, to make sure this wouldn't happen I contacted a manager and explained the situation. Unfortunately, he contacted a person who specializes in law that looked into the matter in more detail. The response I got was we can't accept this, CompanyName might lose the the right to use protest if this-and-that, which wasn't true at all of course. 

I got a bit put off by this, but I finally got back to the issue this week. My response went along the following lines:

Regardless if you acknowledge and accept the license under which protest is published, you should understand that any open-source software can be used by any employee at CompanyName at any time. I know for a fact that we/CompanyName is using open-source licenced software, indeed, we rely on it daily.

I'm not sure if this was I good idea or not.

Friday, January 4, 2013

Documentation generation in canji

canji is a tool-chain generator that I've been writing about before on this blog and on gitorious. The current focus lies on inferring various attributes of instructions. Recently I've spent some time on generating a descriptive string for each instruction, which will be used for the generated instruction set manual. This blog describes the approach for doing so.

Instructions are described in a C-like language, so to generate a descriptive string for an instruction the code implementing it has to be understood on a high level. This may sound like a dauntingly complex task, but since the purpose of canji is to build virtual machines, we have a good idea of what kind of code that will be processed.

For instance, the following is needed to understood:
  • reading and writing registers,
  • reading and writing memory,
  • addition, multiplication, shifting, etc,
  • incrementing and decrementing
  • pushing and popping,
  • branching (assigning the  program counter),
  • conditional jumps, moves, etc,
  • call function,
  • return from function,
  • and more.
In theory, there are arbitrary many ways of expressing any of the above. In practice, though, there aren't particularly many ways of implementing the pop instruction, or the add instruction, or any/most of the other.

So the obvious approach to generate a descriptive string is to simply implement a bunch of real-world instructions and make the description generator generate an appropriate description.

This kind of work is time consuming but simple, and in fact not so much different to what an optimization pass of a compiler does. When a piece of code is optimized it is matched against a bunch of patterns and if any of the patterns matches, the code is rewritten according to the rules of the matched pattern.

When the description is generated for an instruction, its code is matched against a set of patterns, and when a matching pattern is matches a description is generated.

That's the simplest approach for doing this and the approach taken so far. It's likely that this will be extended in the future and have several layers of patterns that deals with different attributes of the code, e.g., conditional, etc.

Here's some example of descriptions generated by canji and the code its generated from:
  • Loads the value at memory address rsrc into rdst.  
    load src, dst
        r[dst] = mem[r[src]];
  • Stores rsrc at memory address sp and increments sp.
    push src
        mem[sp] = r[src];
        sp = sp + 1;
  • Branches to addr if  status.z == 1.
    jeq addr
        if (status.z == 1)
           pc = pc + 1;
           pc = addr;
Obviously, this is not perfect, but provided the low-level code it's generated from it pretty ok. Also, it's worth noticing that names of instructions, variables, operands, etc, aren't used used to infer any information.

The next step is to analyse the registers of the virtual memory infer what purpose they fill and generate a description.