Monday, January 25, 2010

Detecting function inlining

By definition, compiler optimizations should only affect the performance of an application: the optimized code should behave exactly like the non-optimized code. If that is the case, how can we detect whether a function has been inlined? Dear reader, read on (that's what readers do, right?).

Before we continue, let's take a step back and ask ourselves why it would ever be useful to detect function inlining. Let's take an example. Imagine the following code is part of a large application developed at a company where we pair program. The function createObject() allocates an object on the stack (that is, as a local variable) and returns a pointer to it:
int function() {
    Object* o = createObject();
    return o->field;
}
Although it's bad practice to return a pointer to a local variable, function works and does precisely what you'd expect: it creates an object and returns one of its fields. As developers with good taste, we decide to clean up the code a bit by introducing a getter for field:
int function() {
    Object* o = createObject();
    return o->getField();
}
We compile the code and run the tests. There is a green light. Everything is good. We commit the code to the repository.

Not so fast! After a while the automated build system reports a whole lot of failing tests. What happened?

Here's what happened: by introducing a function call to getField we changed how the stack looks when the field field is read. And since the object is stack allocated, it is overwritten by the call to getField. Being good team players, we don't want to have failing builds for long, so we revert our last check-in and start to inspect the problem more closely.

We quickly determine that the only failing builds were the non-release builds, that is, the builds that did not optimize the code. We stare into the air and ask ourselves... why?

At this point one of us remembers a blog he had read the night before about function inlining. He explains: "if the call to createObject is inlined into function, then the object is allocated as a local variable in function instead of in createObject. This means that when getField is called, the object is not overwritten, because it's allocated in a different place on the stack. In a different stack frame, to be more precise."

The other developer thinks about this for a few seconds and replies "ok, then let's just allocate the object on the heap instead of the stack". Before the blog-reading developer can say "memory leak" he adds "and, of course, delete it when we're done with it".

After a few minutes of coding and half an hour of waiting for the compiler to finish, we have a solution that works on both optimized and non-optimized builds. We commit the code and leave for the day.

When we get back the morning after, we have an angry mail in our inbox. Apparently our little fix made the performance of the application degrade horrendously -- we had placed a call to new inside an extremely tight loop. Doh! Stupid us. A better solution is needed. In order to understand the problem properly the pair programmers summarize:
  • For optimized builds the function is inlined, so stack allocation works fine. With stack allocation the performance of the application is acceptable.
  • For non-optimized builds the function is not inlined, so stack allocation does not work and we have to resort to heap allocation. However, heap allocation is not acceptable performance-wise.
When the problem is so clearly stated, the blog-reading developer exclaims "let's allocate on the stack if the function is inlined, and allocate on the heap otherwise!"

Ok, now let's leave those two developers; I think they can handle it by themselves from here. I'm now going to explain how to do the trick the blog-reading developer suggested.

For every function that is called there is a stack frame containing the function's local variables. Every stack frame has an address, which on x86 is held in a register called EBP (the extended base pointer, also known as the frame pointer). With GCC, this register can be read using the following code:
void* stack_frame() {
    register void* ebp asm("ebp");
    return ebp;
}
This function returns a pointer to the stack frame that is created when entering the function. So, as you can see, it's really simple to get a pointer to the current stack frame. Let's write another function and pass it the stack frame pointer of its caller:
bool is_inlined(void* callers_ebp) {
    register void* my_ebp asm("ebp");
    return my_ebp == callers_ebp;
}
This function is kind of magic: it returns true if it's inlined into its caller and false otherwise. Using the same idea we can write a function that allocates on the stack if it is inlined, and on the heap otherwise:
Object* createObject(void* callers_ebp) {
    register void* my_ebp asm("ebp");
    if (my_ebp == callers_ebp) {
        Object o;  // allocated in the caller's frame, since we're inlined
        return &o;
    }
    return new Object();
}
Pretty sweet.

Note that the EBP register is not used as a frame pointer when compiling with -O2 or -O3 unless you pass -fno-omit-frame-pointer. It should be possible to do the same trick without EBP by using the ESP register instead, but I haven't tested that.

(Disclaimer: I'm not suggesting that this is a good way of solving the problem the two developers in the story faced. What I am saying is that in those rare circumstances when the behavior of a function needs to change depending on whether it's inlined, this is one way to do it.)

Sunday, January 24, 2010

Faster and optimization friendly: speculative stack allocation

I've had an idea for a memory allocation algorithm for some time. It's based on the idea behind escape analysis. Escape analysis inspects the source code and determines the dynamic scope of an allocated object. For example, in the following Java code:
void printNumber(int i) {
System.out.print(new Formatter(i).hex());
}
escape analysis can improve performance by determining that the Formatter instance cannot escape the stack frame it was allocated in. Thus, the Formatter instance can be allocated on the stack instead, which (greatly) improves performance. Since Java 6u14 the JVM has had experimental escape analysis.

For languages such as C++, which rely on static compilation and optimization, escape analysis makes even more optimization possible. Consider the following C++ code:
int two() {
    Cons second(2, NULL);
    Cons first(1, &second);
    return first.cdr()->car();
}
which creates a linked list of cons cells, (1 (2 NULL)), and returns the first value of the second cons cell. That is, the function returns 2. GCC, like most other compilers (I'll write GCC in this post, but read it as your favourite C++ compiler), will gladly optimize this to simply return 2; it doesn't even bother to create the Cons objects at all (unless the Cons constructor has some kind of side effect, like printing). Furthermore, since GCC now knows that this function simply returns 2, it can replace every call to two with the value 2. Thus, the overall performance increases even more.

However, what if we cannot allocate the Cons cells on the stack... what if we must allocate on the heap? What kind of code does GCC generate then? I'll spare you the assembler listing, but let me tell you it isn't pretty. It's a far cry from the assembler equivalent of return 2;. Also, heap allocation is an extremely costly operation compared to stack allocation. So not only is heap allocation expensive, it also stops GCC from performing any kind of smart optimization.

So what does escape analysis have to do with my idea for memory allocation? The idea is to perform escape analysis at run-time instead of at compile-time. That is, determine at run-time whether an object is reachable from outside its stack frame, and if it is, move it to the heap.

This boils down to inserting checks at certain places in the code, e.g., when returning a pointer from a function. These checks determine whether the returned object will be reachable from a stack frame with a longer lifespan than the stack frame in which the object was allocated.

Now, inserting checks at certain places may not sound like a very good idea, but the application I have in mind right now is not manually written code; it's generated code. A proper implementation of this idea would only (explicitly) allocate objects on the stack and then (implicitly) move objects to the heap when needed. Thus the code for allocating an object always looks the same; it does not depend on the lifetime of the allocated object. Code that looks the same is easier to generate, so this is a good thing for code generators. Furthermore, the extra checks mentioned above are also easy to generate, since they too always look the same. Pretty neat.
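To make that last point concrete, here is a hypothetical sketch of how a code generator could emit the very same escape check at every pointer-returning site. The names escapes, move_to_heap, and __frame are made up for illustration; they stand in for whatever runtime helpers a real implementation would provide.

```python
# Hypothetical sketch: every pointer-returning site gets the same generated
# escape check, so the generator never needs lifetime information.
# "escapes", "move_to_heap", and "__frame" are made-up runtime helpers.
def emit_return(var, type_name):
    return [
        f"if (escapes({var}, __frame))",
        f"    {var} = ({type_name}*) move_to_heap({var}, sizeof({type_name}));",
        f"return {var};",
    ]

print("\n".join(emit_return("first", "Cons")))
```

Because the emitted text depends only on the variable and its type, the generator can use the same template everywhere, which is exactly what makes generated code a good fit for this scheme.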

Another important thing to note is that the algorithm decides whether an object should be moved to the heap using data that is only available at run-time. This means the algorithm is independent of the complexity of the code and of any compile-time knowledge of the data. Normal (static) escape analysis, on the other hand, may fail when the code is very complex, e.g., has many execution paths that depend on statically unknown data.

There is one more important thing about this approach: it plays well with GCC's optimizations. Since every object is stack allocated (until the run-time system notices that it needs to be moved to the heap), optimizations like the one discussed above for the function two can still be done by GCC. That's pretty neat as well.

I've implemented a prototype of this idea here. It's not at all well-tested and I haven't tried it on any machine other than my laptop, so there are probably a lot of problems with it. The idea, however, I think is not entirely insane and could be useful.

Friday, January 15, 2010

Your decompiler is my compiler

Last night while brushing my teeth I thought about a recent project of mine: compiling Java to native code. The basic idea is the same as in my other project, compiling Python statically: translate the byte code into C/C++ and then use any C/C++ compiler to compile it to native code.

There is a neat tool called javap that prints a bunch of interesting things about a Java class: the byte code of all its methods and the types of its fields, among other things. I simply translated the byte code sequence into something gcc could compile, and I got a primitive Java compiler that compiles to native code. Well, right now it only supports a small subset of Java... and it probably will stay that way, too. :) I have a good habit of starting many projects and a bad habit of never finishing them.

As I've discussed before, Java and C++ are textually quite similar. For instance, is the following code Java or C++?
if (var >= 0) return new ArrayList<String>("hello");
(I intended it to be Java, but the only reason it is more Java-ish than C++-ish is the names of the types: ArrayList instead of list and String instead of string.)
As you probably know, there are several decompilers for Java. Just google it and you'll find some. What these decompilers do is print Java code that behaves exactly like the original byte code. No surprise there; that's what decompilers do!

Since Java and C++ are textually such similar languages, much more similar than Java and C or Java and Ada, it's pretty straightforward to turn a Java decompiler into a bytecode-to-C++ compiler. That may sound like turning the decompiler inside out, but the only differences are (comparatively) small changes to the textual output.

Of course, such a compiler will not support any of Java's good stuff, such as reflection, but that's not the point here. The point is that it's possible to make a compiler for Java by hacking an existing decompiler, and doing so is fairly easy because a C++ compiler is reused for the hard parts.

Total awesomeness.

Wednesday, January 6, 2010

Print sucks, tuples rocks: disassembling Python

I recently worked on compiling Python by translating it to C++. I did this using the dis module, which prints the bytecode of a Python function. Note the word prints: there is no (simple) way of getting the bytecode of a function as a list or similar; it is just printed to the console.

At first I parsed the textual output that dis gives, but this turned out not to work for more complex things. Basically, the textual output did not contain all the information I needed (printing repr(obj) throws away information for some values). What to do?

Well, luckily Python is distributed with the source code of dis, so it was easy to make a copy of the function that disassembles Python functions into bytecode and rewrite it a bit. Essentially, instead of calling print, everything is stuffed into a list. Thus, you get a list of tuples instead of textual output on the screen. Each tuple holds the source code line, the byte code line, the opcode and the argument to the opcode, making them easy to access programmatically. If you would like to see the source, look here (the first 60 lines).

More importantly though, this patch does not just make the dis module a bit easier to use. As I said earlier, the textual output of dis throws away certain important information. However, this patch retains the actual values instead of converting them to strings using repr. This is really important to me because it makes it possible to translate certain code that would otherwise have been impossible. More precisely, it makes it possible to translate lambdas, list comprehensions, etc., from Python to C++. The reason is that instead of getting repr(function_object) I now get function_object itself. Awesomeness!
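A sketch of the same idea on top of the modern standard library: Python 3.4 and later ship dis.get_instructions, which yields Instruction objects programmatically, and their argval field keeps the actual value (a lambda's code object, say) rather than its repr() string:

```python
import dis

def outer():
    return lambda x: x + 1

# Collect (offset, opname, argval) tuples programmatically, no print parsing:
instructions = [(ins.offset, ins.opname, ins.argval)
                for ins in dis.get_instructions(outer)]

# argval holds real values: the lambda's code object survives, so a
# translator can recurse into it instead of seeing only a repr() string.
code_objects = [argval for _, opname, argval in instructions
                if hasattr(argval, "co_code")]
print([ins.opname for ins in dis.get_instructions(code_objects[0])])
```

The exact opcode names printed vary between Python versions, but the point stands: the inner code object is a real object you can disassemble further, which is exactly what translating lambdas requires.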

Oh, did I mention that Python rocks?

Of banks and doorbells

This is an amusing kafkaesque story from real life by Neil Fraser at Google.

Tuesday, January 5, 2010

History of Microsoft

I just saw a few episodes of the web show The History of Microsoft on Channel 9, which I thought was quite interesting. Channel 9 is a web site run by Microsoft, so I assume the show focuses on the good parts, but it's still a good watch.

Sunday, January 3, 2010

Statically typed and compiled Python

It's been a while since I wrote something here. In spring last year, that is 2009, I started a really fun project: writing a compiler for a language I invented myself (previously mentioned here). Ok, the "compiler" only compiled to C++ code, but still...

The language is called Fit and is a compiled (of course :)) statically-typed prototype-based language. It's heavily inspired by Java and Ruby and has first-class functions, generators, list literals, and string interpolation, among other things. Not all of these things are completely supported by the compiler; there is only a rudimentary (by which I mean buggy) implementation.

I worked a lot on this project. Many late nights. Four hours of sleep before getting up for work wasn't unusual. Naturally I got really tired, and when the compiler was able to compile a decent subset of the language I originally imagined, I stopped working on it. It just wasn't fun anymore.

I learnt a lot from this project: writing parsers, writing efficient C++ code, optimization tricks, how to implement closures, and... oh yeah, I learnt Python. Why? The compiler is written in Python.

Here's the funny part: after working on Fit and its compiler for a while, I realized that the language I was trying to create was essentially a statically typed Python with braces. And after writing Python for a few weeks I was ready to throw away the braces. What's left is a statically typed Python (take a look at Boo if this kind of language sounds interesting). The rest of this post is about creating a compiler for a statically typed Python with minimal effort.

I actually tried writing a parser for such a language, but I failed miserably, so I dropped the idea. However, I have recently realized that writing a parser is not necessary to create a statically typed Python. Here's why: in Python 3 it is possible to annotate functions to indicate the types of the arguments. Example:
def plus(i: int, j: int) -> int:
    return i + j
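These annotations are not just syntax; Python stores them on the function object itself, in its __annotations__ attribute, which is where all the type information used below comes from. A quick check in Python 3:

```python
def plus(i: int, j: int) -> int:
    return i + j

# The annotations are stored right on the function object:
print(plus.__annotations__)
# → {'i': <class 'int'>, 'j': <class 'int'>, 'return': <class 'int'>}
```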

These annotations provide the type information needed to convert Python to C++ code, which can then be statically type-checked and compiled. The first step is to get the signature of the C++ function. The following code returns the C++ function signature corresponding to a given Python function:
def signature(func):
    import inspect
    spec = inspect.getfullargspec(func)
    c_args = [spec.annotations[arg].__name__ + " " + arg for arg in spec.args]
    return_type = spec.annotations['return'].__name__
    return return_type + " " + func.__name__ + "(" + ", ".join(c_args) + ")"
For example, calling signature(plus) results in the string int plus(int i, int j), which of course is the signature of plus in C-speak.

Ok, so how is this useful, you may wonder? We need to convert the body of a function, not just its signature. Well, here's where another great Python module comes into action, namely dis, which disassembles a Python function into bytecode. For example, when called with plus as its argument, dis gives us:
1 0 LOAD_FAST 0 (i)
3 LOAD_FAST 1 (j)
6 BINARY_ADD
7 RETURN_VALUE

I won't go into details here, but it's fairly easy to translate this bytecode sequence into C++ code. I've written a translator that supports only a few bytecodes (LOAD_FAST, LOAD_CONST, STORE_FAST, BINARY_ADD, BINARY_MULTIPLY, and RETURN_VALUE) and has limited support for types. On the other hand, I only spent a few hours on it. That's the "minimal effort" I was talking about earlier. :) The following Python code is a small example of what the translator currently supports:
def testing(i: int, j: str) -> int:
    text = "hello"
    dog = 9.9
    donald = i
    goofy = dog
    donald = donald + 100
    daisy = donald * 4 + donald
    return daisy * 223

And this is the corresponding C++ code:
int testing(int i, str j) {
    str text = "hello";
    float dog = 9.9;
    int donald = i;
    float goofy = dog;
    donald = (donald + 100);
    int daisy = ((donald * 4) + donald);
    return (daisy * 223);
}

As you can see in this example, the C++ code matches the Python code quite well. Also, note that local variables are declared and typed automatically. Awesome! The best part is that the whole translator is only 100 lines of Python! Of course, it doesn't support all bytecodes and does not have proper type inference, but it definitely works as a proof of concept. As I said, I only spent a few hours on the translator.
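To give a feel for how such a translator can work, here is a hypothetical miniature of the idea: a tiny stack machine that folds (opname, argument) pairs, like those produced by the patched dis from the previous post, into C++ statements. The real translator handles more opcodes and does the type handling; the instruction pairs below are written by hand to keep the sketch self-contained.

```python
# Hypothetical miniature of the bytecode-to-C++ idea: Python bytecode is a
# stack machine, so we simulate the stack with C++ expression strings.
def translate(instructions):
    stack, lines = [], []
    for opname, arg in instructions:
        if opname == "LOAD_FAST":
            stack.append(arg)                    # variable reference
        elif opname == "LOAD_CONST":
            stack.append(repr(arg))              # literal constant
        elif opname in ("BINARY_ADD", "BINARY_MULTIPLY"):
            op = "+" if opname == "BINARY_ADD" else "*"
            right, left = stack.pop(), stack.pop()
            stack.append(f"({left} {op} {right})")
        elif opname == "STORE_FAST":
            lines.append(f"{arg} = {stack.pop()};")
        elif opname == "RETURN_VALUE":
            lines.append(f"return {stack.pop()};")
    return lines

# Bytecode of "return i + j", as in the plus example above:
print(translate([("LOAD_FAST", "i"), ("LOAD_FAST", "j"),
                 ("BINARY_ADD", None), ("RETURN_VALUE", None)]))
# → ['return (i + j);']
```

The parenthesized output also shows where the extra parentheses in the generated testing function come from: every binary operation wraps its result, so operator precedence never needs to be thought about.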

So, assuming this translator ever gets finished, what could it be used for? Anything, I would say. As a general purpose compiled language just like C++ or Java (if you consider that compiled), or as a replacement for normal Python when performance or memory consumption is an issue. I also find it really neat that you have an interpreted language and a compiled language in one; the same language can be used to hack together a small prototype and to write a large well-specified application. In other words, you can stay in the same language and simply "refactor" your hacky prototype into a not-so-hacky application. I find this a very attractive idea.