Sunday, January 24, 2010

Faster and optimization friendly: speculative stack allocation

I've have had an idea for a memory allocation algorithm for some time. It's based on the idea behind escape analysis. Escape analysis analyzes the source code and determines the dynamic scope of an allocated object. For example, in the following Java code:
void printNumber(int i) {
System.out.print(new Formatter(i).hex());
}
escape analysis can help improve performance by determining that the instance of Formatter cannot escape the stack frame it was allocated in. Thus, the Formatter instance can be allocated on the stack instead, which (greatly) improves performance. Since Java 6u14 there has been experimental escape analysis in the JVM.

For languages such as C++, which rely on static compilation and optimization, escape analysis make it possible to do even more optimization. Consider the following C++ code:
int two() {
Cons second(2, NULL);
Cons first(1, &first);
return first.cdr()->car();
}
which creates a linked list of cons cells (1 (2 NULL)) and returns the first value of the second cons cell. That is, the function returns 2. GCC (and most other compilers as well (I will use the name GCC in this post meaning your favourite C++ compiler)) will gladly optimize this to simply return 2, it doesn't even bother to create the Cons objects at all (unless there are some kind of side-effect in the Cons constructor, like printing). Furthermore, since GCC now knows that this function simply returns 2, GCC can replace all calls to two with the value 2. Thus, the overall performance will increase even more.

However, what if we cannot allocate the Cons cell on the stack... what if we must allocate on the heap. What kind of code will GCC generate then? I'll spare you the list of assembler code here, but let me tell you it isn't pretty. Its a far cry from the assembler equivalent of return 2;. Also, allocating on the heap is an extremely costly operation in comparison to stack allocation. So, not only is heap allocation expensive, it also stops GCC from performing any kind of smart optimizations.

So what has escape analysis to do with the idea I have for memory allocation? The idea is to perform escape analysis at run-time instead of at compile-time. That is, determine at run-time if an object is reachable from outside its stack frame; and if it is move it to the heap.

This boils down to insert checks at certain places in the source code, e.g., when returning a pointer from a function. These check determines if the object (which is returned) will be reachable from a stack frame with longer lifespan than the stack frame in which the object was allocated.

Now, inserting checks at certain places may not sound like a very good idea, but the application I'm thinking of for this right now is not manually written code; it's generated code. A proper implementation of this idea would only (explicitly) allocate object on the stack and then (implicitly) move object to the heap when needed. Thus the code for allocating an object always looks the same; it does not depend on the lifetime of the allocated object. Code that look the same is easier to generate, so this is a good thing for code generators. Furthermore, extra checks that I mentioned above should be easily generated by a code generator since those checks also always look the same. Pretty neat.

Another thing that is important to note is that the algorithm uses data that is only available at run-time when deciding if an object should be moved to the heap or not. This means that this algorithm is independent on the complexity of the code or knowledge of data. On the other hand, normal (static) escape analysis may fail if the code is very complex, e.g., a lot of execution paths that depends on statically unknown data.

There is one more thing with this approach that is important: it plays well with GCCs optimizations. Since every object is stack allocated (until the run-time system notices that it needs to be moved to the heap), optimization like the one discussed above for function two can be done by GCC. That's pretty neat as well.

I've implemented a prototype of this idea here. It's not at all well-tested and I haven't tried it on any other machine than my laptop, so there are probably a lot of problems with it. The idea, however, I think is not entirely insane and could be useful.

No comments: