Programmatically Speaking: June 2010

As a kid, most of what you did was related to learning. Some call it "playing" or "being curious", but whatever you call it, it's learning in one form or another. And learning hurts. A lot.

So you start going to school to make learning not hurt so much. As a 7-8 year-old you enjoy school, because it makes learning much easier. Also, you can play with your class mates, like say, play soccer. Oh, by the way, playing soccer makes you run faster longer, gives you better balance, gives you better feeling of how flying objects behaves (like a ball), better timing your own movements to others (like your team mates). It also makes you better winning and loosing (you aren't a bad looser, are you? How about a bad winner?), and it makes you better at focusing on a particular task, and making other (your team mates) focusing on the same task as you. It's all learning.

As you grow older you become more comfortable with most things you do because you know them. When you finally leave school to start working as a programmer you lost all will to do things that hurts... because you lost your will to learn (in comparison to how much you wanted to learn stuff as a kid). You lost your will to discover new things.

Stupidity hurts

So, when stated with a problem that forces you to do something you don't know or something you really don't like, you stall. Stall, stall, stall, because you don't want to do it. You want to do something fun. Something you know how to do. Something that makes you feel secure and comfortable. Something that makes you feel less stupid!

Yes, you are stupid. I'm stupid. We are all stupid. No one know everything, so everyone is stupid at something. And we will remain stupid at those things unless we learn how to do them better. But... Ouch! Remember? Learning hurts! So we think to yourselfs "Ooh, I don't want to do that!", or put slightly different "I may get hurt doing that! I'd rather stay stupid!". Imagine if you thought that way as a 8-year-old kid in school... or at the playground, or at the soccer field. We wouldn't get many marathon runners or Nobel prize winners that way, would we?

The cure

It's really easy: dare to be stupid. Dare to ask stupid questions. After a while, you'll notice that you start to ask really hard questions, and before you know it you'll ask questions no one (you know) have answers to. And then, as it happens, people will ask you instead. If they dare to of course (hint: they should).

In fact, I think that people will be more likely to ask you (stupid) questions because they'll seen or heard (of) you ask (stupid) questions. And you know what, people do as other people do. Especially like people they look up to, and since you know so much (from asking stupid questions) they look up to you. Isn't that neat?

The workplace

If you have colleagues who don't mind potentially looking stupid by asking a question, you've work at a good company, I think. But I'm not really someone who read or thinks about this a lot, so I may be wrong. There may be effect here that I'm clueless about. Anyway, my feelings are that I'd be more at home at a company like that than a company that encourages silence and really intelligent questions only. But I'm just me. You are you, and you may feel entirely different about this.

Only optimize when you found a bottleneck by measuring. That's what we all have been taught, right? It's a good rule. But is it really right? What if you're designing a performance critical application? Should you really be ignorant about performance until you measured and found a bottleneck?

I say: don't optimize, but plan for optimizability. I think Dwight D. Eisenhower said it best, though

Plans are nothing; planning is everything.

And now a disclaimer. I am not encouraging over-design in this post. Neither am I proposing that you should analyze every silly detail of your application domain before you start coding. I encourage you to start hacking on you application without worrying about performance, because you know (if you follow the tips that follows and use your brain) that you can optimize it later. Knowing that you can rewrite that slow and memory-hungry Ruby script into a snappy C program (that don't use any heap memory at all) makes it so much easier to sleep at night. Believe me.

Ok, now let's get going!

Caring about planning

Planning for optimizability means that you make sure that:

seams are placed to allow optimizing,
system-wide concerns are encapsulated,
it's possible to do design lowering later on,
algorithmic decisions are delayed,
intelligence can be moved from data to code to data,
it's possible to move run-time aspects to compile-time,
you know your tools.

We will now go through the first three of these bullets one by one and discuss them in detail. I will discuss the remaining four in a later post.

Seams and optimization borders

A seams in software design is a some-what vaguely defined concept. If two components of a software design are loosely coupled, we say that there is a seam between them. In other words, seams separate the design into parts that can vary independently of each other. Inheritance and dependency injection is often used to achieve this in object-oriented systems; function pointers play a similar role in procedural programming.

Since planning for optimizability means that we wish to make it easy to replace slow code with faster, we thus should place seams in the design such that they surround potentially slow code. This implies that design seams also are optimization borders; borders over which optimizations cannot easily be done without affecting other parts of the system.

This doesn't mean that it's impossible to optimize over optimization borders, just that other parts of the system will be affected by it. Thus, optimizing over optimization boarders is harder than optimizing within them. Let's take an example.

Assume that we iterate over a a large list in a performance critical part of our application:

void Foo::iterateOverLargeList(std::list<Thing> l) {
 for (int i = 0; i < l.size(); i++)
   l[i].doEasyCalculation();
}

Where should we place the design seams in this code? Or, more concretely, how do we use virtual functions to enable us to optimize this piece of code as much as possible without affecting the surrounding code? As I see it, we have the following alternatives:

make Thing::doEasyCalculation virtual,
make Foo::iterateOverLargeList virtual, or
encapsulate std::list<Thing> in a new class ThingList, move iterateOverLargeList to ThingList, and make it virtual.

Of these three alternatives 3) gives us the most freedom to increase the performance of the loop, because it allows us to do not only optimize how we iterate, but also change the underlying data structure. For instance, changing from std::list (a linked list) to std::vector (essentially a linear array) will make the access pattern linear, thus the (L1, L2, L3) cache can do a better job.

However, 3) is also the option that is the hardest to refactor to, because we have to change every part of the application that uses the std::list<Thing> list. So, if our application is large, this may involve a lot of refactoring work. Thus, if we choose 3) early in the development we don't have to refactor so much later. In other words, early understanding of the domain, the application's use-cases, and the application design, is important for us when designing for optimizability.

This example touches on another important tool when designing for optimizability: encapsulation. We will now discuss this in greater detail.

Concerning encapsulation

One of the worst kind of optimizations you can do is to optimize something that affect the entire system, e.g., the data flow. A global property like is a system-wide concern and having the system's performance depend on such property is very very bad. We need to encapsulate these global properties to make sure that there are design seams that enable us to optimize them when we need to.

This means that we need to design our code such that if we need to optimize something, that something is encapsulated into a single class (or few classes). This does not mean that we should encapsulate every little petty detail, though. It does mean, however, that we need to put a bit more effort in understanding our application, and then think about its design, data flow, etc. This is actually something we should do anyway even if we're not designing for optimizability, because understand the current design of the application make it easier to improve it.

When we got a firm understanding of the application and how it will be used, then we can identify the system-wide concerns, and only then can we start to encapsulate them to make them possible to optimize in the future.

Avoid memory copying

Reading data from memory is one of the most costly thing a modern CPU can do. If the data is in (L1) cache reading memory is fast (3-5 cycles or so), but if the data needs to be read from main memory it takes 100 times longer. That's right: 500 cycles to read a single bit of data. Writing data to main memory is also very costly operation, thus, copying is extremely costly operation.

With this in mind, it may be tempting to return a pointer to a shared data buffer instead of returning a copy of that data. For example:

class Thingie {
 int* stuff;
 // constructor that initializes 'stuff'.
 int* getStuff() { return stuff; }
};

This may look efficient since there is no memory copying going on. However, this design is horrible with regard to encapsulation. For instance, what if we need to modify the returned data in some other part of the program without affecting the Thing instance? In that case we need to make a copy and modify the copied data. But what if that copied data is returned just like the int* stuff above? And what if it needs to be modified in some other part of the program? Then we need to make another copy!

In situations like this, you need to take a step back and look at the data flow through your application and then try to optimize it to minimize the number of copies. This is not an easy task, so we really wish to avoid doing it, or at least make it easier for us to do. Is there some way to design our application (with little extra effort) from the start to make optimizations like this possible (without huge redesigns)?

In the example above, the system-wide concern is how the int* stuff data is passed around in the application and how to make sure that it is not copied unnecessarily.

Let's rewrite this example a bit into:

class Thingie2 {
 MyBuffer stuff;
 // constructor that initializes 'stuff'.
 MyBuffer getStuff() { return stuff; }
};

Since the data now is encapsulated in MyBuffer (and the entire application uses MyBuffer to refer the the data), we can now implement a lot of optimizations that affect the data-flow of the application. For instance, we can implement copy-on-write semantics and reference counting; this would be extremely hard to do without using MyBuffer to encapsulate the data-flow, which is a system-wide concern.

Design lowering

The term instruction lowering is an optimization technique used in compilers to generate more efficient code. For example, the C code a = 16 * b; can be implemented (in pseudo-assembler) by a multiplying operation A = MULT(16, B) but also by a simpler and faster shift operation A = SHL(4, B).

With the term design lowering I mean a similar concept: reimplementing the design in a simpler and faster language or platform, or using faster language constructs or data structures. Examples of such design lowering include:

rewrite Python code in C,
replace reflection-heavy Java code with simpler non-reflection code,
using a simple array instead of a hash map or a linked list,
using the memory layout of C structs as data format instead of XML in files, over network, etc
use hardware acceleration like graphics cards.

Design lowering can easily increase the performance 10-100 times or even much more. Lets take an example of this.

A story of left and right

Imagine two small program left and right; left generates data and sends it to right, which sums the received data. Here is the first version of this pair of programs implemented in Python:

--- left.py ---
a = { "first":[1, 123], "middle":[1, 2, 456], "last":[1, 2, 3, 789] }
for i in xrange(0, 1000000):
 print a

--- right.py ---
import sys
total = 0
for line in sys.stdin:
 a = eval(line)
 total = total + sum(a["first"]) + sum(a["middle"]) + sum(a["last"])
 print total

Translating this Python code into C++ idioms this becomes:

--- left.cc ---
struct Data {
 Data() {
   first[0] = 1; first[1] = 123;
   middle[0] = 1; middle[1] = 2; middle[2] = 456;
   last[0] = 1; last[1] = 2; last[2] = 3; last[3] = 789;
 }
 int first[2];
 int middle[3];
 int last[4];
};

int main() {
 Data data;
 for (size_t a = 0; a < 1000000; a++) {
   for (size_t b = 0; b < sizeof(Data); b++)
     putchar(reinterpret_cast<char*>(&data)[b]);
 }
}

--- right.cc ---
struct Data {
 int first[2];
 int middle[3];
 int last[4];
};

int main() {
 Data data;
 long total = 0;
 for (size_t a = 0; a < 1000000; a++) {
   for (size_t b = 0; b < sizeof(Data); b++)
     reinterpret_cast<char*>(&data)[b] = getchar();
   total += data.first[0] + data.first[1] + data.middle[0] + data.middle[1] +
   data.middle[2] + data.last[0] + data.last[1] + data.last[2] + data.last[3];
 }
 printf("%ld\n", total);
}

And running them we get:

$ time ./left.py | ./right.py
1378000000

real 1m3.284s
user 1m14.057s
sys 0m0.256s
$ g++ -O2 right.cc -o right && g++ -O2 left.cc -o left && time ./left | ./right
1378000000

real 0m0.763s
user 0m1.292s
sys 0m0.088s
$ echo 74.067/1.292 | bc
57

Thus, the C++ version of these little programs are 57 times faster than the equivalent Python code. But this is not the neat thing... Let's get back to discussing design lowering.

Where to do design lowering

The neat thing is that you can implement left and right in Python without worrying about performance because they are easily design lowered to C++. I can say this because left and right comes in pair; if you redesign one side you also redesign the other. In other words, the details of the implementations are free to change in ways to make them execute more efficient. Thus, it's easy to do design lowering.

However, this would not be the case if left sent data to an unknown part via a predefined protocol. If this was so, then reimplementing left in C++ would probably be considerably harder. In other words, design lowering left would be less feasible. It would still be possible to do, though.

In summary, design lowering can be used as a plan for optimizability when the design can vary in ways that make it feasible to implement faster in another language (or data structure, etc). How the design can vary depends on many things, though, thus design lowering is not always easy to do without first refactoring the design.

As before, design lowering is not something you get for free, though: you need to ha good understanding of how/where the application will be used. When you understand your application, you're more likely to make correct assumptions of what parts of that can be design lowered and which parts that can't.

Conclusions

Just like optimizing, designing for optimizability requires us to understand the problem. We cannot escape it: to do something good, we need to know what "good" is, and to know what "good" is we need to understand the domain. Beware of over-analyzing though: concentrate on system-wide concerns that are likely to affect performance, and know how the design can vary to enable design lowering. Also, make sure that design seams don't hinder optimizability.

There are still much more to discuss on this topic, so stay tuned!

Programmatically Speaking

Thursday, June 17, 2010

The only design pattern is small solutions

Monday, June 14, 2010

If it hurts do it more often