Friday, April 29, 2011

The problem with code coverage and why static is good

How can  you automatically determine how well a test-case tests some piece of code? What metric should you use? What instrument should you use to measure it? As I'm sure you know, these questions are not easy to answer. Measuring the quality of testing is not a simple thing to do.

Normal code coverage tools are basically worthless for measuring the quality of the test-cases. All they measure is if a line of code has been executed or not; not if the test case verified the result produced by that line. Tools like Jumble are much better, but much harder to use.

However, code coverage is actually a much better  measurement of test-case quality if you language is very static, declarative, and the code hard to extend. Language like that may sound not sound very useful, but they are. They really are. There are circumstances when you really need their static, declarative and precise semantics. When you design hardware, or software for running on resource limited hardware, then you really need precise semantics with regards to the combinatoric depth, stack usage, the position of code in memory, etc. Higher level language, by their very definition, do not make you worry about those "petty details", which is good unless the "petty details" are what makes you application actually function as it should.

High level languages come with a cost, though. The more dynamic a language gets, the less you know what actually will happen when you run the program. Thus, a test-case exercising the code is further from how the code actually will be run. In fact the more dynamic a language gets, the less you can tell what any piece of code written in that language does; it all depends on how it is run. For example, what does this python function do?

def magic(string, filter):
   return [c.swapcase() for c in string if c in filter]
You cannot say for sure, because it all depends on the types of 'string' and 'filter'. Furthermore, when you see the following code:

print magic("Hello, world", "aeiou")
You cannot say for sure what it means either (even though you know the types of the arguments), because 'magic' might have been redefined to do something entirely different.

Consequently, the value of the test-case is much smaller in dynamic languages; thus, more test-cases are needed. This is nothing new, but it is worth considering before saying "I'm so much more productive in Language X than in Language Y". You didn't forget testing did you?

No comments: