Monday, February 19, 2007

Optimization is your enemy

Many years ago, as I indicated in the introduction, I worked on a large (16-processor) multiprocessor system. We were using custom-modified PDP-11 minicomputers, which were relatively slow. We were programming them in Bliss-11, which as far as I've been able to tell still holds the record for the world's tightest optimizing compiler (although I've seen some quite impressive optimizations in Microsoft C/C++). After doing some performance measurement, we determined that the paging algorithm was the outstanding performance bottleneck. So our first assumption was that the paging algorithm was faulty. We examined the code, and the paging algorithm's maintainer rewrote it, taking our performance data into account, and had a new, and much faster, page management algorithm installed within a week.

Meanwhile, up at MIT, MULTICS was still running. They traced a serious performance problem to the paging system. Because the paging code was written in EPL, a PL/I-like high-level language, the assumption was that the compiled code was suboptimal, so they launched an effort to rewrite the page management algorithm in assembly code. A year later, the code was working and was put into the production system. Performance dropped 5%. Upon inspection, it was found that the fundamental algorithm was at fault. They took the EPL code, rewrote the algorithm, and had the improved algorithm working and installed in a few weeks. The lesson: don't optimize something that is not the problem. Understand the problem first. Then, and only then, do the optimization. Otherwise, the optimization is a waste of time and may even make the performance worse.

In the Bliss compiler, the 'register' attribute on a variable said to the compiler "You will assign this variable to a register". In C, it means "I'd like you to assign this variable to a register". Many programmers decided that they should put certain variables in registers to get better code. But the Bliss compiler was very good; it had a very sophisticated register allocation system, and in the absence of direction from the programmer it felt free to assign a variable to a register if that produced the best code. Explicitly assigning a variable to a register meant that the register was not available for common subexpressions, particularly those subexpressions implicit in data structure access. After a fair number of experiments, we determined that adding 'register' attributes to declarations almost invariably produced significantly worse code than letting the compiler do the assignment. For inner-loop code, many hours of effort would usually result in only a small improvement in performance, and it was generally understood by the programmers that unless you studied the generated machine code and did a number of calibrated experiments, any attempt at optimization would make the code worse.
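
To make the point concrete, here is a hedged C sketch of the effect (the function and names are my own, not from any of the original code): a hand-placed 'register' can tie up a register the compiler would rather spend on the implicit subexpressions in the subscript arithmetic.

    #define N 256   /* hypothetical row length, for illustration only */

    /* In C, 'register' is only a request, and forcing it by hand can
       starve the compiler of registers it needs for implicit
       subexpressions such as the address computation in a[i][j]. */
    void scale_row(double a[][N], int i, int n, double f)
    {
        register int j;          /* a request, not a command */
        for (j = 0; j < n; j++)
            a[i][j] *= f;        /* &a[i][j] is the hidden subexpression the
                                    compiler also wants a register for */
    }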

If you've heard of SPECmark performance ratings, you may also be aware of how they are gamed. In particular, IBM wrote a program that took a FORTRAN program as input, such as the SPEC matrix-multiply benchmark, and produced another FORTRAN program as output, one which was optimized for the cache architecture of the machine it would run on. A small number of parameters described the cache strategy of each model in the RS/6000 product line. The original FORTRAN program, already "optimized" by the compiler, performed on one machine at 45 SPECmarks. After being transformed, it performed on the same machine at over 900 SPECmarks. That's a factor-of-20 performance improvement based solely on fourth-order effects: cache line hits. If you're doing image processing, particularly of large images, an awareness of cache issues (even relatively machine-independent ones) can buy you an order of magnitude in performance.
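
The transformation being described is essentially what is now called loop blocking, or tiling. A minimal sketch in C, under my own assumptions (the function name and the tile size, which stands in for the per-model cache parameters, are hypothetical):

    #define BLOCK 64   /* hypothetical tile size, tuned to the target cache */

    /* Blocked matrix multiply: c += a * b, all n x n, row-major.
       Each BLOCK x BLOCK tile is reused while it is still resident in
       the cache, instead of streaming whole rows and columns through it. */
    void matmul_blocked(int n, const double *a, const double *b, double *c)
    {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK)
                    for (int i = ii; i < ii + BLOCK && i < n; i++)
                        for (int k = kk; k < kk + BLOCK && k < n; k++) {
                            double aik = a[i * n + k];
                            for (int j = jj; j < jj + BLOCK && j < n; j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
    }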

A naive approach, optimizing at the code-line level, is not nearly as effective as higher-order optimizations. Paging optimizations, cache-line optimizations, and memory allocation optimizations can often have vastly more significant effects than code-line optimization. Algorithmic optimizations are the next best bet, particularly if your problem is not amenable to paging/cache optimizations. Only after all of these have been done, and, of course, you have measured everything in sight, does it make sense to start doing line-level optimizations. And if your problem domain demands it, it can even make sense to recode the inner loops, particularly of algorithms such as convolution and other DSP algorithms, in assembly code, most often to take advantage of special instruction sets such as MMX and the streaming-media extensions.
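
As an illustration of that last point, here is a hedged sketch of such an inner loop using SSE intrinsics (the streaming successor to MMX); the function name and the alignment and length assumptions are mine, not anything from the original text:

    #include <xmmintrin.h>   /* SSE intrinsics; assumes an x86 target */

    /* Multiply-accumulate over two float arrays, four lanes at a time.
       Assumes n is a multiple of 4 and both arrays are 16-byte aligned. */
    float dot_sse(const float *a, const float *b, int n)
    {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);
            __m128 vb = _mm_load_ps(b + i);
            acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
        }
        float tmp[4];
        _mm_storeu_ps(tmp, acc);
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }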

Perhaps the best example of pure programmer stupidity in "optimizing" code occurred when I was porting a large library we used for our research project. Think of it as a 16-bit-to-32-bit port (it was actually an 18-bit-to-36-bit port, and the language wasn't C, but the details don't matter--you can write ghastly code in any language, and I've seen C programmers do things just as badly). The port mostly worked, but we had a really strange problem that showed up only under rare conditions, and which crashed the program using the library. I started looking and found that the heap was damaged; it was being damaged via a bad pointer that allowed a store into a random place in the heap. OK, how did that pointer get bad? Four levels of storage damage down, and after 12 solid hours of debugging, I found the real culprit. But why did it fail? Another 5 hours, and I found that the programmer who wrote the constructor code for the data structure had a struct-equivalent something like {char * p1; char * p2;} where the pointers had been 16-bit, and we now used 32-bit pointers. In the initialization code, instead of seeing something like something->p1 = NULL; something->p2 = NULL;, I found the moral equivalent of (*(DWORD*)&something.p1) = 0! When I confronted the programmer, he justified it by explaining that he was able to zero out two pointers with only a single doubleword store instruction (it wasn't an x86 computer, but a mainframe), and wasn't that a clever optimization? Of course, when the pointers became 32-bit pointers, this optimization zeroed only one of the two pointers, leaving the other either NULL (most of the time) or, occasionally, pointing into the heap in an eventually destructive manner. I pointed out that this optimization happened once, at object creation time; that the average application using our library created perhaps six of these objects; that according to the CPU accounting from the day before, I'd spent not only 17 hours of my own time but 6 hours of CPU time; and that if we fixed the bug and ran the formerly-failing program continuously, starting it up again the instant it finished, for fourteen years, the time saved by his clever hack would just about break even with the CPU time required to find and fix this piece of gratuitous nonsense. Several years later he was still doing tricks like this; some people never learn.
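
For the curious, here is a rough C reconstruction of the hack (the struct and the names are hypothetical; the original wasn't C, and the original pointers were 16-bit):

    #include <stddef.h>
    #include <stdint.h>

    struct node {
        char *p1;   /* was a 16-bit pointer before the port */
        char *p2;   /* was a 16-bit pointer before the port */
    };

    /* The "clever" constructor: one doubleword store clears both pointers,
       which works only as long as two pointers fit in one doubleword. */
    void init_clever(struct node *s)
    {
        *(uint32_t *)&s->p1 = 0;   /* after the port this zeroes p1 only;
                                      p2 is left holding whatever was there */
    }

    /* What the constructor should have said all along. */
    void init_plain(struct node *s)
    {
        s->p1 = NULL;
        s->p2 = NULL;
    }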

-----------------------
Optimization: Your Worst Enemy
By Joseph M. Newcomer.
