Chapter 18. Languages: Optimization

A compiler takes a program written in a high-level language and generates machine code corresponding to it. Serious compilers will work very hard to efficient code; the process of doing this is called code optimization. Often the optimization corresponds closely with the program's code, but sometimes it results in radically different changes.

In this section, we'll look at how code optimizers work by examining some of their techniques. It's useful to know what a compiler will do to your program, so that you know which transformations you generally should not apply to your programs, since they obscure your program and the compiler will do it automatically anyway.

18.1. Register allocation

One of the more important optimizations for any compiler is deciding how to use its registers most effectively. Memory accesses can take several clock cycles. If a variable is used in a long loop, placing it in a register instead of in memory can speed up a program by a factor of 10! Thus, all serious compilers have some strategies for determining where to place data. This is particularly important for machines that have very few registers, like the Intel processor, because the situation where unnecessary data goes into memory is much more likely.

We won't discuss how register allocation works here. The techniques for allocation are somewhat complex, but they don't result in surprising assembly code. Nonetheless, register allocation is one of the most essential optimizations that a compiler can perform.

18.2. Peephole optimization

The phrase peephole optimization refers to optimization techniques that require looking at only small pieces of code. The simplest type is a simple rule of thumb for transforming a short sequence of statements to an equivalent, more efficient alternative. For example, an instruction to multiply by four becomes an instruction to shift left two bits. Or a request to take x to the second power becomes a multiplication of x with itself.

The number of such tiny tricks that a strong compiler such as gcc will perform is quite large. For example, given the instruction to place 5 ⋅ R1 into R0, a compiler might easily generate the following instruction.

ADD R0, R1, R1, LSL #2

This actually computes R1 plus the result of left-shifting R1 two places. The left-shift effectively multiplies by four, so this computes R1 + 4 ⋅ R1, which is 5 ⋅ R1.

Some of the peephole optimizations seem almost magical. Told to assign R5 to be R3 / 10, where R3 is an unsigned number, a compiler might generate the following instead:

TEQ R0, R0      ; set flags so executing 0x1999999A later has no result;
LDR R0, [PC]    ; it is just loaded into R0
DCD 0x1999999A  ;   (0x1999999A is LDMNEIB R9, {...})
UMULL R1, R5, R4, R0

Notice that this involves no division! Integer division is very slow on most processors: On a Pentium III, an integer division takes 36 clock cycles, while integer multiplication takes only four cycles, and most simple operations take only one. The total time for these three instructions is six clock cycles, much better than the 36 it would take for a raw integer division. On an ARM processor, the penalty is even worse: The processor doesn't include integer division, so any general-purpose division code must be done via subroutine call. Thus, gcc opts for this alternative, even though the code's relationship to the original is unclear.

How does this work? It finds the product of n and a magic number (1999999A(16)), and then it right-shifts the top 32 bits of the product by one bit. The magic number happens to be 232 / 10, so that when we multiply it by n, we get (232 / 10) n. The upper 32 bits of this product will hold n / 10, and the code places these upper 32 bits into R5.

18.3. Removing jumps

Typically, a large portion — perhaps 10% — of a program's instructions are jumps or branches. Having very many such instructions in the program is irritating, because they don't actually produce information. Thus, avoiding jumps can be a useful way of saving time.

For an example where the compiler might prune a jump, consider the following line of C code.

if(i < 0) i = 0;

A straightforward translation into Intel assembly is the following.

do_if CMP R4, #0 ; compare i (R4) with 0
      BLT yes
      B no
yes   MOV R4, #0
no    ; code after loop

Having a branch instruction simply to skip a jump instruction, however, is a waste of an instruction. It's more effective to branch in the opposite case. Thus, a good compiler will generate the following instead.

do_if CMP R4, #0 ; compare i (R4) with 0
      BGE no
      MOV R4, #0
no    ; code after loop

The compiler has negated the condition: If i is not less than 0, it skips the body of the if statement.

(Incidentally, a compiler with a clever peephole optimizer could replace the above with a single instruction, BIC R4, R4, ASR #31.)

Similarly, good compilers will radically transmute while loops, so that the test of whether to continue comes at their end. Consider this while loop.

while(i < ni *= 2;

You might reasonably expect something like the following from the compiler.

do_while CMP R4, R5       ; compare i (R4) and n (R5)
again    BGE done
         ADD R4, R4, R4   ; i *= 2;
         B again
done     ; code after loop goes here

What it will actually generate, however is the following.

do_while CMP R4, R5       ; compare i (R4) and n (R5)
         BGE done
again    ADD R4, R4, R4   ; i *= 2
         CMP R4, R5       ; compare i (R4) and n (R5)
         BLT again
done     ; code after loop goes here

This corresponds more exactly to the following C code.

if(i < n) {
    do {
        i *= 2;
    } while(i < n);
}

By moving the test of the condition to the end of the loop, the program saves the unconditional jump. This saves an instruction per iteration of the loop: The intuitive translation takes four instructions per iteration, whereas the transmuted translation requires only three.

18.4. Common subexpression elimination

Another way to save instructions within a loop is to lift some instructions out of a loop. Often a program may have some operations inside a loop that really don't require recomputation.

for(i = 0; i < 10; i++) a[2 * n + i] = i;

Naively, we'd expect the computer to re-compute 2 n with each iteration through the loop. However, a compiler can identify that this is a waste of time, and it ends up with the following code.

do_for  MOV R4, #0     ; i = 0;
        ADD R0, R5, R6, LSL #3    ; R0 = address of a[2 * n]
again   STR R4, [R0, R4, LSL #2]  ; store i into R0[i]
        ADD R0, R0, #1
        CMP R0, #10
        BLT again

This translation manages to compute the location of array entry a[2 * n] before the loop, and then it uses that value with each iteration.

18.5. Strength reduction

A similar optimization avoids multiplication by a loop index. Consider the following C fragment.

for(i = 0; i < ni++) a[7 * i] = i;

For this code, a compiler might produce the following.

do_for  MOV R4, #0   ; i = 0;
        MOV R1, R5   ; load address of a[0] into R1
again   STR R4, [R1], #28
        ADD R4, R4, #1
        CMP R4, R6   ; compile i (R4) and n (R6)
        BLT again

Where did the 28 in the STR instruction come from? The compiler has recognized that i begins at 0 and increments each time it goes through the loop. So, it reasons, the array index (7 * i) begins at 0 and goes up by 7 each time it goes through the loop. The byte offset within the array for this index will begin at 0 and increase by 28 each time (since each array element takes four bytes). This optimized code uses R1 to track the byte offset as it goes through the loop. It first sets up R1 to be the beginning of the array, since this is where the first value of i is stored. But as it stores, it also steps forward 28 bytes to get ready for the next store.

By doing this, the compiler has replaced a multiplication inside the loop with an addition. Since multiplication typically takes a few clock cycles, while addition typically takes only one cycle, this is a major improvement.

18.6. Loop unrolling

Another way of optimizing loops is to avoid them altogether.

for(i = 0; i < 5; i++) a[i] = 0;

Rather than create a loop, a more efficient way is the following.

MOV R0, #0
MOV R1, R4    ; place address of a[0] into R1
STR R0, [R1], #4
STR R0, [R1], #4
STR R0, [R1], #4
STR R0, [R1], #4
STR R0, [R1], #4

Incidentally, gcc doesn't pursue this optimization unless explicitly told to do so, because it usually lead to code that is longer and thus doesn't interact with cache as well. However, you can give gcc the option -funroll-loops, and it will pursue the optimization.

The same principle doesn't apply directly to very long loops, since it's not a good use of memory to have lots of instructions.

for(i = 0; i < 1200; i++) a[i] = i;

We wouldn't want this to transform to a sequence of 1200 instructions. But the compiler will alter the loop so that the body of several iterations of the original loop can be accomplished in a single iteration of longer loop. The assembly code generated for the above might translate to code that more closely corresponds to this instead:

for(i = 0; i < 1200; i += 15) {
    a[i] = i;           a[i +  1] = i +  1; a[i +  2] = i +  2;
    a[i +  3] = i +  3; a[i +  4] = i +  4; a[i +  5] = i +  5;
    a[i +  6] = i +  6; a[i +  7] = i +  7; a[i +  8] = i +  8;
    a[i +  9] = i +  9; a[i + 10] = i + 10; a[i + 11] = i + 11;
    a[i + 12] = i + 12; a[i + 13] = i + 13; a[i + 14] = i + 14;
}

18.7. Function inlining

Function calls have some inherent overhead associated with them. To call a function, the computer will push the arguments onto the stack, then jump into the subroutine, then execute the subroutine entry template. And, after the subroutine is done, it executes the subroutine exit template before jumping out.

For short functions, a good optimizing compiler can incorporate it into the overall program. Consider the following.

int sqr(int n) {
    return n * n;
}

int dist(int xint y) {
    return sqr(x) + sqr(y);
}

A compiler might produce the following code for dist().

dist    MUL R0, R0, R0
        MUL R1, R1, R1
        ADD R0, R0, R1
        MOV PC, LR

You'll notice that, even though dist() calls the sqr() function, there are no calls to sqr() in its compiled form. What has happened is that the compiler opted to inline the function body into its translation for dist.

18.8. Tail recursion

The final transformation that many compilers will perform is tail recursion elimination. This handles a special case in recursive functions, where the return value of the recursive call is returned directly to the calling function.

int gcd(int aint b) {
    if(b == 0)      return a;
    else if(a == 0) return b;
    else if(a > b)  return gcd(a - bb);
    else            return gcd(ab - a);
}

In this case, the compiler can safely replace replace the recursive call with a jump back to the beginning of the function. Replacing the recursive call with a jump saves the overhead of saving registers onto the stack, which can represent a significant savings.

When requested to compile this function, a compiler could generate the following code.

gcd     TEQ R1, #0        ; return a if b == 0
        BEQ gcd_a
        TEQ R0, #0        ; return b if a == 0
        BEQ gcd_b
        CMP R0, R1        ; see whether a < b
        SUBLT R0, R0, R1  ; reduce R0 if so
        SUBGE R1, R1, R0  ; reduce R1 if not
        B gcd
gcd_b   MOV R0, R1
gcd_a   MOV PC, LR

Notice how this doesn't use subroutine calls; that's the benefit of tail recursion.

Although this seems like a very special case to handle, it is common enough and represents a significant enough savings to be a worthwhile optimization for the compiler to perform.

18.9. Analysis

Compilers meant for industrial use, like gcc, place a great emphasis on optimization. They aggressively manipulate the given source code into the best equivalent assembly code that they can. This manipulation is often so extensive that the final code bears little resemblance to the original version.

A good optimizer for a compiler can be as important as a good processor. In fact, many of the best compilers are given away free by manufacturers of specialized high-performance computers. By giving away an excellent compiler with the computer, users perceive that the computer is faster than it really is. Conversely, if the best compiler available performs little optimization, it can destroy the reputation of a system. In the rush to distribute Java, for example, Sun released a compiler that skimped on optimization; even though Sun has improved the compiler since, Java is still today plagued by an exaggerated reputation of inefficiency.

Optimizers don't take the programmer out of the business of optimization altogether, though. Sometimes, some optimization would be useful, but the optimizer can't accomplish it without compromising the correspondence of the machine language output to the input code. For example, suppose we have the following code, where x, y, and z are all floating-point numbers.

y = x * z - y * z;
This says to perform two floating-point multiplications. It would be nice if the optimizer could translate it to the following.

y = (x - y) * z;

This has only one multiplication, so it would be substantially faster than before. Unfortunately, the two aren't equivalent, because the distributive law doesn't apply to floating-point arithmetic. The programmer might know that it is close enough, but the compiler can't safely assume it.

A more dramatic example is the following: Suppose the programmer writes the following.

y = f(x) * f(x);

Here, the program says to call f() twice, passing the same argument each time. This second function call with the same argument seems like wasted time, especially if f() takes a lot of time. It would be nice if the compiler could do the following instead.

y = f(x);
y = y * y;

Unfortunately, the compiler can't perform this translation, because f() may have some side effects — it may use a global variable, or it may print something for the user to see. Again, we as programmers may know that there are no side effects, but the compiler cannot assume it. Thus, as a programmer, we should perform such a translation ourself, rather than rely on the compiler to do it for us.

In most cases, though, compilers do a good enough job that worrying about writing efficient code isn't important, since the compiler will much of what you write anyway. Because of the tremendous expense of debugging time, it's better to place the top priority on writing clean, clear code. Only if you observe there is an efficiency problem should you expend effort in sacrificing clarity for efficiency; and even then, you should try to sacrifice as little clarity as possible to achieve the efficiency required.