Chapter 18. Languages: Optimization
A compiler takes a program written in a high-level language and generates machine code corresponding to it. Serious compilers will work very hard to efficient code; the process of doing this is called code optimization. Often the optimization corresponds closely with the program's code, but sometimes it results in radically different changes.
In this section, we'll look at how code optimizers work by examining some of their techniques. It's useful to know what a compiler will do to your program, so that you know which transformations you generally should not apply to your programs, since they obscure your program and the compiler will do it automatically anyway.
18.1. Register allocation
One of the more important optimizations for any compiler is deciding how to use its registers most effectively. Memory accesses can take several clock cycles. If a variable is used in a long loop, placing it in a register instead of in memory can speed up a program by a factor of 10! Thus, all serious compilers have some strategies for determining where to place data. This is particularly important for machines that have very few registers, like the Intel processor, because the situation where unnecessary data goes into memory is much more likely.
We won't discuss how register allocation works here. The techniques for allocation are somewhat complex, but they don't result in surprising assembly code. Nonetheless, register allocation is one of the most essential optimizations that a compiler can perform.
18.2. Peephole optimization
The phrase peephole optimization refers to optimization techniques that require looking at only small pieces of code. The simplest type is a simple rule of thumb for transforming a short sequence of statements to an equivalent, more efficient alternative. For example, an instruction to multiply by four becomes an instruction to shift left two bits. Or a request to take x to the second power becomes a multiplication of x with itself.
The number of such tiny tricks that a strong compiler such as
gcc will perform is
quite large. For example, given the instruction to place
5 ⋅ R1 into R0,
a compiler might easily generate the following instruction.
ADD R0, R1, R1, LSL #2
This actually computes R1 plus the result of left-shifting
R1 two places. The left-shift effectively multiplies by four,
so this computes R1 + 4 ⋅ R1,
which is 5 ⋅ R1.
Some of the peephole optimizations seem almost magical.
Told to assign R5 to be R3 / 10,
where R3 is an unsigned number, a compiler might generate
the following instead:
TEQ R0, R0 ; set flags so executing 0x1999999A later has no result;
LDR R0, [PC] ; it is just loaded into R0
DCD 0x1999999A ; (0x1999999A is LDMNEIB R9, {...})
UMULL R1, R5, R4, R0
Notice that this involves no division! Integer division is very slow on most processors: On a Pentium III, an integer division takes 36 clock cycles, while integer multiplication takes only four cycles, and most simple operations take only one. The total time for these three instructions is six clock cycles, much better than the 36 it would take for a raw integer division. On an ARM processor, the penalty is even worse: The processor doesn't include integer division, so any general-purpose division code must be done via subroutine call. Thus, gcc opts for this alternative, even though the code's relationship to the original is unclear.
How does this work?
It finds the product of n and a magic number
(1999999A(16)),
and then it right-shifts the top 32 bits of the product by one bit.
The magic number happens to be
232 / 10, so that when we multiply
it by n, we get
(232 / 10) n.
The upper 32 bits of this product
will hold n / 10, and the code places these upper
32 bits into R5.
18.3. Removing jumps
Typically, a large portion — perhaps 10% — of a program's instructions are jumps or branches. Having very many such instructions in the program is irritating, because they don't actually produce information. Thus, avoiding jumps can be a useful way of saving time.
For an example where the compiler might prune a jump, consider the following line of C code.
if(i < 0) i = 0;
A straightforward translation into Intel assembly is the following.
do_if CMP R4, #0 ; compare i (R4) with 0
BLT yes
B no
yes MOV R4, #0
no ; code after loop
Having a branch instruction simply to skip a jump instruction, however, is a waste of an instruction. It's more effective to branch in the opposite case. Thus, a good compiler will generate the following instead.
do_if CMP R4, #0 ; compare i (R4) with 0
BGE no
MOV R4, #0
no ; code after loop
The compiler has negated the condition: If i is not less
than 0, it skips the body of the if statement.
(Incidentally, a compiler with a clever peephole optimizer
could replace the above with a single instruction,
BIC R4, R4, ASR #31.)
Similarly, good compilers will radically transmute while loops,
so that the test of whether to continue comes at their end.
Consider this while loop.
while(i < n) i *= 2;
You might reasonably expect something like the following from the compiler.
do_while CMP R4, R5 ; compare i (R4) and n (R5)
again BGE done
ADD R4, R4, R4 ; i *= 2;
B again
done ; code after loop goes here
What it will actually generate, however is the following.
do_while CMP R4, R5 ; compare i (R4) and n (R5)
BGE done
again ADD R4, R4, R4 ; i *= 2
CMP R4, R5 ; compare i (R4) and n (R5)
BLT again
done ; code after loop goes here
This corresponds more exactly to the following C code.
if(i < n) {
do {
i *= 2;
} while(i < n);
}
By moving the test of the condition to the end of the loop, the program saves the unconditional jump. This saves an instruction per iteration of the loop: The intuitive translation takes four instructions per iteration, whereas the transmuted translation requires only three.
18.4. Common subexpression elimination
Another way to save instructions within a loop is to lift
some
instructions out of a loop. Often a program may have some operations
inside a loop that really don't require recomputation.
for(i = 0; i < 10; i++) a[2 * n + i] = i;
Naively, we'd expect the computer to re-compute 2 n with each iteration through the loop. However, a compiler can identify that this is a waste of time, and it ends up with the following code.
do_for MOV R4, #0 ; i = 0;
ADD R0, R5, R6, LSL #3 ; R0 = address of a[2 * n]
again STR R4, [R0, R4, LSL #2] ; store i into R0[i]
ADD R0, R0, #1
CMP R0, #10
BLT again
This translation manages to compute the location of
array entry a[2 * n] before the loop, and then it uses that value
with each iteration.
18.5. Strength reduction
A similar optimization avoids multiplication by a loop index. Consider the following C fragment.
for(i = 0; i < n; i++) a[7 * i] = i;
For this code, a compiler might produce the following.
do_for MOV R4, #0 ; i = 0;
MOV R1, R5 ; load address of a[0] into R1
again STR R4, [R1], #28
ADD R4, R4, #1
CMP R4, R6 ; compile i (R4) and n (R6)
BLT again
Where did the 28 in the STR instruction come from?
The compiler has recognized that i begins at 0 and increments
each time it goes through the loop. So, it reasons, the array index
(7 * i)
begins at 0 and goes up by 7 each time it goes through the loop.
The byte offset within the array for this index will begin at 0 and increase by 28 each time
(since each array element takes four bytes).
This optimized code uses R1 to track the byte offset as it
goes through the loop. It first sets up R1 to be the beginning
of the array, since this is where the first value of i is stored.
But as it stores, it also steps forward 28 bytes to get ready for
the next store.
By doing this, the compiler has replaced a multiplication inside the loop with an addition. Since multiplication typically takes a few clock cycles, while addition typically takes only one cycle, this is a major improvement.
18.6. Loop unrolling
Another way of optimizing loops is to avoid them altogether.
for(i = 0; i < 5; i++) a[i] = 0;
Rather than create a loop, a more efficient way is the following.
MOV R0, #0
MOV R1, R4 ; place address of a[0] into R1
STR R0, [R1], #4
STR R0, [R1], #4
STR R0, [R1], #4
STR R0, [R1], #4
STR R0, [R1], #4
Incidentally, gcc doesn't pursue this optimization unless
explicitly told to do so, because it usually lead to code that is longer
and thus doesn't interact with cache as well. However, you can give
gcc the option -funroll-loops
, and it will
pursue the optimization.
The same principle doesn't apply directly to very long loops, since it's not a good use of memory to have lots of instructions.
for(i = 0; i < 1200; i++) a[i] = i;
We wouldn't want this to transform to a sequence of 1200 instructions. But the compiler will alter the loop so that the body of several iterations of the original loop can be accomplished in a single iteration of longer loop. The assembly code generated for the above might translate to code that more closely corresponds to this instead:
for(i = 0; i < 1200; i += 15) {
a[i] = i; a[i + 1] = i + 1; a[i + 2] = i + 2;
a[i + 3] = i + 3; a[i + 4] = i + 4; a[i + 5] = i + 5;
a[i + 6] = i + 6; a[i + 7] = i + 7; a[i + 8] = i + 8;
a[i + 9] = i + 9; a[i + 10] = i + 10; a[i + 11] = i + 11;
a[i + 12] = i + 12; a[i + 13] = i + 13; a[i + 14] = i + 14;
}
18.7. Function inlining
Function calls have some inherent overhead associated with them. To call a function, the computer will push the arguments onto the stack, then jump into the subroutine, then execute the subroutine entry template. And, after the subroutine is done, it executes the subroutine exit template before jumping out.
For short functions, a good optimizing compiler can incorporate it into the overall program. Consider the following.
int sqr(int n) {
return n * n;
}
int dist(int x, int y) {
return sqr(x) + sqr(y);
}
A compiler might produce the following code for
dist().
dist MUL R0, R0, R0
MUL R1, R1, R1
ADD R0, R0, R1
MOV PC, LR
You'll notice that, even though dist() calls
the sqr() function, there are no calls to sqr() in its
compiled form.
What has happened is that the compiler opted to inline the function body
into its translation for dist.
18.8. Tail recursion
The final transformation that many compilers will perform is tail recursion elimination. This handles a special case in recursive functions, where the return value of the recursive call is returned directly to the calling function.
int gcd(int a, int b) {
if(b == 0) return a;
else if(a == 0) return b;
else if(a > b) return gcd(a - b, b);
else return gcd(a, b - a);
}
In this case, the compiler can safely replace replace the recursive call with a jump back to the beginning of the function. Replacing the recursive call with a jump saves the overhead of saving registers onto the stack, which can represent a significant savings.
When requested to compile this function, a compiler could generate the following code.
gcd TEQ R1, #0 ; return a if b == 0
BEQ gcd_a
TEQ R0, #0 ; return b if a == 0
BEQ gcd_b
CMP R0, R1 ; see whether a < b
SUBLT R0, R0, R1 ; reduce R0 if so
SUBGE R1, R1, R0 ; reduce R1 if not
B gcd
gcd_b MOV R0, R1
gcd_a MOV PC, LR
Notice how this doesn't use subroutine calls; that's the benefit of tail recursion.
Although this seems like a very special case to handle, it is common enough and represents a significant enough savings to be a worthwhile optimization for the compiler to perform.
18.9. Analysis
Compilers meant for industrial use, like gcc, place a great emphasis on optimization. They aggressively manipulate the given source code into the best equivalent assembly code that they can. This manipulation is often so extensive that the final code bears little resemblance to the original version.
A good optimizer for a compiler can be as important as a good processor. In fact, many of the best compilers are given away free by manufacturers of specialized high-performance computers. By giving away an excellent compiler with the computer, users perceive that the computer is faster than it really is. Conversely, if the best compiler available performs little optimization, it can destroy the reputation of a system. In the rush to distribute Java, for example, Sun released a compiler that skimped on optimization; even though Sun has improved the compiler since, Java is still today plagued by an exaggerated reputation of inefficiency.
Optimizers don't take the programmer out of the business of optimization
altogether, though. Sometimes, some optimization would be useful, but
the optimizer can't accomplish it without compromising the
correspondence of the machine language output to the input code.
For example, suppose we have the following code, where x,
y, and z are all floating-point numbers.
y = x * z - y * z;
This says to perform two floating-point multiplications. It would be
nice if the optimizer could translate it to the following.
y = (x - y) * z;
This has only one multiplication, so it would be substantially faster than before. Unfortunately, the two aren't equivalent, because the distributive law doesn't apply to floating-point arithmetic. The programmer might know that it is close enough, but the compiler can't safely assume it.
A more dramatic example is the following: Suppose the programmer writes the following.
y = f(x) * f(x);
Here, the program says to call f() twice, passing the same
argument each time.
This second function call with the same argument seems like wasted time,
especially if f() takes a lot of time.
It would be nice if the compiler could do the following
instead.
y = f(x);
y = y * y;
Unfortunately, the compiler can't perform this translation,
because f()
may have some side effects — it may use a global variable, or it may
print something for the user to see.
Again, we as programmers may know that there are no side effects, but
the compiler cannot assume it. Thus, as a programmer, we should perform such a
translation ourself, rather than rely on the compiler to do it
for us.
In most cases, though, compilers do a good enough job that worrying about writing efficient code isn't important, since the compiler will much of what you write anyway. Because of the tremendous expense of debugging time, it's better to place the top priority on writing clean, clear code. Only if you observe there is an efficiency problem should you expend effort in sacrificing clarity for efficiency; and even then, you should try to sacrifice as little clarity as possible to achieve the efficiency required.
