Chapter 14. Caches
Modern computing systems use memory very heavily, and so memory
access time is very important. In this chapter, we investigate how
computers reduce memory access time using caches.
14.1. Memory hierarchy
Today's computing systems includes three important technologies for
remembering information.
disks, which use magnetic material that can hold magnetic charge
without a power source. Although some disks are portable, like the old
portable disks, in this book we'll use the word disk to refer
to a computer's hard drive.
RAM (for Random Access Memory) uses electrical capacitors for
holding data. More properly, this technique of storing memory is called
DRAM (Dynamic RAM). A capacitor is two nearby plates; it can hold an
electronic charge for a brief instant. Storing a bit long-term requires
the ability to recharge each depleted capacitor many times a second. Of
course, you only want to recharge the capacitors that held values
previously.
[Manufacturers often distinguish between varieties
of DRAM. The S in SDRAM, for example, stands for
synchronous.]
SRAM (Static RAM; RAM stands for Random Access Memory)
is built using transistors, arranged in a way that is similar to that of
an SR latch.
To get a feeling for the relative uses of these technologies, we can
look at a table.
| access time | cost per MB |
| SRAM | 2 | nanoseconds | $100.00 |
| DRAM | 60 | nanoseconds | $1.00 |
| disk | 8 | milliseconds | $0.04 |
(All numbers are from 2000)
In comparison, a CPU will try to complete an instruction every
clock cycle; if we have a 2GHz CPU, that means there are 2 billion
clock cycles each second, which means 0.5 ns (nanoseconds) per
clock cycle.
Thus, SRAM can nearly keep up, while DRAM is significantly slower, and
the disk is slothlike. But the price of SRAM is so huge that we cannot
place much of it in a computer.
To talk about computing systems intelligently, you need to know your
scientific units! Below is a table of the most important units.
| milli- | (m) | one thousandth |
|
kilo- | (K) | one thousand |
| micro- | (μ) | one millionth |
|
mega- | (M) | one million |
| nano- | (n) | one billionth |
|
giga- | (G) | one billion |
|
tera- | (T) | one trillion |
More extreme measures occassionally arise. It's not as important that
you remember them, but several are listed in the below table.
The larger units come from mutating penta-,
hexa-, septa-, and octa-.
| pico- | (p) | 10−12 (one trillionth) |
| femto- | (f) | 10−15 |
|
peta- | (P) | 1015 |
|
exa- | (E) | 1018 |
|
zetta- | (Z) | 1021 |
|
yotta- | (Y) | 1024 |
Over the last several decades, CPU speed has improved dramatically.
Memory speed has also improved, but not as quickly.
Thus, memory has become an important bottleneck, to the extent that it
is the major factor limiting the performance of many programs.
Computing systems improve memory performance through the use of
cache, where we use SRAM to hold the data that is
likely to be used often in the near future. This uses the fact that
programs often have a strong sense of locality.
The principle of locality states that if a
program accesses a particular memory location, it is likely to access a
nearby memory location in the near future.
This has been tested numerous times, and invariably programs display
this property in abundance. We just don't see programs accessing random
memory locations often.
Most modern computers have at least four tiers for
holding memory.
Level 1 (L1) cache, implemented with SRAM, is a small cache built
onto the CPU chip. A typical L1 cache size is 32KB.
Level 2 (L2) cache, also implemented with SRAM, is also on the
CPU chip. It can include more memory, perhaps 256KB.
However, it is not placed on the CPU chip quite as close to the central
processor, and it fetching memory requires traveling through more gates
because it is larger. Moreover, chips including multiple cores (that is,
multiple processing units), a single L2 cache is
typically shared by all cores. All this leads the L2 cache to be
somewhat slower than the L1 cache.
RAM, implemented with DRAM, located on chips outside the
processor. A typical RAM size today might be 2GB.
hard disk, implemented with magnetic technology, which could hold
more address space. Of course, the disk is much larger than that, but
only a small portion of it is dedicated to virtual memory,
which stores data within the memory address space.
Each tier holds a subset of what is held in the next tier.
There are occassional exceptions: Some processors
are designed so that if something is in the L1 cache, it is not repeated
in the L2 cache. Thus, when x is loaded from L2 into L1,
whatever is removed from L1 to make room is swapped into the old
location of x within L2.
For most processors, though, their designers found that
this bit of extra capacity isn't worth the extra effort.
The L1 and L2 caches are built transparently into the system: There
is usually no way for a programmer to explicitly manipulate what is
stored in the caches. Instead, the processor itself decides what to
place into each cache.
14.2. Direct-mapped cache
There are three basic categories of caches: direct mapped,
set associative, and fully associative. We'll begin by
examining direct-mapped caches.
14.2.1. Structure
A direct-mapped cache is broken into lines of
several bytes for holding cached data. (Thirty-two bytes would be a
typical line size.) Each line is numbered starting at 0, and
associated with it is
some information, including a valid bit for
representing whether the line holds any useful data and a
tag to identify what real memory address the cache
holds.
A realistic direct-mapped cache might have
512 lines,
each holding 64 bytes, for a total of 32 KB.
As we'll see later, each line contains a
valid bit, a tag, and the 64 bytes of data.
| line | valid | tag | data |
| 0 | ? | ??? | ??? |
| 1 | ? | ??? | ??? |
| 2 | ? | ??? | ??? |
| : | : | : | : |
| 510 | ? | ??? | ??? |
| 511 | ? | ??? | ??? |
In such a system, each 32-bit memory address is considered to have
three pieces, including 17 bits for the tag, 9 bits for the
line number (since 29 is 512), and 6 bits
for the offset within the line (since 26 is
64).

To find a memory item in the cache, the direct-mapped cache breaks
the desired memory address into these three pieces,
tag, line, and offset.
If the memory item is in the cache, it must be offset bytes
into the line numbered line.
To determine whether the data is present,
the cache will verify that this line's valid bit is 1 and that the
tag found in the requested address matches the tag stored
in the line.
If this test reveals that the data is in the cache, the cache can
respond immediately. If not, the cache loads the line containing the
requested address into the memory, sets the line's valid bit and tag,
and then returns the requested data.
14.2.2. Example
To get a better handle on how this works, let's look at an example.
Suppose we're working in a computer using 6-bit memory addresses, and
our computer has a cache with two lines, each holding
four bytes.
Thus, of each 6-bit memory address,
one bit will specify the line number and two will specify
the offset within the line; that leaves
6 − 1 − 2 = 3
bits for the tag.

The cache looks like the following at start-up
time.
| line | valid | tag | data |
| 0 | 0 | ??? | (none) |
| 1 | 0 | ??? | (none) |
Let's see what happens if a program accesses the following sequence
of bytes from memory:
M[19], M[20], M[16], M[34], M[18].
Read M[19].
We break 010011 into its component pieces,
getting 010 for the tag, 0 for the line number, and 11 for the offset.
We look at line 0, find that the valid bit is 0; we conclude the
information isn't yet in the cache, calling it a cache
miss.
On a cache miss, the cache
loads all bytes with the matching tag and line number into the
cache. Thus, we'll load from all memory address of the form 0100xx into
line 0; this includes memory addresses 16 through 19.
| line | valid | tag | data |
| 0 | 1 | 010 | M[16–19] |
| 1 | 0 | 010 | M[20–23] |
(Despite our diagram, the memory addresses 16, 17, 18, and 19 don't
actually appear in the cache. The data column actually contains
the 32 bits found in those memory addresses.)
Read M[20].
We break 010100 into its component pieces,
getting 010 for the tag, 1 for the line number, and 00 for the offset.
We look at line 1, find that the valid bit is 0, and so this is a cache
miss.
We load all memory addresses of the form 0100xx into
line 1.
| line | valid | tag | data |
| 0 | 1 | 010 | M[16–19] |
| 1 | 1 | 010 | M[20–23] |
Read M[16].
We break 010000 into its component pieces,
getting 010 for the tag, 0 for the line number, and 00 for the offset.
We look at line 0, find that the valid bit is 0 and the line's tag
matches the address's tag. Thus we have the desired information
already in cache; this is called a cache hit.
We serve the request using the information
starting at offset 00(2) = 0
in the cache line.
Read M[34].
We break 100010 into its component pieces,
getting 100 for the tag, 0 for the line number, and 10 for the offset.
We look at line 0, find that the valid bit is 1 but the line's tag 010
doesn't match the address's tag 100.
We discard the information currently in line 0, replacing it with
memory addresses 32–35.
| line | valid | tag | data |
| 0 | 1 | 100 | M[32–35] |
| 1 | 1 | 010 | M[20–23] |
Read M[18].
We break 010010 into its component pieces,
getting 010 for the tag, 0 for the line number, and 10 for the offset.
We look at line 0, find that the valid bit is 1 but the line's tag 100
doesn't match the address's tag 010.
We discard the information currently in line 0, replacing it with
memory addresses 16–19.
| line | valid | tag | data |
| 0 | 1 | 010 | M[16–19] |
| 1 | 1 | 010 | M[20–23] |
Thus, in serving this sequence of five memory accesses, we found one
cache hit and four cache misses.
14.2.3. Analysis
The primary thing to consider about a cache is how well it takes
advantage of locality. There are two things to consider.
- How does it handle subsequent accesses to two locations that are
somewhat close?
Usually this goes quite well.
In our example trace, when we load M[0-3], we also load M[4-7], because
this in the same line. Thus, when we access M[4-7] on the next step, it
is already in the cache.
This question is particularly important when you consider array
accesses. Accessing an array in order is a particularly common
operation.
For example, to add up the elements of an array, we would do the
following.
sum = 0;
for(i = 0; i < n; i++) {
sum += a[i];
}
If each line held 32 bytes, and each array element had 4 bytes, then
this loop would miss the cache only once every
32 / 4 = 8 times.
The way the cache manages this performance is by loading up an entire
line at a time.
- How does it handle two accesses to the same location that are
somewhat close together in time?
A direct-mapped cache usually does this well, since it will hold the
line until an access to some distant memory that happens to map to the
same line occurs.
The direct-mapped cache does, however, have a very bad case. Suppose,
in our example, that we have a program that access M[36] and M[0]
in alternation. Each time, line 0 would be replaced, and so each memory
access would miss the cache.
This is unfortunate, because frequent accesses to the same memory
locations should be an easy case for the cache.
This is a shortcoming of direct-mapped caches.
Direct-mapped caches identify the tag first in the address
and the line number second to reduce the likelihood of this problem.
If they split an address by placing the line number first, then
two close memory addresses that were only 32 bytes apart would map to
the same line, and conflicts would be much more likely.
By splitting an address with the tag first,
conflicting addresses are further apart, but this does not avoid the
issue entirely.
14.3. Other cache types
Direct-mapped caches are only one of three basic categories of cache.
The others are set associative and fully associative.
After you understand direct-mapped caches, they're relatively easy to
understand.
14.3.1. Fully associative
As we have seen with direct-mapped caches, having only one location
in the cache where you can store data for a particular address can
lead to some real problems.
A fully associative cache releases this constraint.
The issue of line numbers is entirely irrelevant;
the only two parts of an address are the tag and the line offset.
Asked for memory at a particular address, the fully associative cache
searches through its current lines for one that contains the desired
tag, and it returns the data found in that line.
If it is not found, then it needs to make room for the requested line.
It will use an algorithm such as LRU or Clock to decide which line to
remove; we'll study LRU and Clock later, when we study virtual memory.
Fully associative caches are always more effective than
direct-mapped caches at avoiding cache misses.
CPU designers often use direct-mapped caches, however, because
fully-associative caches require more complex circuitry.
Thus, given a fixed budget of gates,
the designer can build a larger direct-mapped cache
than a fully associative cache.
The bad behavior of direct-mapped caches is rare,
the designer judges,
and so wasting gates on the possibility is a poor use of the limited
number of transistors that can be built into a CPU.
14.3.2. Set associative
The set-associative cache is a compromise between a direct-mapped
cache and a fully associative cache.
Here, we group the lines into sets.
Typical set sizes are four or eight lines per set;
a cache with four lines per set is called 4-way
set-associative.
Given an address, we break it into the desired tag, the set number, and
the offset. The set number says which set to look into, and the computer
must examine all tags in the set to see if any of them match the desired
tag.
As you would expect, set-associative caches require more logic gates
per line than a direct-mapped cache but less than a fully associative
cache. On the other hand, it avoids the bad behavior of a direct-mapped
when a program repeatedly accesses two memory locations that happen to
map to the same line. It would still have a problem if a program
accesses several different memory locations mapping to the same set
— for example, with a four-way set associative
cache, cycling through five different memory addresses in the same set
would cause problems. But such a case is relatively rare.
14.3.3. Real life
To get a feel for how caches are in real life,
we can look at the Intel Pentium D released in early 2006
The Intel Pentium D incorporates three caches.
An L1 i-cache for holding instructions.
The Pentium D has separate caches for instructions and data, since
these two types of data are normally disjoint, and since it frequently
wants to load from each cache in the same clock cycle.
The details of the instruction cache are difficult to describe, because
the Pentium D actually translates each IA32 instruction into one or
more instructions from a simpler instruction set before executing it.
The cache holds these translated instructions, so its size doesn't
correspond to real memory.
A 16 KB L1 d-cache for holding data. This cache is 4-way
set-associative, and it has 64 bytes per line.
A 2 MB L2 cache. This cache is 8-way set-associative, and it has
128 bytes per line. The Pentium D has two cores; each core has its
own pair of L1 caches and its own L2 cache.
Designers tinker with the cache configuration with each new processor
design. As another example, let's look at Intel's Nehalem architecture,
released in late 2008.
A 32 KB L1 i-cache, which is 4-way set-associative.
A 32 KB L1 d-cache, which is 8-way set-associative.
A 256 KB L2 cache, which is 8-way set-associative.
An L3 cache, which is 16-way set-associative.
The cache varies between specific Nehalem processors, but 8 MB is a
representative size.
The Nehalem architecture is designed for up to eight cores;
all cores share a single L3 cache.
14.4. Issues with writing
So far, we've dealt only with reading from memory. But programs also
write to memory, which brings up some interesting issues for caches.
14.4.1. Write-back vs. write-through
Should caches cache writes also?
If we answer no, then we have what is called a
write-through cache;
in such a cache, each request to write information is stored in the
cache and also forwarded onto memory.
If we answer yes, then we have a
write-back cache;
in such a cache, each line has a dirty bit,
and before replacing the data in a line, the cache must save its
data if its dirty bit is set.
Generally, write-back caches are more efficient, but they require more
circuitry. CPU designers have to decide whether they feel that the
benefit is worth spending circuitry on it. Basically, it boils down to
whether a single write is a strong indication that somebody will write
to the same line in the near future. Whether this is true depends on
what sorts of programs run on the computer.
Remember that compilers work very hard to avoid writes to memory.
Thus, if a program compiled by a compiler writes to a particular memory
location, you can bet that the compiler will try to avoid code to
write to the same memory location in the near future.
Thus write-through caches may make some sense.
On the other hand, if we're setting all the elements of an array,
then it's quite likely that other places in the line will be changed in
the near future. So, if we think this will happen often, we'll think
that write-back caches make more sense.
In general, with the increasing amount of logic that can be placed on
the chip, additional logic becomes less expensive. Thus, designers have
an increasing tendency to choose write-back caches for their
designs.
14.4.2. Write-allocate vs. no-write-allocate
If somebody writes to something that's not in the cache, should
we cache that location? If we answer yes, then we have a
write-allocate cache; otherwise, we have a
no-write-allocate cache.
Here, the decision is somewhat more mixed. I'd say that generally
write-allocate makes more sense in conjunction with write-back caches,
since a write to a location likely indicates that a change to a nearby
location is likely in the future.
But a write to a nearby location doesn't have a benefit with
a write-through cache; thus, the benefit is less clear in this
situation.
14.5. Performance issues
Recall the operation of matrix multiplication:
If A and B are the two matrices to multiply, and
we want to determine entry (i, j) of the product
matrix C, we should sum the products of the corresponding
numbers of the ith row of A and the jth
column of B.
Ci, j
= ∑k = 1n
Ai, k
⋅
Bk, j
The following is an example of matrix multiplication.
In this example, the (0, 1) entry in the result matrix,
whose value is 7,
comes from adding together
corresponding products of the row 0 of the first matrix
<3,0,1>
and column 1 of the second matrix
<2,0,1>:
3 ⋅ 2 + 0 ⋅ 0 + 1 ⋅ 1
= 7.
14.5.1. The i–j–k implementation
The following C code is a straightforward translation of the
algorithm.
for(i = 0; i < n; i++) {
for(j = 0; j < n; j++) {
for(k = 0; k < n; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
It's reasonable to ask: How does caching affect this? To analyze this,
we'll look at the innermost for loop and for each memory
access, we'll ask: How often does this memory access miss
the cache on average? We'll assume that we're dealing with very large
matrices, and so n is very large
Let's look at the access to C[i][j] first.
As we iterate through the k's, this address doesn't change.
Thus, C[i][j] won't be in the cache for the first iteration of
the innermost for loop, but it will be in the cache thereafter.
Thus, the average number of cache misses per iteration is 1/n,
which is basically 0, since we're assuming n is large.
The accesses to A[i][k] go through a row of the array.
Adjacent elements of the row are stored in adjacent elements of memory.
Thus, when A[i][k] is loaded into the cache, the elements
that fall on the same cache line will be in the same row.
If we are dealing with a matrix of doubles,
then each array element uses the 64-bit IEEE format,
so there are 8 bytes per array element.
If our cache has lines of 64 bytes, then there are eight array
elements per line.
Thus, when we load an array element into a cache line,
we also load seven other array elements around it.
As we go through the A[i][k],
we'll miss the cache once every eight times through the loop.
Thus, the average number of cache misses per iteration is 0.125.
Finally, let's look at the accesses to B[k][j].
Here, we're going through a column, and each element is 4n
bytes beyond the previous element in the column. We'll never find
B[k][j] in the cache unless it was loaded during the iteration
through the previous row. But, if n is large enough, the bottom
of the previous column will flush from the cache anything cached during
the top of the previous column. And by the time we get to the bottom of
the current column, the top of that column would flush from the cache
anything that might have been loaded in the previous column. Thus,
we'll miss the cache on each access to B[k][j], and so there
are on average 1.00 cache misses for B.
Let's tabulate these results.
Average misses per iteration of innermost loop
A[i][k] misses | 0.125 |
B[k][j] misses | 1.000 |
C[i][j] misses | 0.000 |
| total misses | 1.125 |
On average, we'll miss the cache 1.125 times every time through the
loop.
14.5.2. The k–i–j implementation
Now let's consider a slightly different implementation. The only
thing that has changed here is that we have placed the k loop
first, so that the innermost loop iterates through the j.
The code computes exactly the same matrix as before.
for(k = 0; k < n; k++) {
for(i = 0; i < n; i++) {
for(j = 0; j < n; j++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
If we think about how many misses we have here, we see that the
access to C[i][j] misses on average every 0.125 times through the
inner loop over j (since we're going through a single row of
C), the access to A[i][k] does not miss the cache,
and the access to B[k][j] misses on average every 0.125 times
through the inner loop.
Average misses per iteration of innermost loop
A[i][k] misses | 0.000 |
B[k][j] misses | 0.125 |
C[i][j] misses | 0.125 |
| total misses | 0.250 |
If cache misses were all that mattered, we would expect this
alternative implementation to take as little as 20% as much time as
our first
i–j–k
implementation.
This may not be the entire story, since the second implementation
both reads and writes to C[i][j] each time through the loop,
while the accesses to C[i][j] in the first implementation could
be factored out entirely. Thus, it wouldn't be entirely surprising if
the second implementation were actually slower.
14.5.3. Tests
We can perform some tests to see what happens.
The following table reports the results of some tests
performed on a Pentium D chip.
The numbers represent the number of nanoseconds per iteration of the
inner loop. That is, the number is the total amount of time
spent on the multiplication divided by n³.
| n |
i–j–k |
k–i–j |
| 32 |
2.94 |
4.47 |
| 64 |
3.03 |
4.02 |
| 128 |
2.84 |
3.85 |
| 256 |
5.96 |
3.95 |
| 512 |
20.04 |
3.80 |
For small n, we don't expect to see a huge difference,
and that's reflected in our numbers: If the matrix doesn't have many
rows, then a full column will fit into the cache. As a result, when we
go through the inner loop for one value of j in our
i–j–k implementation, the cache will
load the entire jth column of B, and for the next value of
j the cache will find every entry of B.
But for larger n, an entire column will not fit into the
cache. And, indeed, we see that
i–j–k
becomes much slower when n reaches 512.
That the jump occurs at 512 isn't surprising: The Pentium D's
L1 cache has 256 lines.
Indeed, we see that when n is 512, the time for
is roughly 5 times the time for
k–i–j,
as predicted by our analysis.