Chapter 14. Caches

Modern computing systems use memory very heavily, and so memory access time is very important. In this chapter, we investigate how computers reduce memory access time using caches.

14.1. Memory hierarchy

Today's computing systems includes three important technologies for remembering information.

To get a feeling for the relative uses of these technologies, we can look at a table.

access
time
cost
per MB
SRAM2nanoseconds$100.00
DRAM60nanoseconds$1.00
disk8milliseconds$0.04
(All numbers are from 2000)

In comparison, a CPU will try to complete an instruction every clock cycle; if we have a 2GHz CPU, that means there are 2 billion clock cycles each second, which means 0.5 ns (nanoseconds) per clock cycle. Thus, SRAM can nearly keep up, while DRAM is significantly slower, and the disk is slothlike. But the price of SRAM is so huge that we cannot place much of it in a computer.

Aside

To talk about computing systems intelligently, you need to know your scientific units! Below is a table of the most important units.

milli-(m)one thousandth      kilo-(K)one thousand
micro-(μ)one millionth mega-(M)one million
nano-(n)one billionth giga-(G)one billion
tera-(T)one trillion

More extreme measures occassionally arise. It's not as important that you remember them, but several are listed in the below table. The larger units come from mutating penta-, hexa-, septa-, and octa-.

pico-(p)10−12 (one trillionth)
femto-(f)10−15      peta-(P)1015
exa-(E)1018
zetta-(Z)1021
yotta-(Y)1024

Over the last several decades, CPU speed has improved dramatically. Memory speed has also improved, but not as quickly. Thus, memory has become an important bottleneck, to the extent that it is the major factor limiting the performance of many programs.

Computing systems improve memory performance through the use of cache, where we use SRAM to hold the data that is likely to be used often in the near future. This uses the fact that programs often have a strong sense of locality. The principle of locality states that if a program accesses a particular memory location, it is likely to access a nearby memory location in the near future. This has been tested numerous times, and invariably programs display this property in abundance. We just don't see programs accessing random memory locations often.

Most modern computers have at least four tiers for holding memory.

Each tier holds a subset of what is held in the next tier. There are occassional exceptions: Some processors are designed so that if something is in the L1 cache, it is not repeated in the L2 cache. Thus, when x is loaded from L2 into L1, whatever is removed from L1 to make room is swapped into the old location of x within L2. For most processors, though, their designers found that this bit of extra capacity isn't worth the extra effort.

The L1 and L2 caches are built transparently into the system: There is usually no way for a programmer to explicitly manipulate what is stored in the caches. Instead, the processor itself decides what to place into each cache.

14.2. Direct-mapped cache

There are three basic categories of caches: direct mapped, set associative, and fully associative. We'll begin by examining direct-mapped caches.

14.2.1. Structure

A direct-mapped cache is broken into lines of several bytes for holding cached data. (Thirty-two bytes would be a typical line size.) Each line is numbered starting at 0, and associated with it is some information, including a valid bit for representing whether the line holds any useful data and a tag to identify what real memory address the cache holds.

A realistic direct-mapped cache might have 512 lines, each holding 64 bytes, for a total of 32 KB. As we'll see later, each line contains a valid bit, a tag, and the 64 bytes of data.

linevalidtagdata
0 ? ??????
1 ? ??????
2 ? ??????
: : : :
510 ? ??????
511 ? ??????

In such a system, each 32-bit memory address is considered to have three pieces, including 17 bits for the tag, 9 bits for the line number (since 29 is 512), and 6 bits for the offset within the line (since 26 is 64).

To find a memory item in the cache, the direct-mapped cache breaks the desired memory address into these three pieces, tag, line, and offset. If the memory item is in the cache, it must be offset bytes into the line numbered line. To determine whether the data is present, the cache will verify that this line's valid bit is 1 and that the tag found in the requested address matches the tag stored in the line. If this test reveals that the data is in the cache, the cache can respond immediately. If not, the cache loads the line containing the requested address into the memory, sets the line's valid bit and tag, and then returns the requested data.

14.2.2. Example

To get a better handle on how this works, let's look at an example. Suppose we're working in a computer using 6-bit memory addresses, and our computer has a cache with two lines, each holding four bytes. Thus, of each 6-bit memory address, one bit will specify the line number and two will specify the offset within the line; that leaves 6 − 1 − 2 = 3 bits for the tag.

The cache looks like the following at start-up time.

linevalidtagdata
00???(none)
10???(none)

Let's see what happens if a program accesses the following sequence of bytes from memory: M[19], M[20], M[16], M[34], M[18].

  1. Read M[19]. We break 010011 into its component pieces, getting 010 for the tag, 0 for the line number, and 11 for the offset. We look at line 0, find that the valid bit is 0; we conclude the information isn't yet in the cache, calling it a cache miss. On a cache miss, the cache loads all bytes with the matching tag and line number into the cache. Thus, we'll load from all memory address of the form 0100xx into line 0; this includes memory addresses 16 through 19.

    linevalidtagdata
    01010M[16–19]
    10010M[20–23]

    (Despite our diagram, the memory addresses 16, 17, 18, and 19 don't actually appear in the cache. The data column actually contains the 32 bits found in those memory addresses.)

  2. Read M[20]. We break 010100 into its component pieces, getting 010 for the tag, 1 for the line number, and 00 for the offset. We look at line 1, find that the valid bit is 0, and so this is a cache miss. We load all memory addresses of the form 0100xx into line 1.

    linevalidtagdata
    01010M[16–19]
    11010M[20–23]
  3. Read M[16]. We break 010000 into its component pieces, getting 010 for the tag, 0 for the line number, and 00 for the offset. We look at line 0, find that the valid bit is 0 and the line's tag matches the address's tag. Thus we have the desired information already in cache; this is called a cache hit. We serve the request using the information starting at offset 00(2) = 0 in the cache line.

  4. Read M[34]. We break 100010 into its component pieces, getting 100 for the tag, 0 for the line number, and 10 for the offset. We look at line 0, find that the valid bit is 1 but the line's tag 010 doesn't match the address's tag 100. We discard the information currently in line 0, replacing it with memory addresses 32–35.

    linevalidtagdata
    01100M[32–35]
    11010M[20–23]
  5. Read M[18]. We break 010010 into its component pieces, getting 010 for the tag, 0 for the line number, and 10 for the offset. We look at line 0, find that the valid bit is 1 but the line's tag 100 doesn't match the address's tag 010. We discard the information currently in line 0, replacing it with memory addresses 16–19.

    linevalidtagdata
    01010M[16–19]
    11010M[20–23]

Thus, in serving this sequence of five memory accesses, we found one cache hit and four cache misses.

14.2.3. Analysis

The primary thing to consider about a cache is how well it takes advantage of locality. There are two things to consider.

How does it handle subsequent accesses to two locations that are somewhat close?

Usually this goes quite well. In our example trace, when we load M[0-3], we also load M[4-7], because this in the same line. Thus, when we access M[4-7] on the next step, it is already in the cache.

This question is particularly important when you consider array accesses. Accessing an array in order is a particularly common operation. For example, to add up the elements of an array, we would do the following.

sum = 0;
for(i = 0; i < ni++) {
    sum += a[i];
}

If each line held 32 bytes, and each array element had 4 bytes, then this loop would miss the cache only once every 32 / 4 = 8 times. The way the cache manages this performance is by loading up an entire line at a time.

How does it handle two accesses to the same location that are somewhat close together in time?

A direct-mapped cache usually does this well, since it will hold the line until an access to some distant memory that happens to map to the same line occurs.

The direct-mapped cache does, however, have a very bad case. Suppose, in our example, that we have a program that access M[36] and M[0] in alternation. Each time, line 0 would be replaced, and so each memory access would miss the cache. This is unfortunate, because frequent accesses to the same memory locations should be an easy case for the cache. This is a shortcoming of direct-mapped caches.

Direct-mapped caches identify the tag first in the address and the line number second to reduce the likelihood of this problem. If they split an address by placing the line number first, then two close memory addresses that were only 32 bytes apart would map to the same line, and conflicts would be much more likely. By splitting an address with the tag first, conflicting addresses are further apart, but this does not avoid the issue entirely.

14.3. Other cache types

Direct-mapped caches are only one of three basic categories of cache. The others are set associative and fully associative. After you understand direct-mapped caches, they're relatively easy to understand.

14.3.1. Fully associative

As we have seen with direct-mapped caches, having only one location in the cache where you can store data for a particular address can lead to some real problems. A fully associative cache releases this constraint. The issue of line numbers is entirely irrelevant; the only two parts of an address are the tag and the line offset. Asked for memory at a particular address, the fully associative cache searches through its current lines for one that contains the desired tag, and it returns the data found in that line. If it is not found, then it needs to make room for the requested line. It will use an algorithm such as LRU or Clock to decide which line to remove; we'll study LRU and Clock later, when we study virtual memory.

Fully associative caches are always more effective than direct-mapped caches at avoiding cache misses. CPU designers often use direct-mapped caches, however, because fully-associative caches require more complex circuitry. Thus, given a fixed budget of gates, the designer can build a larger direct-mapped cache than a fully associative cache. The bad behavior of direct-mapped caches is rare, the designer judges, and so wasting gates on the possibility is a poor use of the limited number of transistors that can be built into a CPU.

14.3.2. Set associative

The set-associative cache is a compromise between a direct-mapped cache and a fully associative cache. Here, we group the lines into sets. Typical set sizes are four or eight lines per set; a cache with four lines per set is called 4-way set-associative. Given an address, we break it into the desired tag, the set number, and the offset. The set number says which set to look into, and the computer must examine all tags in the set to see if any of them match the desired tag.

As you would expect, set-associative caches require more logic gates per line than a direct-mapped cache but less than a fully associative cache. On the other hand, it avoids the bad behavior of a direct-mapped when a program repeatedly accesses two memory locations that happen to map to the same line. It would still have a problem if a program accesses several different memory locations mapping to the same set — for example, with a four-way set associative cache, cycling through five different memory addresses in the same set would cause problems. But such a case is relatively rare.

14.3.3. Real life

To get a feel for how caches are in real life, we can look at the Intel Pentium D released in early 2006 The Intel Pentium D incorporates three caches.

Designers tinker with the cache configuration with each new processor design. As another example, let's look at Intel's Nehalem architecture, released in late 2008.

14.4. Issues with writing

So far, we've dealt only with reading from memory. But programs also write to memory, which brings up some interesting issues for caches.

14.4.1. Write-back vs. write-through

Should caches cache writes also? If we answer no, then we have what is called a write-through cache; in such a cache, each request to write information is stored in the cache and also forwarded onto memory. If we answer yes, then we have a write-back cache; in such a cache, each line has a dirty bit, and before replacing the data in a line, the cache must save its data if its dirty bit is set.

Generally, write-back caches are more efficient, but they require more circuitry. CPU designers have to decide whether they feel that the benefit is worth spending circuitry on it. Basically, it boils down to whether a single write is a strong indication that somebody will write to the same line in the near future. Whether this is true depends on what sorts of programs run on the computer.

Remember that compilers work very hard to avoid writes to memory. Thus, if a program compiled by a compiler writes to a particular memory location, you can bet that the compiler will try to avoid code to write to the same memory location in the near future. Thus write-through caches may make some sense.

On the other hand, if we're setting all the elements of an array, then it's quite likely that other places in the line will be changed in the near future. So, if we think this will happen often, we'll think that write-back caches make more sense.

In general, with the increasing amount of logic that can be placed on the chip, additional logic becomes less expensive. Thus, designers have an increasing tendency to choose write-back caches for their designs.

14.4.2. Write-allocate vs. no-write-allocate

If somebody writes to something that's not in the cache, should we cache that location? If we answer yes, then we have a write-allocate cache; otherwise, we have a no-write-allocate cache.

Here, the decision is somewhat more mixed. I'd say that generally write-allocate makes more sense in conjunction with write-back caches, since a write to a location likely indicates that a change to a nearby location is likely in the future. But a write to a nearby location doesn't have a benefit with a write-through cache; thus, the benefit is less clear in this situation.

14.5. Performance issues

Recall the operation of matrix multiplication: If A and B are the two matrices to multiply, and we want to determine entry (ij) of the product matrix C, we should sum the products of the corresponding numbers of the ith row of A and the jth column of B.

Cij = ∑k = 1n AikBkj

The following is an example of matrix multiplication.

301
002
110
 × 
020
100
011
 = 
071
022
120

In this example, the (0, 1) entry in the result matrix, whose value is 7, comes from adding together corresponding products of the row 0 of the first matrix <3,0,1> and column 1 of the second matrix <2,0,1>: 3 ⋅ 2 + 0 ⋅ 0 + 1 ⋅ 1 = 7.

14.5.1. The ijk implementation

The following C code is a straightforward translation of the algorithm.

for(i = 0; i < ni++) {
    for(j = 0; j < nj++) {
        for(k = 0; k < nk++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

It's reasonable to ask: How does caching affect this? To analyze this, we'll look at the innermost for loop and for each memory access, we'll ask: How often does this memory access miss the cache on average? We'll assume that we're dealing with very large matrices, and so n is very large

Let's look at the access to C[i][j] first. As we iterate through the k's, this address doesn't change. Thus, C[i][j] won't be in the cache for the first iteration of the innermost for loop, but it will be in the cache thereafter. Thus, the average number of cache misses per iteration is 1/n, which is basically 0, since we're assuming n is large.

The accesses to A[i][k] go through a row of the array. Adjacent elements of the row are stored in adjacent elements of memory. Thus, when A[i][k] is loaded into the cache, the elements that fall on the same cache line will be in the same row. If we are dealing with a matrix of doubles, then each array element uses the 64-bit IEEE format, so there are 8 bytes per array element. If our cache has lines of 64 bytes, then there are eight array elements per line. Thus, when we load an array element into a cache line, we also load seven other array elements around it. As we go through the A[i][k], we'll miss the cache once every eight times through the loop. Thus, the average number of cache misses per iteration is 0.125.

Finally, let's look at the accesses to B[k][j]. Here, we're going through a column, and each element is 4n bytes beyond the previous element in the column. We'll never find B[k][j] in the cache unless it was loaded during the iteration through the previous row. But, if n is large enough, the bottom of the previous column will flush from the cache anything cached during the top of the previous column. And by the time we get to the bottom of the current column, the top of that column would flush from the cache anything that might have been loaded in the previous column. Thus, we'll miss the cache on each access to B[k][j], and so there are on average 1.00 cache misses for B.

Let's tabulate these results.

Average misses per iteration of innermost loop
A[i][k] misses0.125
B[k][j] misses1.000
C[i][j] misses0.000
total misses1.125

On average, we'll miss the cache 1.125 times every time through the loop.

14.5.2. The kij implementation

Now let's consider a slightly different implementation. The only thing that has changed here is that we have placed the k loop first, so that the innermost loop iterates through the j. The code computes exactly the same matrix as before.

for(k = 0; k < nk++) {
    for(i = 0; i < ni++) {
        for(j = 0; j < nj++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

If we think about how many misses we have here, we see that the access to C[i][j] misses on average every 0.125 times through the inner loop over j (since we're going through a single row of C), the access to A[i][k] does not miss the cache, and the access to B[k][j] misses on average every 0.125 times through the inner loop.

Average misses per iteration of innermost loop
A[i][k] misses0.000
B[k][j] misses0.125
C[i][j] misses0.125
total misses0.250

If cache misses were all that mattered, we would expect this alternative implementation to take as little as 20% as much time as our first ijk implementation.

This may not be the entire story, since the second implementation both reads and writes to C[i][j] each time through the loop, while the accesses to C[i][j] in the first implementation could be factored out entirely. Thus, it wouldn't be entirely surprising if the second implementation were actually slower.

14.5.3. Tests

We can perform some tests to see what happens. The following table reports the results of some tests performed on a Pentium D chip. The numbers represent the number of nanoseconds per iteration of the inner loop. That is, the number is the total amount of time spent on the multiplication divided by n³.

n ijk kij
32 2.94 4.47
64 3.03 4.02
128 2.84 3.85
256 5.96 3.95
512 20.04 3.80

For small n, we don't expect to see a huge difference, and that's reflected in our numbers: If the matrix doesn't have many rows, then a full column will fit into the cache. As a result, when we go through the inner loop for one value of j in our ijk implementation, the cache will load the entire jth column of B, and for the next value of j the cache will find every entry of B.

But for larger n, an entire column will not fit into the cache. And, indeed, we see that ijk becomes much slower when n reaches 512. That the jump occurs at 512 isn't surprising: The Pentium D's L1 cache has 256 lines. Indeed, we see that when n is 512, the time for is roughly 5 times the time for kij, as predicted by our analysis.