Compare the virtues of a direct-mapped cache versus a fully associative cache.
With a direct-mapped cache, in which every memory address can be cached in only one location within the cache, there is the possibility that an access to a distant memory address would eject a very recent access that will likely be used soon in the future. A fully associative cache doesn't share this occassional poor behavior, but it requires more logic circuitry per cache line, so that large and fast fully associative cache are unaffordable.
Why do many CPUs implement a set-associative cache rather than the direct-mapped cache? Why use a set-associative cache rather than the fully associative cache?
A direct-mapped cache has the disadvantage that if a program frequently access memory locations that happen to map to the same cache line, then these memory accesses will frequently miss. This is much less likely to happen with a set-associative cache.
With a fully associative cache, such cache misses are even less likely, but fully associative increases the circuit complexity, and set-associative caches work well enough in practice that reducing misses by using a fully associative cache instead is hardly worth the added circuit complexity.
Describe the difference between the write-through and write-back approaches to handling writes in a cache, and contrast their relative advantages.
In a write-through cache, each write is sent immediately into the underlying memory; whereas in a write-back cache, writes are not sent to underlying memory until the corresponding line is ejected from the cache.
Because the write-through approach is simpler than the write-back approach, it requires fewer transistors to support. But a write-back cache will result in less traffic with memory. [Also, in multi-processor systems with separate caches, the write-through approach is particularly advantageous.]
Which of the below C fragments is likely to execute more quickly? Explain your answer in detail.
| a. | b. |
sum = 0; | sum = 0; |
Fragment a. is likely to execute more quickly because it maps better to cache: It goes through the two-dimensional array at successive memory addresses, so a cache miss for one array entry will lead to several successive cache hits at subsequent array entries. By contrast, fragment b. goes through the array column by column; loading one entry in the column will not lead the next column entry to be simultaneously loaded.
Consider the following C code to multiply two matrices.
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
for (int k = 0; k < n; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
Note that the innermost loop here iterates over
k.
Suppose that our CPU has a direct-mapped cache with 128 lines, each holding 64 bytes, and suppose that each array element takes 8 bytes. On average, how many times per iteration through the innermost loop will the CPU miss the cache in accessing each of the following?
| a. | C[i][j] |
| b. | A[i][k] |
| c. | B[k][j] |
| a. | 0 (This is the same memory location each time through the innermost loop, so there will be no misses after the first iteration.) |
| b. | 0.125 (As we go through the innermost loop, we are walking through successive memory locations. Each time we miss the cache, we will load eight successive locations in the row; thus this access misses the cache one time in eight.) |
| c. | 1.0 (Each array access in in a different row, so they will be widely separated. Loading one element in the column will not help in reaching another. Thus each access to B will be a miss.) |
Each of the following two code fragments computes the sum of list elements found in a two-dimensional array used to represent a linked list. Each node of the list consists of an integer datum and an integer giving the index of the next node within the array (or −1 if the node is the final node).
int list[10000][2]; | int list[2][10000]; |
In view of caching effects, which will be faster? Explain your answer, with specific reference to how the cache works.
The fragment on left will be faster. With each iteration of the loop, this fragment accesses two successive memory locations, likely to occur on the same line in the cache. Consequently, there will be just one miss for each iteration.
With the fragment on right, the two memory accesses are far
apart and so will be on different lines; moreover, since n
will skip around throughout the array, there entries will
rarely be in the cache from previous iterations. As a result,
the fragment on right will typically see two cache misses per
iteration, and so it will take twice as long.
Tests on an Intel Pentium III processor reveal that the performance of
the below code depends heavily on the value of skip.
int sum(char *arr, int skip) {
int i, k, total;
total = 0;
for (k = 0; k < 4000; k++) {
for (i = k; i < 4000000; i += skip) {
total += arr[i];
}
}
return total;
}
skiptime 16,352 40 ms 16,384 480 ms
Surprisingly, with a larger value of skip,
for which the code will add fewer numbers,
the code takes twelve times as long.
Explain how the cache is making a difference here. (A Pentium III's L1 cache is 4-way set associative, with 512 lines, each holding 32 bytes. Note that the cache holds 512 × 32 = 16,384 bytes.)
When skip is 16,384, each access to arr[i] in the
inner loop maps to the same set of cache lines,
the lines accessed at the beginning of the inner loop will
be dropped from the cache as the inner loop continues.
When k is incremented and the program reads the
array entries following those accessed in the previous loop,
those lines aren't available in the cache.
As a result, all array accesses miss the cache.
When skip is 16,352, then the accesses to
arr[i] map to a variety of cache lines, and all the requested
lines can fit into the cache. (The code would access
4000000/16362 ≈ 244 lines, and since they map to different sets,
they would all fit into the cache.)
When the code continues to the next value of k,
the data is still in the cache from before, and so the
array accesses usually hit the cache.