Chapter 17. OS: Virtual memory
Until now, we've imagined that each process is given its own segment of RAM to use. Such an approach — though straightforward and easy to implement — brings up some issues.
How does the operating system decide how much memory to allocate to each process?
What if the total memory required by active processes exceeds the amount of RAM available?
How can the system prevent a process from reading or altering the memory supposedly reserved for other processes?
Virtual memory addresses all of these issues, and several more besides. Virtual memory is an essential component of modern computing systems.
17.1. Basics
The idea of virtual memory is to create a virtual address space that doesn't correspond to actual addresses in RAM. The system stores the official copy of memory on disk and caches only the most frequently used data in RAM. To make this workable, we break virtual memory into chunks called pages; a typical page size is four kilobytes. We also break RAM into page frames, each the same size as a page, ready to hold any page of virtual memory.
The system also maintains a page table, stored in RAM, which is an array of entries, one for each page, storing information about the page. The most important piece of information in each page table entry is whether the corresponding page is loaded into a frame of RAM and, if so, which frame contains it. The CPU uses this page table in looking up data in virtual memory.
Consider the following illustration of a system using virtual memory.
Here, we have eight pages on disk pictured on the left and three page frames in the memory pictured at right. (This example is much smaller than in real life, where you would have thousands of pages and page frames.) In this example, each of the three frames holds a page from memory: The data of page 0 (in red) is in frame 1, page 2 (yellow) is in frame 3, and page 4 (blue) is in frame 2. You can see that the page table, also stored in RAM, contains this mapping from pages to frames.
17.1.1. Address translation
If each page held four kilobytes in our example illustrated above, each memory address would be 15 bits long. This is because the example includes eight 4K pages, for a total of 8 ⋅ 4K = 32K, and 215 bytes is 32 kilobytes.
Whenever a program requests access to a memory address, the CPU will always work with this as a virtual memory address, and it will need somehow to find where the data is actually loaded. The CPU goes through the following process.
The CPU breaks the address into the first three bits giving the page number page, and the last twelve bits giving the offset offs within the page.
The CPU looks into the page table at index page to find which page frame f contains the page.
If the page entry says that the page is not in RAM, it initiates a page fault. This is an exception telling the operating system that it needs to bring a page into memory. After the operating system's exception handler finishes, it returns back to the same instruction so the CPU ends up trying the instruction over again.
Otherwise, the CPU loads from the memory address offs within page frame f.
In this way, the CPU translates the virtual memory address into an actual memory address.
Suppose we were in the above-illustrated situation, and a program says to load some information from address 010 1001 1101 0011. The computer breaks this into two parts, the page number 010 and the offset 1001 1101 0011. It looks into the page table at index 010(2) = 2, finds 3, and replaces the page index with the page frame index to get the actual memory address, 11 1001 1101 0011(2). The CPU looks at this address in RAM to get the requested data.
Suppose that the program later says to load some information from address 110 1010 0000 1000(2). In this case, the page number is 110(2) = 6. Looking in the page table at entry 6, we see that the page is not anywhere in memory. Thus the CPU causes a page fault, giving the operating system a chance to load the desired page into a page frame and repair the page table. When the OS completes this process, it returns from the interrupt handler for the CPU to try the program's instruction again. This time, the CPU should be successful, since the OS has just loaded the required page into memory.
Notice the separation of responsibilities in virtual memory between the instruction set layer and the operating system layer. For efficiency, the system must be designed so that, in the regular course of execution, the CPU does not often require entering the operating system. Since memory references occur often, the CPU should be able to complete most memory references on its own. Thus, the job of translating virtual memory addresses into actual memory addresses falls to the CPU. Since the CPU will be looking into the page table to accomplish this, the instruction set designer is the one who will define the page table format.
Page faults are rare, however, and so the process of handling page faults need not be efficient. (This is especially true because the process of loading a page from disk is bound to be slow anyway, simply because disks are slow.) Additionally, the process for handling them is relatively complicated. Thus, the instruction set designer simply defines how a page fault is raised, and it leaves the details of how to accomplish the job of loading a page into memory to the operating system. Among other things, this means that different operating systems for the same CPU can have different approaches for storing virtual memory on disk and different approaches for deciding which loaded page to replace when another page is needed.
17.1.2. Page table format
The page table is the primary data structure for holding information about each page in memory. In a typical system, a page table entry will include the following information about each page.
a bit specifying whether that page is currently loaded into memory.
several bits dedicated to specifying which page frame contains the page (if indeed it is in memory). In our illustration with just three page frames, the entry would require 2 bits for this purpose. Actual systems would be able to accommodate many more pages, and twenty bits would be more realistic.
a bit, called the dirty bit, specifying whether the page in memory has been altered since being loaded into memory. This bit is useful because, when it is time to remove the page from memory, the OS must to store it back to disk if it has changed at all (that is, if it is
dirty
). Storing something on the disk takes additional time, so we only want to do this when necessary; the dirty bit allows the OS to avoid the cost when possible.a bit, called the referenced bit, that the CPU sets each time the page is accessed. The OS will reset this to 0 periodically. When it is 1, then, this indicates that the page has been accessed recently. The OS might use this information in deciding which page frame should be emptied to make room for a needed page.
17.2. Replacement algorithms
One of the most important issues in virtual memory is the paging algorithm. For the success of a virtual memory system, we need an algorithm that minimizes the number of page faults on typical request sequences while simultaneously requiring very little computation. Researchers have investigated many algorithms for this, and we'll examine a sample.
17.2.1. LRU (Least recently used)
The LRU algorithm says that the system should always eject the page that was least recently used. A page that has been used recently, after all, is likely to be used again in the near future, so we should not eject such a page.
LRU is about the best people have found for avoiding page faults. People would use it universally, were it not also much less efficient computationally than FIFO. There are basically two alternatives for implementation.
Linked list: The system maintains a doubly linked list among the page frames. For any access to the page inside a frame, the system moves that frame's node to the front of the list. Later, when a fault occurs, the system chooses to eject from the frame at the list's end.
This technique suffers from major updating costs for memory access: Moving a node to a list's front requires modifying at least four pointers. Thus, each memory request by a program requires four additional memory modifications (in addition to the extra page table access already necessary to determine the page's location in memory). These extra memory references add up to an unacceptable expense.
Counter: The system associates a value for each page frame and it keeps a counter clocking the time. When the system accesses the page in a frame, it also copies the current clock into the frame's associated value. Later, when a fault occurs, the system goes through all frames and empties the one whose value is least.
While this improves on the update cost per access, it dramatically increases the cost of selecting a page to eject: The OS must go through all page frames in order to determine the smallest value. Remember that a typical system may have many page frames (perhaps 100,000), so this cost is not small.
Thus, although LRU does well in avoiding page faults, people look for other algorithms because of its computational expense.
17.2.2. Clock
The Clock algorithm is a hybrid between the FIFO and the LRU algorithms. In this, the system maintains a referenced bit for each page, which is set by the CPU each time the page is accessed, and a counter of which page was last ejected. When it is time to eject a page, the OS repeatedly increments the counter until it finds a page whose referenced bit is 0; this is the page that ejected. As it increments the counter, it clears the referenced bit associated with each page passed. [The name Clock comes from the fact that the counter iterates through pages like a clock's hand. Of course, this is true of FIFO too, but FIFO sounds better.]
Note that, for each page access, the Clock algorithm requires the CPU to set a single bit. This is an additional expense, but it requires only a single memory access. (FIFO has no such expense, but it does not take a page's use into consideration.)
The Clock algorithm is somewhat worse than LRU and appreciably better than FIFO at avoiding page faults. Its computational expense is less than LRU's and more than FIFO's. For virtual memory systems, it has proven an excellent compromise between the two. Many operating systems base their page replacement algorithm on the Clock algorithm.
17.3. Implementation issues
Incidentally, virtual memory is one of the most prominent pieces of
evidence in favor of a famous aphorism in computer science coined by
David Wheeler: Any problem in computer science can be solved with
another layer of indirection.
After all, virtual memory solves a wide
variety of problems, and it does this by separating the memory addresses
referenced by a program from the actual RAM memory addresses.
But Wheeler followed his aphorism with a warning: But that
usually will create another problem.
And so it proves with virtual
memory.
Because of the tremendous slowness of the disk relative to memory and the CPU, avoiding page faults is crucial to the success of a virtual memory system. Virtual memory systems, equipped with good algorithms and a large cache, have proven to be able to avoid many page faults successfully in realistic page request scenarios. (If programs accessed memory completely at random, then this would not be possible. Luckily, such programs are rare.) Thus, the slowness of the disk does not prove to impede the viability of virtual memory systems.
Beyond this, however, there are other concerns regarding virtual memory deserving attention.
The address translation process must be supported by hardware, and this leads to a more complex CPU.
Paging can significantly slow down memory accesses. Since the page table is large, it typically must reside in regular memory. But this means that the CPU must now access memory twice for every memory reference — once to get the page table entry, and another time to get the actual data. Thus, in a straightforward implementation, virtual memory halves memory access time. (And that doesn't count the time to alter the referenced bit and/or dirty bit, which would require an additional memory access.)
The page table can be very large. For example, if a computer supports a full 32-bit address space (4GB), and each page is 1KB, then the page table itself must have 4M entries. Since each entry must have at least four bytes to hold the page frame index and other table entry data, this consumes a total of 32MB of memory. The page table needs to be permanently in memory — paging the page table doesn't work, since the CPU needs a fixed location to look up the page frame containing any page. Spending 32MB of memory for bookkeeping the rest of memory is an onerous penalty.
Luckily, these issues can be addressed with additional effort. The first is relatively minor, since the benefits of virtual memory are substantial enough that added complexity is worth it. The other two issues — speed and page table size — are more interesting, and we'll look at some techniques for addressing each.
17.3.1. Speed: The TLB
The easiest way to resolve the speed issue would be to move the page table onto the CPU, thereby reducing the expense of accessing the page table dramatically. But the large size of the page table makes this impossible. Thus, the CPU merely caches page table entries on the chip, in an area called the translation lookaside buffer, or TLB. The CPU can manage this cache transparently, so that the operating system and certainly user programs need not worry about it.
For the CPU designer, choosing the number of entries cached in the TLB
is a choice of whether spending transistors for an additional cache
entry will increase the overall system efficiency.
(CPU designers work with a transistor budget
— technology and cost
considerations limit them to a certain number of transistors, and the
goal is to spend them as effectively as possible.)
Since the penalty for missing the TLB is simply one or two
additional memory accesses, the TLB tends to be small. The Pentium III
processor has 64 TLB entries, for example.
Each TLB entry has two pieces: the virtual page number and the page table entry. When the CPU wants to get a page table entry, it first looks into the TLB and sees if any of the cached virtual page numbers match the page requested. If it's there, then the CPU has the page table entry immediately. If not, then the CPU loads the page table entry from memory (and caches it for the future).
This may sound like a slow process: Isn't this process of checking the desired page number against every TLB entry slow? In classical computer science, checking every TLB entry is an $O(n)$ operation, which seems slower than desirable, even if n is only 64. But it's important to remember that the TLB is implemented as a logic circuit, and each TLB entry would have its own set of transistors to compare its page number against the desired page number. Thus, the comparison against all the TLB entries actually occurs in parallel. Doubling the number of TLB entries does not appreciably increase the amount of time to perform this check. (It does double the number of transistors, however.)
The TLB significantly increases the efficiency of virtual memory. Say a memory access takes 10 ns. Without a TLB, each virtual memory access requires 20 ns — one to read from the page table, and one to read from the actual memory location. (We're neglecting page faults for the purpose of this example.) But with a TLB, the hardware can check all the TLB entries in negligible time. Thus, when the TLB contains the page table entry, the memory access takes just 10 ns. If the TLB has a 90% success rate, the average time per memory access is 0.9 ⋅ 10 + 0.1 ⋅ 20 = 11 ns. Thus, using a TLB, the system experiences just a 10% slowdown against using raw memory. That's a much more manageable price to pay for the increased convenience of virtual memory.
As more transistors have fit onto CPUs, the TLB has grown larger. In modern processors, the success rate tends to be high. Some CPU designers (including those designing the SPARC and Alpha chips) have decided that the TLB hit rate was high enough that perhaps the job of managing the page table should be handed over entirely to the operating system. In these systems, any TLB miss causes a TLB fault, and the operating system is responsible for its page entry into the TLB. (If the requested page isn't in memory, then the OS must also load the page.) By delegating TLB management to the operating system, the CPU no longer must worry about page tables or page faults. This saves transistors and, since the CPU no longer has any reason to refer to the page table, it reduces the instruction set's complexity and gives more flexibility to the operating system designer.
17.3.2. Large tables: Table directories
To get around the hugeness of the page tables, a common solution is to add another level of indirection — the page directory. Now each memory address has three pieces.
Say the memory address's three pieces are t, p, and o. First the MMU looks at element t of the page directory, which specifies where the page table is. Then it looks at element p of this page table to determine the memory address where the page starts. Finally, it looks at offset o in the page to get the actual information.
Suppose, for example, that we had a system which used 3 bits for t, 3 bits for p, and 3 bits for o, and memory looked as follows.
If a process requested the memory at the virtual address 010101011, then it would split this address into t = 010(2), p = 101(2), and o = 011(2). It would go to location t = 2 in the page directory, which says to go to table A, and it would go to entry p = 5 in this table, which says to go to page B, where it would find the data requested o = 3 bytes into this page.
The page directory has two virtues. First, it means that large unused portions of the virtual address space do not require page table entries, since such a page table can simply be omitted from the page directory. This can provide significant space savings. Just as significantly, the individual page tables can be paged themselves, so the system does not need to allocate memory for page entries of infrequently-used ranges of the address space.
The downside is that page directories enlarge the speed problem: We've added yet another memory access to the process of decoding a virtual address. Such an additional cost would be a major problem with page directories, were it not the case that the TLB caches page table entries effectively.
17.3.3. Large tables: Inverted page tables
Note that page directories solve the size problem by adding another layer of indirection. It doesn't really save the overall space needed for the page table unless vast ranges of the address space are unused. An alternative approach to the size problem is to observe that only the entries for pages currently in memory are useful: The only thing we need to know for pages stored on the disk is that they aren't in memory. This is the idea behind the inverted page table, in which we replace the page table with a hash table which maps indices for pages loaded in memory to page table entries. If a page isn't currently loaded, we can omit its entry from the hash table. Thus, the inverted page table must be big enough to accommodate only the number of page frames, not the number of virtual pages (as with a regular page table).
With the trend toward larger virtual memory sizes, especially with the movement into the 64-bit range, an inverted page table becomes more relevant. They slow access time significantly, but a large TLB reduces the need to incur this expense.
17.4. Analysis: Segments and virtual memory
Segmentation and virtual memory both exist to address problems related to high demand for memory, but the similarity stops there. In fact, they are compatible, and most modern computing systems employ both. In a typical combination of the two techniques, each segment has its own page table.
While the originating motivation for virtual memory is to provide a larger address space than supported directly by RAM, this has become less relevant as RAM has become cheaper. Nonetheless, virtual memory's advantages remain potent, especially in combination with segments. Consider the following.
If each segment has its own page table, then the OS can devote a huge virtual address space to each segment. A program may use a small fraction of the address space given it, but the system does not need to allocate space for unused pages anywhere — not even on disk. Thus, the only loss is the space used for the page table. This saves the programmer from having to worry about memory limits at all, except for whatever very large limit the OS gives. In Windows 2000, for example, each process gets its own 2GB virtual address space; if 100 processes are active (which would be normal), this amounts to 200GB of virtual address space.
The OS can easily arrange for multiple processes to share the same page of memory. Each process can have different page table entries refer to the same virtual page. This permits for processes to share data. Using this memory, one process can quickly transfer information to another. (Without this capability, the operating system would be forced to be the middleman, and it would slow communication considerably.)
Suppose a Unix shell does a fork system call, with the child process quickly following it with an execvp system call. This seems like a waste of time, since all the shell's memory must be duplicated for the fork (and copying all that memory takes time), but it is used only briefly before being replaced with the new program. But if each segment has its own page table, then the OS can duplicate the page table only, with the two processes sharing pages. The OS can wait to duplicate each page until either the parent or the child modifies a page in memory. In fact, very few pages will be modified: The child will make its execvp system call very soon, at which time it no longer shares this memory; in the meantime, neither process will have the chance to modify much memory, so very few pages need to be duplicated. Thus the fork system call does not actually incur the cost of copying the process's memory in most cases.
These many advantages of virtual memory retain its usefulness in systems with plentiful RAM.



