Preface
Virtual memory is one of the most important abstractions in modern computer systems. It was introduced to manage memory more effectively and to reduce the likelihood of memory errors. Virtual memory touches every aspect of the computer, including hardware design, the file system, shared objects, and process/thread scheduling. Every programmer who wants to write efficient, robust programs should study virtual memory in depth.
This article analyzes how virtual memory works in a comprehensive and in-depth way, to help readers build a quick but solid understanding of this important concept.
Computer memory
Memory is one of the core components of a computer. In a perfectly ideal world, memory would have the following three characteristics at the same time:
- Fast enough: memory access should be faster than the CPU can execute an instruction, so that the CPU's efficiency is never limited by memory
- Large enough: the capacity should hold all the data the computer needs
- Cheap enough: the price should be low enough that every class of computer can be equipped with it
But reality is often cruel: current computer technology cannot satisfy all three conditions at once, so modern computers adopt a hierarchical memory design:
From top to bottom, the memory types in a modern computer are: registers, caches, main memory, and disks. Going down the hierarchy, speed decreases step by step while capacity increases step by step. Registers are the fastest: they are built from the same material as the CPU, so they are as fast as the CPU itself and the CPU experiences no delay when accessing them. However, because registers are expensive, their capacity is extremely small: a 32-bit CPU generally has a register capacity of 32×32 bits, and a 64-bit CPU 64×64 bits. Either way, total register capacity is less than 1 KB, and registers must be managed by software.
The second layer is the cache: the L1, L2, and L3 CPU caches we usually talk about. Generally, L1 is exclusive to each CPU core and L3 is shared by all cores, while L2 is designed as either shared or exclusive depending on the architecture. For example, Intel's multi-core chips use a shared L2 while AMD's multi-core chips use an exclusive L2.
The third layer is main memory, usually called Random Access Memory (RAM). It is the memory that exchanges data directly with the CPU; it can be read and written at any time (except while refreshing) and is very fast. It usually serves as the temporary storage medium for the operating system and running programs.
Finally, there is the disk. Compared with main memory, the cost per bit is about two orders of magnitude lower, so the capacity is much larger, while the access speed is about three orders of magnitude slower. Mechanical hard disks are slow mainly because the actuator arm must constantly move across the platters, waiting for the target sector to rotate under the head before reading or writing, so efficiency is very low.
Main memory
Physical memory
The physical memory we usually refer to is the third kind of memory above: RAM main memory. It exists in the form of memory modules plugged into the motherboard's memory slots and is used to load programs and data of all kinds for the CPU to run and use directly.
Virtual Memory
There is a saying in the computer field treated almost as reverently as the Ten Commandments: "Any problem in computer science can be solved by adding an indirect middle layer." From memory management, network models, and concurrent scheduling all the way to hardware architecture, you can see this philosophy shining through, and virtual memory is one of its perfect embodiments.
Virtual memory is a very important memory abstraction in modern computers. It mainly addresses applications' ever-growing demand for memory: although physical memory capacity has grown rapidly, it still cannot keep up with applications' demand for main memory, so a method is needed to resolve the capacity gap between the two. To manage memory more efficiently and eliminate program errors as much as possible, modern computer systems abstract physical RAM into the technology we call Virtual Memory (VM).
The core principle of virtual memory is: give each program its own "contiguous" virtual address space, divide this address space into multiple pages (Page) with contiguous address ranges, and map these pages to physical memory dynamically while the program runs. When the program references an address that is present in physical memory, the hardware performs the necessary mapping immediately; when it references an address that is not in physical memory, the operating system loads the missing part into physical memory and then re-executes the failed instruction.
In fact, from a certain point of view, virtual memory is like a modern combination of the base register and limit register techniques: it allows a process's entire address space to be mapped to physical memory in smaller virtual units, without relocating the program's code and data addresses.
The virtual address space is divided into fixed-size units called pages, and the corresponding units of physical memory are called page frames. The two are generally the same size, typically 4KB, though real systems use page sizes anywhere from 512 bytes to 1 GB. This is virtual memory's paging technique. Because the address space is virtual, each process is granted the full range (4GB on a 32-bit architecture), but it is of course impossible to hand 4GB of physical memory to every running process. Virtual memory therefore also relies on swapping, driven by a page replacement algorithm: only the memory currently in use is allocated and mapped while a process runs, temporarily unused data is written back to disk as a copy, and it is read back into memory when needed, dynamically exchanging data between disk and memory.
Page table
Every time a virtual address is mapped to a physical address, the page table (Page Table) must be consulted. Mathematically, the page table is a function: the input is the virtual page number (Virtual Page Number, VPN), and the output is the physical page frame number (Physical Page Number, PPN), which is the base address of the physical address.
The page table is composed of multiple page table entries (Page Table Entry, PTE). The exact structure of a PTE depends on the machine architecture, but they are all broadly similar. Generally, a PTE stores the physical page frame number plus information such as a modified bit, an accessed bit, protection bits, and a present/absent (valid) bit.
- Physical page frame number: This is the most important field value in the PTE. After all, the meaning of the page table is to provide a mapping from VPN to PPN.
- Valid bit: indicates whether the page currently resides in main memory: 1 means present, 0 means absent. When a process tries to access a page whose valid bit is 0, it triggers a page fault.
- Protection bits: indicate what kinds of access the page allows, for example 0 for read/write and 1 for read-only.
- Modified bit and accessed bit: these record how the page is being used and are mainly consumed by the page replacement algorithm. When a program writes to a memory page, the hardware automatically sets the modified bit. Later, if a page fault forces the replacement algorithm to evict this page to make room for the incoming one, it first checks the modified bit: if set, the page is dirty (Dirty Page) and its latest contents must be written back to disk; otherwise the copies in memory and on disk are already in sync and no write-back is needed. The accessed bit is likewise set automatically by the hardware whenever the program touches the page; the replacement algorithm uses it to decide which pages to evict, since pages that are no longer being accessed are better candidates for eviction.
- Cache-disable bit: used to keep a page out of the CPU cache. This mainly applies to pages mapped to the registers of real-time I/O devices rather than ordinary main memory: such devices need the very latest data, while the CPU cache might hold a stale copy.
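As an illustrative sketch only (the field names are our own, not any real architecture's PTE layout), an entry with these fields can be modeled like this:

```python
from dataclasses import dataclass

@dataclass
class PageTableEntry:
    """Illustrative PTE; real layouts pack these fields into machine words."""
    ppn: int = 0                  # physical page frame number
    valid: bool = False           # present in main memory?
    protection: int = 0           # e.g. 0 = read/write, 1 = read-only
    modified: bool = False        # "dirty" bit, set by hardware on write
    accessed: bool = False        # set by hardware on any access
    cache_disabled: bool = False  # bypass CPU cache (e.g. memory-mapped I/O)

# A single-level page table is then just an array of PTEs indexed by VPN:
page_table = [PageTableEntry() for _ in range(16)]
page_table[2] = PageTableEntry(ppn=5, valid=True)
```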
Address translation
All the memory addresses a process generates at runtime are virtual addresses. If the computer had no virtual memory abstraction, the CPU would put these addresses directly on the memory address bus and access the physical address with the same value. With virtual memory, the CPU instead sends these virtual addresses over the address bus to the Memory Management Unit (MMU), which translates each virtual address into a physical address and then accesses physical memory over the memory bus:
A virtual address (for example the 16-bit address 8196 = 0010 0000 0000 0100) is divided into two parts: the virtual page number (Virtual Page Number, VPN; here the high 4 bits) and the offset (Virtual Page Offset, VPO; here the low 12 bits). The conversion from virtual address to physical address is accomplished through the page table.
Let's use an example to analyze how the hardware interacts when the page hits:
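Splitting an address into VPN and VPO is just a shift and a mask. A minimal sketch, assuming the 4KB pages (12-bit offset) used in the example above:

```python
PAGE_SHIFT = 12                  # 4KB pages -> 12-bit offset
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def split(va):
    """Split a virtual address into (VPN, VPO)."""
    return va >> PAGE_SHIFT, va & PAGE_MASK

vpn, vpo = split(8196)           # 8196 = 0b0010_0000_0000_0100
print(vpn, vpo)                  # -> 2 4
```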
- Step 1 : The processor generates a virtual address VA and sends it to the MMU via the bus;
- Step 2 : MMU obtains the address PTEA of the page table entry through the virtual page number, and reads the page table entry PTE from the CPU cache/main memory through the memory bus;
- Step 3 : CPU cache or main memory returns the page table entry PTE to the MMU through the memory bus;
- Step 4 : The MMU copies the physical page frame number PPN from the page table entry into the upper 3 bits of the register, then copies the 12-bit offset VPO into the low 12 bits, forming a 15-bit physical address. The physical address PA in the register can now be sent over the memory bus to access the cache/main memory;
- Step 5 : CPU cache/main memory returns the data corresponding to the physical address to the processor.
When the MMU performs address translation, if the valid bit of the page table entry is 0, the page is not mapped to a real physical page frame and a page fault is triggered: the CPU traps into the operating system kernel, and the operating system runs the page replacement algorithm to select a page to swap out, making room for the page about to be brought in. If the modified bit in the victim page's PTE is set, that is, the page has been updated, it is a dirty page (Dirty Page) and must be written back to disk to refresh the disk copy; if the page is "clean", that is, unmodified, the newly loaded page simply overwrites the evicted one.
The specific process of page fault interruption is as follows:
- Steps 1 to 3 : the same as the first three steps of the page-hit flow above;
- Step 4 : the MMU checks the returned page table entry PTE, finds its valid bit is 0, and raises a page fault exception; the CPU then transfers control to the page fault handler in the operating system kernel;
- Step 5 : the page fault handler checks that the faulting virtual address is legal. Once confirmed, the system checks whether a free physical page frame (PPN) is available to map to the missing virtual page; if none is free, it runs the page replacement algorithm to pick an existing virtual page to evict, and if that page has been modified it is written back to disk to update the disk copy;
- Step 6 : the operating system loads the new page from disk into memory and updates the page table entry PTE;
- Step 7 : the page fault handler returns to the original process and re-executes the instruction that caused the fault. The CPU resends the faulting virtual address to the MMU; this time the virtual address has a mapped physical page frame number PPN, so translation follows the "page hit" flow above, and main memory finally returns the requested data to the processor.
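The page-hit and page-fault flows can be condensed into one toy sketch: an MMU that returns a physical address on a hit and falls back to a minimal, simulated fault handler on a miss. All names and structures here are illustrative, not a real MMU implementation:

```python
PAGE_SHIFT = 12                            # 4KB pages

def translate(page_table, free_frames, va):
    """Toy MMU: page-hit path plus a minimal page-fault handler."""
    vpn = va >> PAGE_SHIFT
    vpo = va & ((1 << PAGE_SHIFT) - 1)
    pte = page_table[vpn]
    if not pte["valid"]:                   # page fault: trap to the "kernel"
        pte["ppn"] = free_frames.pop()     # pretend we loaded the page from disk
        pte["valid"] = True                # update the PTE, then retry
    return (pte["ppn"] << PAGE_SHIFT) | vpo

page_table = [{"valid": False, "ppn": 0} for _ in range(8)]
page_table[2] = {"valid": True, "ppn": 5}
free_frames = [7]

print(hex(translate(page_table, free_frames, 0x2004)))  # hit   -> 0x5004
print(hex(translate(page_table, free_frames, 0x3000)))  # fault -> 0x7000
```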
Virtual memory and cache
When analyzing how virtual memory works, we lumped main memory and the cache together when discussing where the page table is stored, to keep things simple. A more detailed picture is the following schematic:
If a computer has both virtual memory and a CPU cache, the MMU first tries to find each address in the cache; if the cache hits, it returns directly, and only on a cache miss does it go to main memory.
Generally speaking, most systems address the cache by physical memory address: the cache is much smaller than main memory, so physical addressing is not overly complex; moreover, because the cache's capacity is small, the system wants processes to share data blocks as much as possible, and physical addressing makes it straightforward for multiple processes to hold blocks in the cache at the same time and to share blocks from the same virtual memory page.
Speed up translation & optimize page table
After the analysis above, readers should now understand the basics of virtual memory, paging, and address translation. We can now introduce the two core requirements, or rather bottlenecks, of virtual memory:
- The mapping from virtual address to physical address must be very fast: how do we speed up address translation?
- Growth of the virtual address range inevitably makes the page table expand, producing a large page table.
These two factors determine whether virtual memory can truly be used widely in computers. How are these two problems solved?
As the saying quoted at the beginning goes, any problem "can be solved by adding an indirect middle layer." Although virtual memory is itself a middle layer, problems within a middle layer can be solved by introducing yet another one.
Address translation is currently sped up by introducing a page table cache module, the TLB, while the large page table is solved by implementing multi-level page tables or inverted page tables.
TLB acceleration
The Translation Lookaside Buffer (TLB), also called the fast table, is used to speed up virtual address translation. Because of virtual memory's paging mechanism, the page table is stored in a fixed area of memory, and the MMU must look up a matching PTE in the page table every time it translates a virtual address. As a result, each memory access a process makes through the MMU costs one extra memory access compared with a system without paging, generally tens to a few hundred CPU clock cycles, cutting performance by at least half. If the PTE happens to be in the CPU's L1 cache, the overhead drops to one or two cycles, but we cannot hope that the needed PTE is always in L1, so an acceleration mechanism is needed: the TLB fast table.
The TLB can be understood simply as a page table cache that stores the most frequently accessed page table entries. Because the TLB is generally implemented in hardware, it is extremely fast. When the MMU receives a virtual address, it usually first matches it against the TLB in parallel. If it hits, and the access does not violate the protection bits (for example, attempting to write a read-only address), the MMU takes the physical page frame number PPN directly from the TLB and returns. If it misses, the MMU falls through to the page table in main memory; after the lookup, the newly found page table entry is stored into the TLB ready for the next hit, and if the TLB is full, one of the existing PTEs is evicted to make room.
Let's analyze TLB hits and misses in detail below.
TLB hit :
- Step 1 : CPU generates a virtual address VA;
- Steps 2 and 3 : the MMU fetches the corresponding PTE from the TLB;
- Step 4 : MMU translates this virtual address VA into a real physical address PA, and sends it to the cache/main memory through the address bus;
- Step 5 : The cache/main memory returns the data on the physical address PA to the CPU.
TLB miss :
- Step 1 : CPU generates a virtual address VA;
- Steps 2 to 4 : the TLB lookup fails, so the normal main-memory page table walk is taken to get the PTE, which is then placed in the TLB for the next query; if the TLB is full at this point, this operation evicts another existing PTE from the TLB;
- Step 5 : MMU translates this virtual address VA into a real physical address PA, and sends it to the cache/main memory through the address bus;
- Step 6 : The cache/main memory returns the data on the physical address PA to the CPU.
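The hit/miss logic above can be modeled as a small software cache sitting in front of the page table. This is only a sketch with made-up sizes and an assumed LRU policy; real TLBs are hardware structures, often set-associative:

```python
from collections import OrderedDict

class TLB:
    """Toy TLB: a tiny VPN -> PPN cache with LRU replacement (illustrative)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def lookup(self, vpn, page_table):
        if vpn in self.entries:               # TLB hit: skip the page-table walk
            self.hits += 1
            self.entries.move_to_end(vpn)
            return self.entries[vpn]
        self.misses += 1                      # TLB miss: walk the page table...
        ppn = page_table[vpn]
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # full: evict least recently used
        self.entries[vpn] = ppn               # ...and cache the result
        return ppn

tlb = TLB()
page_table = {2: 5, 3: 7}
tlb.lookup(2, page_table)    # miss, cached
tlb.lookup(2, page_table)    # hit
print(tlb.hits, tlb.misses)  # -> 1 1
```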
Multi-level page table
Introducing the TLB solves, to a large extent, the cost of translating virtual addresses to physical addresses. The next problem to solve is the large page table.
Theoretically, the address space of a 32-bit computer is 4GB, which means every process running on it can in principle address 4GB of virtual memory. So far we have only discussed the single-page-table case. If every process recorded all of its theoretically available memory pages in one page table, even though the memory it actually uses may be only a small fraction, and given that the page table itself also lives in main memory, this would waste a great deal of memory and could even leave the machine with too little physical memory to run many processes in parallel.
This problem is generally solved with multi-level page tables (Multi-Level Page Tables), which split one large page table into a hierarchy. Let's look concretely at how a two-level page table can be designed. Suppose a 32-bit virtual address consists of a 10-bit first-level index, a 10-bit second-level index, and a 12-bit offset; each PTE is 4 bytes, the page size is 2^12 = 4KB, and 2^20 PTEs are needed in total. Each PTE in the first-level table maps a 4MB chunk of the virtual address space, each chunk consisting of 1024 consecutive pages; since the address space is 4GB, a mere 1024 first-level PTEs cover the entire process address space. Each PTE in a second-level table maps a 4KB virtual page, exactly as in the single-page-table design.
The key to multi-level page tables is that we do not need to allocate a second-level table for every PTE in the first-level table; we only allocate and map the addresses the process currently uses. For most processes, a large number of first-level PTEs are vacant, so the second-level tables they would point to simply need not exist, which is a considerable memory saving. In fact, for a typical program, most of the theoretically available 4GB of virtual address space sits in exactly this unallocated state. Furthermore, while the program runs, only the first-level table must stay in main memory; the virtual memory system can create, page in, and page out second-level tables on demand, ensuring that only the most frequently used second-level tables reside in memory. This, too, greatly relieves pressure on main memory.
The level depth of the multi-level page table can be continuously expanded as required. Generally speaking, the more levels there are, the higher the flexibility.
For example, with a k-level page table, the virtual address consists of k VPNs and one VPO, where each VPN i indexes the i-th level table, 1 <= i <= k. Each PTE in a level-j table (1 <= j <= k-1) points to the base address of some level-(j+1) table, while each PTE in a level-k table holds either the page frame number PPN of a physical address or the address of a disk block (when the page has been swapped out to disk by the replacement algorithm). To find the PPN, the MMU must access k PTEs and then append the offset VPO from the virtual address to form the physical address. Readers may worry about the performance cost of accessing k PTEs per translation; this is where the TLB comes in: by caching PTEs from every level in the TLB, the computer keeps multi-level page tables from lagging far behind single-level ones in performance.
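For the 10-10-12 split described earlier, extracting the two indices and walking a two-level table looks roughly like this (a sketch; the table contents and the `MemoryError` for unmapped regions are made up for illustration):

```python
def walk_two_level(l1_table, va):
    """Walk a two-level page table: 10-bit L1 index, 10-bit L2 index, 12-bit offset."""
    l1_idx = (va >> 22) & 0x3FF    # top 10 bits select the first-level PTE
    l2_idx = (va >> 12) & 0x3FF    # next 10 bits select the second-level PTE
    offset = va & 0xFFF            # low 12 bits: offset within the page
    l2_table = l1_table[l1_idx]    # None means the 4MB chunk is unallocated
    if l2_table is None:
        raise MemoryError("access to unmapped region")
    ppn = l2_table[l2_idx]
    return (ppn << 12) | offset

l1_table = [None] * 1024           # only allocate L2 tables that are used
l1_table[1] = [0] * 1024
l1_table[1][3] = 42                # virtual page (1, 3) -> physical frame 42

va = (1 << 22) | (3 << 12) | 0x1AB
print(hex(walk_two_level(l1_table, va)))  # -> 0x2a1ab
```

Note that 1023 of the 1024 second-level tables here were never allocated, which is exactly the memory saving described above.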
Inverted page table
The other solution to the large page tables of paged virtual memory is the inverted page table (Inverted Page Table, IPT). Its principle resembles a search engine's inverted index: both work by inverting a mapping.
A search engine deals with two concepts: documents (doc) and keywords (keyword). The requirement is to quickly find the list of docs matching a keyword. If the search engine stored a forward index, mapping each doc to all of its keywords, then to find the docs for a given keyword we would have to scan every doc in the index, pick out those containing the keyword, score them with a ranking model, sort, and return; this design is clearly inefficient. So we invert the forward index into an inverted index, mapping each keyword to all the docs that contain it. Now a query for a keyword uses the inverted index to locate the matching docs directly, after which they are ranked and returned.
The description above is only a simplified account of search engine inverted indexes; real inverted index designs are far more complex. Interested readers can research them on their own; we won't expand on them here.
Back to virtual memory's inverted page table: it applies the same idea and inverts the mapping. The page table designs we saw earlier use the virtual page number as the PTE index and map it to a physical page frame number PPN; an inverted page table instead uses the PPN as the PTE index and maps it to a (process ID, virtual page number VPN) pair.
The inverted page table is especially efficient on CPU architectures with large address spaces, or rather in scenarios where the ratio of virtual address space to physical memory is very large, because it indexes PTEs by actual physical page frames instead of by a virtual memory space far larger than physical memory. Take the 64-bit architecture as an example: with a single page table and 12 bits of page offset (4KB pages), a fully theoretical layout needs 2^52 PTEs; at 8 bytes per PTE, the whole page table would require 32PB of memory, which is completely unacceptable. With an inverted page table and, say, 4GB of RAM, only 2^20 PTEs are needed, drastically reducing memory usage.
Although the inverted page table saves memory, it introduces another major flaw: address translation becomes much less efficient. The MMU's job is to translate virtual addresses into physical addresses, but the index structure has changed from VPN --> PPN to PPN --> VPN. When a process accesses a virtual address and the CPU hands the VPN to the MMU, the MMU, given the inverted design, does not even know whether this VPN corresponds to a page fault; it has to scan the entire inverted page table to find the VPN. Worse, this full-table scan is needed on every memory access, even for VPNs that are not faulting. In the 4GB RAM example above, that means scanning 2^20 PTEs per access, which is extremely inefficient.
This is when our old friend the TLB comes out again: we cache the frequently used pages in the TLB, and with hardware assistance, virtual address translation on a TLB hit is as fast as with a normal page table. On a TLB miss, however, the whole inverted page table must still be scanned in software, and a linear scan is very inefficient, so inverted page tables are generally implemented on top of a hash table. Suppose we have 1GB of physical memory, i.e. 2^18 4KB page frames. We build a hash table keyed by VPN, where each key maps to its stored (VPN, PPN) pairs; all VPNs with the same hash value are linked together into a collision chain. If we make the number of hash slots equal to the number of physical page frames, the average collision chain length in this inverted hash table will be a single PTE, which greatly speeds up lookup. Once a VPN is matched to its PPN through the inverted page table, the (VPN, PPN) mapping is immediately cached into the TLB to accelerate the next translation.
Inverted page tables are common on 64-bit computers, because on a 64-bit architecture, even if the page size of paged virtual memory is raised from the usual 4KB to 4MB, a gigantic page table of 2^42 PTEs would still have to sit in main memory (in theory; no one actually builds it this way), consuming enormous amounts of memory.
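A minimal sketch of such a hash-based inverted page table (names and sizes are our own, and a real implementation would also track valid/dirty bits and integrate with the TLB):

```python
class InvertedPageTable:
    """Toy inverted page table: one hash bucket per physical frame, keyed by (pid, vpn)."""
    def __init__(self, n_frames):
        # slots == frames keeps the average collision chain near one entry
        self.n_frames = n_frames
        self.buckets = [[] for _ in range(n_frames)]

    def insert(self, pid, vpn, ppn):
        self.buckets[hash((pid, vpn)) % self.n_frames].append((pid, vpn, ppn))

    def lookup(self, pid, vpn):
        for p, v, ppn in self.buckets[hash((pid, vpn)) % self.n_frames]:
            if (p, v) == (pid, vpn):
                return ppn       # found: would now be cached in the TLB
        return None              # not resident: page fault

ipt = InvertedPageTable(n_frames=8)
ipt.insert(pid=1, vpn=2, ppn=5)
print(ipt.lookup(1, 2))   # -> 5
print(ipt.lookup(1, 3))   # -> None
```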
Summary
Now let's review the core content of this article. Virtual memory is a middle layer between the computer's CPU and physical memory; its main purpose is to manage memory efficiently and reduce memory errors. Its core concepts are:
- Page table : mathematically, the page table is a function whose input is the virtual page number VPN and whose output is the physical page frame number PPN, the base address of the physical address. The page table is made up of page table entries, which store all the information address translation needs. It is the foundation on which virtual memory operates, since every virtual address must be translated into a physical address through it.
- TLB : a piece of hardware that solves the addressing performance problem virtual memory introduces and speeds up address translation. Without the TLB, virtual memory would remain an academic theory and could never have been widely adopted in computers.
- Multi-level page table and inverted page table : these solve the large page tables caused by the explosive growth of virtual address space. Multi-level page tables save memory by splitting the single page table and allocating virtual pages on demand; inverted page tables do so by inverting the mapping relationship.
Finally, virtual memory also involves the operating system's page replacement mechanism. Since page replacement is itself a fairly complex topic, this article stops short of analyzing it; we will cover it separately in a future article.
Reference & Further Reading
The main references for this article are the original English editions of "Modern Operating Systems" and "Computer Systems: A Programmer's Perspective". Readers who want to learn more about virtual memory can read these two books in depth and seek out other papers to study.