Introduction

Using base and limit registers, the operating system can easily relocate processes to different areas of physical memory. Notice, however, that each address space has a large chunk of "free" space between the stack and the heap. That space is not used by the process, yet it still occupies actual physical memory. Virtual memory implemented with just a single base register and limit register is therefore quite wasteful.

Generalized base/limit

To solve this problem, the idea of segmentation was born. The idea is simple: instead of a single base/limit pair in the MMU, have one pair for each logical segment of the address space. A segment is just a contiguous region of the address space of a particular length. In a typical address space there are three logically different segments: code, stack, and heap. Segmentation lets the operating system place each segment in a different part of physical memory, so the unused portions of the virtual address space no longer occupy physical memory.

Let's look at an example. As shown in the figure, 3 segments are placed in 64KB of physical memory.

[Figure: three segments (code, heap, stack) placed in 64KB of physical memory]

As you might guess, hardware support in the MMU is needed for segmentation: in this case, a set of three base/limit register pairs. The following table shows the register values for the example above; each limit register records the size of its segment.

[Table: base and size (limit) register values for the code, heap, and stack segments in the example]

Which segment are we referring to?

The hardware uses the segment registers during address translation. But how does it know which segment an address refers to, and the offset within that segment?

A common method, sometimes called the explicit approach, is to use the top few bits of the virtual address to select the segment. In our example there are three segments, so two bits are needed. If we use the top two bits of a 14-bit virtual address, the address looks like this:

[Figure: a 14-bit virtual address; the top 2 bits select the segment, the remaining 12 bits are the offset]

The top two bits tell the hardware which segment we are referring to, and the remaining 12 bits are the offset into the segment. The hardware therefore uses the top two bits to select a base/limit register pair, then adds the low 12 bits to the base register to produce the final physical address. Note that using the offset also simplifies the bounds check: we only need to verify that the offset is less than the limit; any offset at or beyond the limit is illegal.
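To make this concrete, here is a minimal C sketch of the translation the hardware performs for the explicit approach, assuming a 14-bit virtual address whose top two bits select the segment. The base and limit values are illustrative placeholders, not necessarily the values in the figures above:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SEG_SHIFT   12      /* low 12 bits are the in-segment offset */
#define SEG_MASK    0x3000  /* top 2 bits of the 14-bit address      */
#define OFFSET_MASK 0xFFF

/* Illustrative register values: one base/limit pair per segment. */
static uint32_t base[4]  = { 32 * 1024, 34 * 1024, 28 * 1024, 0 };
static uint32_t limit[4] = {  2 * 1024,  2 * 1024,  2 * 1024, 0 };

uint32_t translate(uint32_t vaddr) {
    uint32_t seg    = (vaddr & SEG_MASK) >> SEG_SHIFT; /* pick a register pair */
    uint32_t offset = vaddr & OFFSET_MASK;
    if (offset >= limit[seg]) {          /* bounds check against the limit */
        fprintf(stderr, "segmentation fault\n");
        exit(EXIT_FAILURE);
    }
    return base[seg] + offset;           /* base + offset = physical address */
}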

The scheme above uses two bits to distinguish segments, but there are only three segments (code, heap, stack), so a quarter of the address space goes unused. For this reason, some systems place the heap and the stack in the same segment, so that a single bit suffices.

The hardware can also determine the segment in other ways. In the implicit approach, the hardware infers the segment from how the address was generated. For example, if the address comes from the program counter (i.e., it is an instruction fetch), it lies in the code segment; if it is relative to the stack or base pointer, it lies in the stack segment; any other address lies in the heap segment.

What about the stack?

One key difference between the stack and the other segments is that it grows backward (toward lower addresses), so address translation must work differently. First, we need a little extra hardware support: besides base and limit values, the hardware also needs to know which way each segment grows (one bit per segment, say 1 for growth toward higher addresses and 0 for the reverse). For a negatively growing segment, the hardware computes a negative offset by subtracting the maximum segment size from the in-segment offset, then adds that negative value to the base.
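A worked example with hypothetical numbers (a stack based at physical address 28KB, with a 4KB maximum segment size): an address whose in-segment offset is 3KB yields a negative offset of 3KB - 4KB = -1KB, so the physical address is 28KB - 1KB = 27KB. A minimal sketch under those assumptions:

#include <stdint.h>
#include <stdlib.h>

/* Translate an in-segment offset within a negatively growing segment.
 * max_seg_size is the amount of address space reserved for the segment. */
uint32_t translate_negative(uint32_t offset, uint32_t base,
                            uint32_t max_seg_size, uint32_t limit) {
    int32_t neg_offset = (int32_t)offset - (int32_t)max_seg_size;
    if ((uint32_t)(-neg_offset) > limit)  /* bounds check the magnitude */
        abort();                          /* illegal address */
    return base + (uint32_t)neg_offset;   /* e.g. 28KB + (-1KB) = 27KB */
}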

Support for sharing

As the segmentation mechanism matured, people quickly realized that with a little more hardware support, new efficiencies could be achieved. Specifically, to save memory, it is sometimes useful to share certain memory segments between address spaces.

To support sharing, some extra hardware support is needed, in the form of protection bits. A few bits are added to each segment, indicating whether a program can read or write the segment, or execute code within it. By marking a code segment read-only, the same code can be shared across multiple processes without breaking isolation: each process still believes it owns the memory exclusively, while the operating system secretly shares it; because no process can modify the segment, the illusion is preserved.

With protection bits, the hardware algorithm described earlier must also change. Besides checking whether a virtual address is within bounds, the hardware must check whether a particular access is permitted. If a user process attempts to write to a read-only segment, or execute an instruction from a non-executable segment, the hardware raises an exception and lets the operating system deal with the offending process.
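A sketch of that extra check, with hypothetical per-segment protection bits (the names and values here are invented for illustration):

#include <stdbool.h>
#include <stdint.h>

#define SEG_READ  0x1
#define SEG_WRITE 0x2
#define SEG_EXEC  0x4

/* Hypothetical protection bits: code is read/execute (and thus
 * shareable), heap and stack are read/write. */
static uint8_t prot[4] = {
    SEG_READ | SEG_EXEC,    /* code  */
    SEG_READ | SEG_WRITE,   /* heap  */
    SEG_READ | SEG_WRITE,   /* stack */
    0
};

/* After the bounds check, the hardware also verifies the access type;
 * if this returns false, it raises an exception for the OS to handle. */
bool access_ok(uint32_t seg, uint8_t wanted) {
    return (prot[seg] & wanted) == wanted;
}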

Operating system support

Segmentation also raises some new problems. The first is an old question: what should the operating system do on a context switch? The answer is straightforward: the contents of the segment registers must be saved and restored. Each process has its own virtual address space, and the OS must make sure these registers hold the right values before the process runs again.

The second problem is more important: managing the free space of physical memory. When a new address space is created, the operating system has to find room in physical memory for its segments. Previously, we assumed all address spaces were the same size, so physical memory could be viewed as a set of slots into which processes were dropped. Now each process has several segments, and each segment may be a different size.

The general problem is that physical memory quickly becomes riddled with small holes of free space, making it hard to allocate new segments or grow existing ones. This problem is called external fragmentation [R69], as shown in the figure.

[Figure: physical memory suffering from external fragmentation: free space scattered in small holes]

One solution to this problem is to compact physical memory by rearranging the existing segments. For example, the operating system can stop the running processes, copy their data into one contiguous region of memory, and update their segment register values to point to the new physical locations, thereby obtaining a large contiguous extent of free space. Compaction, however, is expensive: copying segments is memory-intensive and generally consumes a fair amount of processor time.

A simpler approach is to use a free-list management algorithm that tries to keep large extents of memory available for allocation. There are literally hundreds of such algorithms, including the classic best fit (which returns the free block closest in size to the request), worst fit, first fit, and more complex schemes like the buddy algorithm. Unfortunately, no matter how sophisticated the algorithm, external fragmentation will still exist; a good algorithm merely attempts to minimize it.

Free-space management

Let's set virtual memory aside for the moment and discuss some issues of free-space management. Whether the operating system implements virtual memory with segmentation or a user-level allocation library (such as malloc() and free()) manages the heap, the free space being managed consists of units of variable size. In both cases, external fragmentation arises, making management harder.

Assumptions

First, we assume the basic interface is the one provided by malloc() and free(). Specifically, void *malloc(size_t size) takes a single parameter, size, the number of bytes requested by the application, and returns a pointer to a region at least that large. The corresponding routine void free(void *ptr) takes a pointer and frees the corresponding block.

Suppose further that we are mainly concerned with external fragmentation, setting internal fragmentation aside for now. We also assume that once memory is handed to a client, it cannot be relocated elsewhere. Finally, we assume the allocator manages a contiguous range of bytes.

Low-level mechanisms

Splitting and coalescing

The free list contains a set of elements that record which space in the heap has not been allocated. Assume the following 30-byte heap:

[Figure: a 30-byte heap: bytes 0-9 free, bytes 10-19 in use, bytes 20-29 free]

The free list corresponding to this heap is as follows:

[Figure: the matching free list: head -> (addr: 0, len: 10) -> (addr: 20, len: 10) -> NULL]

As you can see, any request for more than 10 bytes will fail (returning NULL), because no contiguous free chunk is large enough. But what happens if the request is for less than 10 bytes? Say a request for a single byte arrives. The allocator performs what is called splitting: it finds a free chunk that can satisfy the request and splits it in two; the first piece is returned to the caller, and the second remains on the free list. In our example, if the allocator chooses the second free chunk, the call to malloc() returns 20 (the address of the 1-byte allocated region), and the free list becomes:

[Figure: the free list after splitting: head -> (addr: 0, len: 10) -> (addr: 21, len: 9) -> NULL]

Many allocators also implement a mechanism known as coalescing. What happens if the application calls free(10), returning the space in the middle of the heap? If the allocator simply added this chunk to the list, we would end up with three 10-byte chunks even though the entire heap is free. Instead, when a block is released, the allocator merges it with any adjacent free chunks. After coalescing, the final free list looks like this:

[Figure: the free list after coalescing: head -> (addr: 0, len: 30) -> NULL]
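A minimal sketch of the two operations on a free list like the one above (integer addresses stand in for real pointers; the node layout is invented for illustration):

#include <stdlib.h>

typedef struct node {
    int addr;                /* start of this free chunk */
    int len;                 /* its length in bytes      */
    struct node *next;
} node_t;

/* Split: carve `size` bytes off the front of free chunk n
 * (assumes n->len > size); returns the address handed to the caller. */
int split(node_t *n, int size) {
    int user = n->addr;      /* first piece goes to the user     */
    n->addr += size;         /* remainder stays on the free list */
    n->len  -= size;
    return user;
}

/* Coalesce: merge n with its successor when they are adjacent,
 * assuming the list is kept sorted by address. */
void coalesce(node_t *n) {
    node_t *next = n->next;
    if (next && n->addr + n->len == next->addr) {
        n->len += next->len;
        n->next = next->next;
        free(next);
    }
}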

Tracking the size of allocated regions

As you may have noticed, the free(void *ptr) interface takes no size parameter. To make this work, most allocators store a little extra information in a header, usually kept in memory just before the returned block. The header contains at least the size of the allocated region; it may also hold extra pointers to speed up deallocation, a magic number for integrity checking, and other information. We assume a simple header containing the size of the allocated region and a magic number:

typedef struct header_t {
    int size;    /* size of the allocated region, not counting the header */
    int magic;   /* integrity check, e.g. 1234567 */
} header_t;

It looks like this in memory:

[Figure: an allocated region in memory; the header (size, magic) sits just before the pointer returned to the caller]

When the user calls free(ptr), the library uses simple pointer arithmetic to locate the header. With a pointer to the header, the library can easily check whether the magic number matches the expected value as a sanity check (assert(hptr->magic == 1234567)) and compute the total size of the region being freed (the size recorded in the header plus the size of the header itself). Thus, when a user requests N bytes of memory, the library does not search for a free chunk of exactly N bytes, but for one of N plus the header size.
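A sketch of that pointer arithmetic, reusing the header_t above (the magic value matches the assert in the text; the function name is ours):

#include <assert.h>
#include <stddef.h>

void free_region(void *ptr) {
    header_t *hptr = (header_t *)ptr - 1;  /* step back over the header */
    assert(hptr->magic == 1234567);        /* sanity check */
    size_t total = hptr->size + sizeof(header_t); /* whole region, header included */
    /* ... insert the total-byte region back into the free list ... */
    (void)total;
}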

Growing the heap

Most traditional allocators start with a small heap and ask the operating system for more memory when it runs out. Usually this means making some kind of system call (for example, sbrk in most UNIX systems) to grow the heap. When the operating system services the sbrk call, it finds free physical pages, maps them into the address space of the requesting process, and returns the new end of the heap. The heap is then larger, and the request can be satisfied.
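A sketch of such a request, assuming a UNIX system that provides sbrk() (which returns the previous program break on success and (void *)-1 on failure):

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Ask the OS to extend the heap by `bytes` bytes. */
void *grow_heap(intptr_t bytes) {
    void *old_end = sbrk(bytes);
    if (old_end == (void *)-1) {
        perror("sbrk");
        return NULL;
    }
    return old_end;  /* start of the newly usable region */
}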

Basic strategies

An ideal allocator would be both fast and minimize fragmentation. Unfortunately, because the stream of allocation and free requests is arbitrary, any particular strategy can perform very badly on some inputs.

Best fit

The best-fit strategy is very simple: first, search the whole free list for chunks as large or larger than the request; then return the smallest chunk in that group of candidates. The intuition behind best fit is also simple: by returning a block close to what the user asks for, it tries to avoid wasting space. However, a naive implementation pays a heavy performance cost for the exhaustive search.

Worst fit

The worst-fit approach is the opposite of best fit: find the largest chunk, split it to satisfy the request, and keep the remainder on the free list. Worst fit tries to keep large chunks available instead of littering the list with the small, hard-to-use chunks that best fit leaves behind. It too requires a full scan of the free list, and most studies show it performs badly, producing excessive fragmentation while still carrying high overhead.

First fit

The first-fit strategy simply finds the first chunk that is big enough and returns the requested amount to the user. First fit has a speed advantage (no exhaustive scan is needed), but it tends to pollute the beginning of the free list with small chunks. How the allocator orders the list therefore becomes important. One approach is address-based ordering: by keeping the list sorted by memory address, coalescing becomes easy, which reduces fragmentation.

Next fit

Instead of always starting from the head of the list like first fit, the next-fit algorithm keeps an extra pointer to where the previous search ended. The idea is to spread the search for free space across the whole list, avoiding repeated splintering at the beginning of it. Its performance is very close to first fit, and an exhaustive search is again avoided.
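To make the differences concrete, here is a sketch of the three search policies over a singly linked free list (the node layout is invented for illustration; splitting and list surgery are omitted):

#include <stddef.h>

typedef struct node {
    size_t len;
    struct node *next;
} node_t;

/* First fit: return the first chunk that is large enough. */
node_t *first_fit(node_t *head, size_t size) {
    for (node_t *n = head; n; n = n->next)
        if (n->len >= size)
            return n;
    return NULL;
}

/* Best fit: scan the whole list, remembering the smallest chunk
 * that still satisfies the request. */
node_t *best_fit(node_t *head, size_t size) {
    node_t *best = NULL;
    for (node_t *n = head; n; n = n->next)
        if (n->len >= size && (!best || n->len < best->len))
            best = n;
    return best;
}

/* Next fit: like first fit, but resume from where the previous
 * search stopped (*cursor), wrapping around the list once. */
node_t *next_fit(node_t *head, node_t **cursor, size_t size) {
    node_t *start = *cursor ? *cursor : head;
    node_t *n = start;
    do {
        if (n && n->len >= size) { *cursor = n->next; return n; }
        n = (n && n->next) ? n->next : head;
    } while (n != start);
    return NULL;
}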

Other approaches

In addition to the above-mentioned basic strategies, many techniques and algorithms have been proposed to improve memory allocation.

Segregated lists

The basic idea is simple: if an application frequently requests one (or a few) sizes of memory, keep a separate list just for objects of that size; all other requests are forwarded to a more general allocator.

The benefits of this approach are obvious. By dedicating a portion of memory to requests of one particular size, fragmentation is much less of a concern. Moreover, because no complicated list search is needed, allocation and free requests for that size are served very quickly.
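A minimal sketch of one such cache for a single object size (assuming obj_size is at least the size of a pointer, since free objects are threaded through their own storage; the names are invented for illustration):

#include <stdlib.h>

typedef struct obj { struct obj *next; } obj_t;

typedef struct cache {
    size_t obj_size;  /* every object in this cache has this size */
    obj_t *free;      /* singly linked list of free objects       */
} cache_t;

void *cache_alloc(cache_t *c) {
    if (!c->free)                     /* empty: fall back to the   */
        return malloc(c->obj_size);   /* general-purpose allocator */
    obj_t *o = c->free;
    c->free = o->next;                /* pop a pre-sized object, no search */
    return o;
}

void cache_free(cache_t *c, void *p) {
    obj_t *o = p;
    o->next = c->free;                /* push back in constant time */
    c->free = o;
}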

However, this approach also introduces new complexity. For example, how much memory should be dedicated to serving requests of one size, versus serving general requests? The slab allocator used in the Solaris kernel handles this problem elegantly.

Specifically, when the kernel boots, it creates a number of object caches for kernel objects that are frequently requested, such as locks and file-system inodes. Each object cache is thus a segregated free list of a given size, serving allocation and free requests quickly. When a cache is running low on free space, it requests slabs of memory from the general allocator (the total amount requested being a multiple of both the page size and the object size). Conversely, when the reference counts of all objects in a given slab drop to zero, the general allocator can reclaim the slab from the specialized cache; this usually happens when the virtual memory system needs more memory.

The slab allocator also goes beyond most segregated lists by keeping the objects on its free lists in a pre-initialized state. Initializing and destroying data structures is costly; by keeping freed objects initialized, the slab allocator avoids frequent initialization and destruction cycles, significantly lowering overhead.

Buddy allocation

Because coalescing is critical to an allocator, some schemes have been designed around making coalescing simple. A good example is the binary buddy allocator.

In this scheme, free memory is first conceptually viewed as one big space of size 2ⁿ. When a memory request arrives, the free space is recursively divided in two until a block just large enough to satisfy the request is found, at which point that block is returned to the user. In the following example, a 64KB free space is split to satisfy a 7KB request:

[Figure: a 64KB free space repeatedly split in half (64KB -> 32KB -> 16KB -> 8KB) until an 8KB block can satisfy the 7KB request]

Note that this scheme can only hand out blocks whose sizes are powers of two, so internal fragmentation is a concern: the 7KB request above actually receives an 8KB block.

The beauty of the buddy system shows when memory is freed. When the 8KB block is returned to the free list, the allocator checks whether its "buddy" 8KB block is also free; if so, it merges the two into one 16KB block. It then checks whether the buddy of that 16KB block is free, merging again if so. This recursive coalescing continues up the tree, either restoring the entire free space or stopping when a buddy is found to be in use.

The reason buddy allocation works so well is that it is cheap to determine the buddy of any block: the address of a block and the address of its buddy differ in exactly one bit, and which bit is determined by the block's level in the buddy tree.
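Concretely, for power-of-two-sized, naturally aligned blocks, the buddy's offset within the managed region can be found by flipping the single address bit corresponding to the block's size:

#include <stddef.h>
#include <stdint.h>

/* XOR flips exactly the bit equal to the (power-of-two) block size. */
uintptr_t buddy_of(uintptr_t block_offset, size_t block_size) {
    return block_offset ^ (uintptr_t)block_size;
}

/* Example: for 8KB blocks in a 64KB region,
 *   buddy_of(0, 8192)    == 8192
 *   buddy_of(8192, 8192) == 0     */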

Other ideas

Many of the approaches above share an important problem: a lack of scaling. Specifically, searching a list can be slow. Therefore, more advanced allocators use more complex data structures to reduce this cost, trading simplicity for performance; examples include balanced binary trees, splay trees, and partially ordered trees.

Given that modern systems usually have multiple cores and run multi-threaded programs, much work has gone into improving allocator performance on multi-core systems. If you are interested, you can read more about how the glibc allocator works.

