"How Linux Works" Reading Notes
Introduction
This is the most accessible and easy-to-follow introduction to operating system internals that I have read, but a book of just over 200 pages cannot cover very much, so keep your expectations modest and treat it as an introductory primer.
Book introduction
- Written by a front-line Fujitsu Linux kernel engineer drawing on more than ten years of experience; professional and practical
- Focuses on the core functions of Linux and explains how the operating system works in an easy-to-understand way
- 198 schematic diagrams, detailed and well chosen, help open the way into a weighty subject
- Includes many experimental programs, so you can learn while operating the system and experience how it behaves
Personal evaluation
The content is fairly basic, but it covers the ground related to Linux. The author also uses C programs to verify and test the operating system's cache, swap, the CPU process scheduler, and random versus sequential reads and writes on solid-state and mechanical drives; these programs are quite interesting.
That said, it is not easy to describe something as abstract as an operating system vividly. There is no doubt about the author's professional background in Linux kernel development at a first-tier Japanese manufacturer (you can look up the related Fujitsu material), and the book's progression from shallow to deep is well done.
Summary: a book that is hard to position. It is best used as a map of the topics, which you can then study in more depth elsewhere.
Resources
The book was published in March 2022, so few related resources are available yet.
The following are some C programs that simulate the lower layers of the operating system; download them if you are interested.
Link: https://pan.baidu.com/s/1eU65e1OKZEgMrxGdkWT2xA Extraction code: pghr
Note index
Note that the index of these notes does not follow the structure of the original book. I read this book "upside down": after looking at the table of contents, I found that reading from back to front, that is, from external storage toward main memory, better matched my own way of understanding how things work.
You can click a subtitle to jump to the relevant section.
The conventional wisdom is that knowledge should proceed from shallow to deep; in practice, alternating between harder and easier material sometimes suits one's study habits better.
Part 1: Introduction to Linux and External Structures
It mainly compares the working mechanisms of mechanical disks and SSDs, and the difference between sequential and random reads and writes. The interesting part is using C to measure the read and write performance of a disk.
It also introduces how Linux's file system interacts with devices, a design divided into 7 layers. The book only gives a brief summary, of course; to go deeper you will need to read more material.
Describes the related content of the IO scheduler and the read-ahead mechanism.
Part II: Linux File System Design
This section describes how to quickly grasp the design of a Linux file system. Of course, a file system's design cannot be fully explained in a few pages, but it is still worthwhile to get a general sense of Linux's overall design ideas.
Part III: Brief Analysis of Computer Storage Hierarchy
If you have a basic understanding of laptops or desktop motherboards, etc., or have a general understanding of the working process of the entire operating system, this section can be skipped completely.
In the computer's storage hierarchy, the closer a tier is to the CPU, the faster and more expensive it is and the smaller its capacity. Ordered from fast to slow, the common tiers are: registers -> cache -> main memory -> external storage. This section describes these storage tiers.
It then introduces the translation lookaside buffer, the page cache, the buffer cache, and some less commonly used Linux cache tuning parameters.
Part 4: Linux Memory Management and Optimization
Memory management is the core of how an operating system manages processes, and this part covers it; it is the most detailed part of the whole book. Personally, I think the key is to master demand paging and copy-on-write, two features that are used heavily; beyond that, it is also necessary to understand how memory is allocated and the details of the allocation process.
My notes for this part are split into an upper and a lower half, with some supplementary material:
Part 5: Process Scheduler
There are two mainstream approaches to CPU process scheduling. The first is preemptive scheduling, as in Windows, where different processes may receive unequal shares of CPU time; the other is time slicing, the common scheduler in Linux, where each process gets a roughly equal right to the CPU and hands it to the next process as soon as its slice is used up. Time slicing may delay some important tasks, but this way of scheduling keeps the system the most stable.
The process scheduler itself is very complex. In order to reduce the complexity, the author did not introduce too much, so the content of personal notes is relatively small.
Part 0: Overview of Computer Programs
To understand how an operating system works, we need the basic concepts of how computers represent information. If you already intend to study operating system internals, these basics will not be unfamiliar, so this part is treated as a brief recap.
Appendix
This part extends the material from Part 1 on how a physical disk is allocated; read it if you are interested.
Note ⚠: the final organization of these notes alternates between harder and easier material.
Introduction to Linux and External Structures
Introduction to HDD Disks
The logical structure of a mechanical disk can be pictured as a set of concentric circles, numbered from the outer edge inward. Sectors are numbered clockwise around each circle while the platter rotates counterclockwise, so as the head sweeps past it reads sectors in order of increasing number; the direction of head travel is the direction of increasing sector number.
In the structure diagram below, a track is one of the concentric circles, and a sector is one of the slices formed by cutting a track; it is called a sector because each slice looks like a fan blade. A magnetic head slides along the track to scan data, and sectors are numbered starting from 0, one number per sector.
Note that this is a top view of the disk: the lines are the physical "grooves" on the platter, and the sectors are the numbered blocks.
The following is a side view. Platters are stacked vertically, each paired with its own head, so data processing is sped up by using multiple heads.
Note that the minimum read/write unit of an HDD is one sector, which is 512 bytes; every sector is 512 bytes, regardless of whether it lies on an outer or an inner track.
⚠️Note: Many frameworks and databases set their read/write unit to 512 bytes. Because 512 bytes is the smallest read/write unit, no extra bookkeeping is needed to guarantee the atomicity of a single read or write.
Disk size calculation
On the earliest disks, the size of the whole disk could be calculated with the following formula, because every track held the same number of sectors:
Storage capacity = number of heads × number of cylinders (tracks) × sectors per track × bytes per sector
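As a rough illustration, a short C program can evaluate this formula; the geometry numbers below are hypothetical, not taken from the book:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* hypothetical geometry of an old CHS-addressed disk */
    uint64_t heads             = 16;
    uint64_t cylinders         = 1024;   /* tracks per surface */
    uint64_t sectors_per_track = 63;
    uint64_t bytes_per_sector  = 512;

    uint64_t capacity = heads * cylinders * sectors_per_track * bytes_per_sector;
    printf("capacity = %llu bytes (about %.1f MB)\n",
           (unsigned long long)capacity, capacity / (1024.0 * 1024.0));
    return 0;
}
```

With these particular numbers the result is about 504 MB, which happens to match the classic CHS addressing limit of early BIOSes.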
This design has an obvious problem: every sector has the same fixed capacity regardless of whether its physical area is large or small, so the extra area of the outer sectors is simply wasted.
Later mechanical disks improved on this with a technique called ZBR (Zoned Bit Recording). ZBR sizes sectors zone by zone: sectors on the same track are laid out with the same size and density.
This means the outer tracks hold more sectors than the inner ones, and the recording density is evenly distributed.
With the improvement in how sectors are laid out, the addressing scheme naturally evolved as well. Most of today's hard disks use LBA (Logical Block Addressing); only by understanding this addressing mode can you understand how a modern disk's capacity is calculated.
However, HDD performance is now limited mainly by random read/write speed. Popular HDDs spin at 7200 rpm or 5400 rpm; the difference is like ants that run a little faster or slower.
Although SAS drives can exceed 15,000 rpm, and research teams keep looking for new materials or other tricks (such as double-platter rotation) to push past the disk's physical speed limit, the fundamental physical limits of the mechanical disk design have never been broken.
Why are 7200 RPM disks and 5400 RPM disks sold in the mainstream market instead of other disks?
On the one hand, testing shows that 7200 rpm strikes the best balance for random read/write performance; and when we discuss disk performance, what matters is random read/write speed, not sequential speed.
Mechanical disks face real physical barriers. Since Toshiba developed flash memory in 1984, flash technology has kept improving, and five years later, in 1989, SSDs gradually entered the stage of history.
⚠️Note: Why the odd numbers 7200 rpm and 5400 rpm?
Both numbers trace back to 3600. In the first decade or so of computing, almost all hard disks spun at 3600 rpm. Where did 3600 come from? Because mains AC power in the United States is 60 Hz! Hence the following:
- 60Hz × 1 rev/Hz × 60 sec/min = 3600 rev/min
- 5400 RPM = 3600 RPM × 1.5
- 7200 RPM = 3600 RPM × 2
Another reason given is patent competition. You will notice there are 15,000 rpm drives but not numbers like 10,000, 9,000 or 8,000; supposedly the round multiples of 500 were locked up by patents, and the patent holders did not expect speeds to ever break 10,000 rpm.
SSD Disk Introduction
There are two kinds of SSDs: flash-based solid state drives built from NAND flash chips, and DRAM-based drives.
Flash-based drives are what most modern laptops and portable SSDs use. Their biggest advantages are portability and the ability to retain data without power. Flash cells are commonly divided into QLC, TLC and MLC; even QLC, with the shortest lifespan, lasts around 5 to 6 years, while MLC lasts the longest and, with reasonable care, can often keep working for more than ten years.
⚠️Note: Solid-state drives used to be very expensive, so mechanical disks were mainstream; SSDs have only become widespread in the last few years, so the lifespans above are somewhat idealized.
Enterprise servers mostly use MLC-based SSDs. The biggest difference from HDDs is that an SSD is not vulnerable to the kind of physical shock that can render an entire HDD unusable; on the other hand, once an SSD's data is damaged, repair is very expensive or simply impossible.
SSDs are very cheap now, but HDDs remain popular with users who need low-cost storage for large amounts of data.
DRAM drives sit between mechanical disks and flash SSDs. They use DRAM as the storage medium, imitate the design of traditional hard disks, can be set up and managed by the file system tools of most operating systems, and provide industry-standard PCI and FC interfaces to connect to a host or server; their biggest problem is that their range of application is quite narrow.
HDD data read method
An HDD reads data in the following sequence:
- The disk is told the device number, which sector to start from, and how many sectors to scan for the read or write.
- The head moves and the platter rotates to find the corresponding sectors.
- Data is read, or written, through the buffer.
- The read is considered complete once all the requested sectors have been scanned.
The main points of HDD disk read and write
Logically, computing where the scan should start and how many sectors to scan is very fast, and transferring the data within a sector is also relatively fast.
However, because of the rotation speed limit, the physical movement of the head and platter is very slow. The bottleneck of the whole read/write path is the physical overhead of seeking to the sector and scanning the platter, and this is what ultimately limits random read/write performance.
Read and write
There are several ways a disk may be scanned, and the scan pattern directly determines how quickly data can be read:
- Sequential scan: several consecutive sectors on one track are read in a single pass, so it is very fast.
- Multiple consecutive sequential scans: several groups of consecutive sectors are read in multiple passes. The overhead is mainly the platter rotation; it is still fairly fast, but the rotation introduces some delay.
- Random read/write: the overhead is mainly the back-and-forth seeking across tracks. The platter may rotate and the head must also hunt for scattered sectors, so random read/write is very inefficient.
⚠️Note: For a single IO access, if the amount of data requested exceeds the disk's per-request limit, the request is split from a single sequential read/write into multiple sequential scans.
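To make the sequential-versus-random difference concrete, here is a minimal sketch in the spirit of the book's C experiments. The file path, block size and read count are placeholders, and on a real system the page cache will mask the disk unless you drop caches or use O_DIRECT:

```c
/* Sequential vs. random reads over an existing large file (path is a placeholder).
 * gcc -O2 seqrand.c -o seqrand && ./seqrand /path/to/big/file */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define BLOCK 512            /* matches the 512-byte sector size */
#define COUNT (64 * 1024)    /* number of reads per pass */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : "testfile";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[BLOCK];
    off_t filesize = lseek(fd, 0, SEEK_END);
    if (filesize < BLOCK) { fprintf(stderr, "file too small\n"); return 1; }

    /* sequential pass */
    lseek(fd, 0, SEEK_SET);
    double t0 = now_sec();
    for (long i = 0; i < COUNT; i++)
        if (read(fd, buf, BLOCK) <= 0) break;
    printf("sequential: %.3f s\n", now_sec() - t0);

    /* random pass */
    srand(42);
    t0 = now_sec();
    for (long i = 0; i < COUNT; i++) {
        off_t off = ((off_t)rand() % (filesize / BLOCK)) * BLOCK;
        pread(fd, buf, BLOCK, off);
    }
    printf("random:     %.3f s\n", now_sec() - t0);

    close(fd);
    return 0;
}
```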
Factors Affecting Hard Drive Performance
- Seek time: the average seek time of a disk is generally 3-15 ms.
- Rotation speed: the faster the rotation, the lower the rotational latency.
- The disk's own read/write performance: this also depends on the disk's design and manufacturer. HDDs with strong random read/write usually have better IO performance, and the faster and larger each data transfer, the better (obviously).
For mechanical disks, seek time and rotational latency are what matter; the raw read/write performance of most HDDs is otherwise fairly similar.
generic block layer
Generic block layer: the layer in Linux that abstracts over HDDs and SSDs.
HDDs and SSDs are called block devices. There are two ways to access a block device: directly, by reading and writing its device file after mounting, or through the file system, which wraps the disk and provides the usual entry points. Most software uses the second way.
Because different kinds of block devices need different handling, access to each has to go through its driver. But in our everyday experience with, say, Windows, we never deal with block devices and drivers one by one; otherwise we would have to install a driver every time we added a new hard disk, which would be far too much trouble.
So how does the operating system solve this problem? This is what the generic block layer does:
The interaction process between Linux and the device
There are many details in this flow chart; any one layer could fill a long article. Here we focus on a simple understanding of the IO scheduler and the disk read-ahead mechanism.
IO scheduler and read-ahead mechanism
Two mechanisms greatly affect HDD performance: the IO scheduler and the read-ahead mechanism. Note that read-ahead helps HDDs far more than SSDs; sorting and read-ahead can easily turn into a negative optimization for an SSD.
⚠️Note: Sometimes when we upgrade an old machine and replace the mechanical disk with an SSD, the screen may go black when reinstalling the system. Some old motherboards use the BIOS to tune the mechanical disk, a bit like "preheating" it, and this preheating interferes with booting from an SSD. If something like this happens, check whether the BIOS has an option for accelerating mechanical disk startup.
IO scheduler
IO scheduler: when a block device is accessed, requests are accumulated for a certain period of time and then issued together.
Therefore, there are two important tasks for the IO scheduler:
- Merging: IO requests that access adjacent sectors are combined into a single request.
- Sorting: because every sector has a number, the IO scheduler sorts the queued requests by sector number before issuing them, making the disk scan closer to a sequential scan.
⚠️Note: The IO scheduler mainly comes into play with concurrent reads and writes from multiple threads, or while waiting for results in asynchronous IO.
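A toy model (this is not kernel code) of what "sort and merge" means for queued sector requests; the sector numbers are made up:

```c
/* Toy model of an I/O scheduler's sort + merge step:
 * requests are sorted by starting sector, and contiguous ranges are merged. */
#include <stdio.h>
#include <stdlib.h>

struct io_req { long sector; long count; };

static int by_sector(const void *a, const void *b) {
    const struct io_req *x = a, *y = b;
    return (x->sector > y->sector) - (x->sector < y->sector);
}

int main(void) {
    struct io_req q[] = { {900, 8}, {100, 8}, {108, 8}, {500, 16} };
    int n = sizeof q / sizeof q[0];

    qsort(q, n, sizeof q[0], by_sector);      /* sort by sector number */

    int m = 0;                                /* merge adjacent requests */
    for (int i = 1; i < n; i++) {
        if (q[m].sector + q[m].count == q[i].sector)
            q[m].count += q[i].count;         /* contiguous: extend the previous request */
        else
            q[++m] = q[i];
    }
    for (int i = 0; i <= m; i++)
        printf("dispatch: sector %ld, %ld sectors\n", q[i].sector, q[i].count);
    return 0;
}
```

Here the requests at sectors 100 and 108 become one 16-sector request, which is exactly the kind of access pattern a mechanical disk prefers.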
read-ahead mechanism
When the disk head scans data, it does not scan only the few sectors the device asked for; it also scans additional surrounding sectors. Note that read-ahead only works during sequential scanning, and if the pre-read sectors turn out not to be needed by the next access, they are simply discarded.
Linux file system design
How to design a simple file system
Starting from the simplest possible viewpoint, we can think about a basic file system through an ordinary file read/write example.
The simplest file system works roughly as follows:
- File data is recorded starting from position 0, and each file keeps three basic pieces of information: its name, its size, and its location in the file system.
- Without a file system to help, we would have to decide the on-disk location of each file ourselves: place the file into, say, regions 1 through 10 of the block device according to its size, and record where each write starts and ends and how much data is stored.
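The three pieces of information named above (name, size, location) could be modeled like this; everything here is a made-up toy, not a real on-disk format:

```c
/* A made-up, minimal table for the toy file system described above:
 * each file records only its name, size, and starting position. */
#include <stdio.h>
#include <string.h>

#define MAX_FILES 16

struct toy_file {
    char name[32];   /* file name */
    long size;       /* size in bytes */
    long start;      /* starting offset inside the block device */
};

struct toy_fs {
    struct toy_file files[MAX_FILES];
    int  count;
    long next_free;  /* where the next file's data begins */
};

static void toy_create(struct toy_fs *fs, const char *name, long size) {
    struct toy_file *f = &fs->files[fs->count++];
    strncpy(f->name, name, sizeof f->name - 1);
    f->name[sizeof f->name - 1] = '\0';
    f->size  = size;
    f->start = fs->next_free;
    fs->next_free += size;   /* naive: no free-space reuse, no fragmentation handling */
}

int main(void) {
    struct toy_fs fs = { .count = 0, .next_free = 0 };
    toy_create(&fs, "a.txt", 4096);
    toy_create(&fs, "b.txt", 1024);
    for (int i = 0; i < fs.count; i++)
        printf("%-8s size=%-6ld start=%ld\n",
               fs.files[i].name, fs.files[i].size, fs.files[i].start);
    return 0;
}
```

Even this tiny sketch already hints at the problems a real file system must solve: free-space reuse, fragmentation, and crash safety.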
Why is there a "state"?
Obviously, in the early single-process, single-user operating systems the concept of a privilege state did not exist. But once processes and users appeared, computers faced an important problem: how to restrict what different processes are allowed to do.
Not every process can be allowed to perform every operation, because no one knows what new processes will do in the future. So engineers decided that certain dangerous operations would be allowed only to the operating system itself, separating the system from user processes. If a user process wants to do something dangerous, it must be "interrogated" by the operating system, which then carries out the operation on its behalf.
The resulting "states" take the following form:
Mode switch
The following is the relationship between the user mode and kernel mode hardware in Linux:
User mode
This is the side visible to the user: for example, wanting to read a certain file or change a few words in it. These user requests are translated by the kernel into machine-level commands, and the actual IO on the disk is performed in kernel mode.
In other words: user mode issues the request -> kernel mode translates it into commands the block device understands -> hardware.
In the simple file system designed above, switching from user mode to kernel mode for file management only needs to care about the file's size, location and name.
Kernel mode
Only the operating system, which holds system privileges, may operate here; this is where the core work of the system is done.
Hardware
The hardware, here essentially the external storage, interacts only with the kernel.
Linux file system structure
Linux's file system is designed as a tree structure. It can support different file system formats; the differences between formats mainly lie in the maximum supported file size, the maximum size of the file system itself, and the speed of various file operations.
Linux has ext2, ext3 and ext4, which differ in file sizes, storage layout and storage location, so how does Linux handle them all?
The Linux file system abstracts file IO operations behind an interface. No matter how the underlying file system changes, the interaction is ultimately carried out through the interfaces listed below.
Aside: this is a bit like the Facade design pattern.
- Create / delete: `create` / `unlink`
- Open / close: `open` / `close`
- Read data from an open file: `read`
- Write data to an open file: `write`
- Move to a specified position in an open file: `lseek`
- File-system-specific special operations: …
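A minimal example that exercises the calls listed above; the file name is arbitrary:

```c
/* open, write, lseek, read, close, unlink on a throwaway file. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("demo.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *msg = "hello, file system\n";
    write(fd, msg, strlen(msg));        /* write data */

    lseek(fd, 7, SEEK_SET);             /* move to a specified position */

    char buf[32] = {0};
    read(fd, buf, sizeof buf - 1);      /* read from that position */
    printf("read back: %s", buf);

    close(fd);
    unlink("demo.txt");                 /* the create/unlink pair */
    return 0;
}
```

Whatever file system sits underneath (ext4, xfs, btrfs), user code only ever sees this one interface.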
This is similar to how Linux manages block devices. In the introduction to Linux and external structures we saw that the generic block layer provides a single abstraction over block devices for the file system: seen from user mode, individual block devices look the same, while the real driver-specific handling happens in kernel mode.
Read file data process
The process of reading a file in Linux is as follows:
- Common processing for each file system.
- File-system-specific processing, invoking the system calls corresponding to the requested operation.
- Device drivers perform read and write data operations.
- The block device driver completes the read and write command operations.
From the logical structure, the entire interaction process is very simple, but in fact this is the result of the continuous efforts of Linux engineers.
data and metadata
In Linux, data is divided into metadata and data. Metadata is information such as the file name, file size and file location, which the kernel uses as a reference when reading the block device; data is the content we actually use every day, such as video or text.
Beyond the items above, metadata also includes the following:
- Type: whether the file is an ordinary file that holds data, a directory, or some other kind of file, i.e. the file type.
- Time information: creation time, last access time, last modification time.
- Permission information: the Linux permissions that control which users may access the file.
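This metadata can be read with the `stat` system call; a small sketch follows, with an arbitrary path:

```c
/* Reading a file's metadata (type, size, permissions, mtime) with stat(2). */
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

int main(void) {
    struct stat st;
    if (stat("/etc/hostname", &st) != 0) { perror("stat"); return 1; }

    printf("type : %s\n", S_ISDIR(st.st_mode) ? "directory" :
                          S_ISREG(st.st_mode) ? "regular file" : "other");
    printf("size : %lld bytes\n", (long long)st.st_size);
    printf("mode : %o\n", st.st_mode & 0777);     /* permission bits */
    printf("mtime: %s", ctime(&st.st_mtime));     /* last modification time */
    return 0;
}
```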
We can use the `df` command and its options to see the file system's parameters and state in detail. `df` is an important operations command that shows how much disk capacity is used.
g@192 ~ % df
Filesystem 512-blocks Used Available Capacity iused ifree %iused Mounted on
/dev/disk3s1s1 965595304 29663992 73597064 29% 500637 367985320 0% /
devfs 711 711 0 100% 1233 0 100% /dev
/dev/disk3s6 965595304 48 73597064 1% 0 367985320 0% /System/Volumes/VM
/dev/disk3s2 965595304 1034480 73597064 2% 2011 367985320 0% /System/Volumes/Preboot
/dev/disk3s4 965595304 31352 73597064 1% 48 367985320 0% /System/Volumes/Update
/dev/disk1s2 1024000 12328 985672 2% 3 4928360 0% /System/Volumes/xarts
/dev/disk1s1 1024000 15040 985672 2% 27 4928360 0% /System/Volumes/iSCPreboot
/dev/disk1s3 1024000 1240 985672 1% 39 4928360 0% /System/Volumes/Hardware
/dev/disk3s5 965595304 859395304 73597064 93% 1212174 367985320 0% /System/Volumes/Data
/dev/disk6s1 1000179712 807402240 192777472 81% 3153915 753037 81% /Volumes/Untitled
/dev/disk7s1 1953443840 1019557888 933885952 53% 497831 455999 52% /Volumes/Extreme SSD
map auto_home 0 0 0 100% 0 0 100% /System/Volumes/Data/home
//GUEST:@Windows%2011._smb._tcp.local/%5BC%5D%20Windows%2011 535375864 212045648 323330216 40% 26505704 40416277 40% /Volumes/[C] Windows 11.hidden
/dev/disk5s2 462144 424224 37920 92% 596 4294966683 0% /private/var/folders/wn/dvqxx9sx4y9dt1mr9lt_v4400000gn/T/zdKbGy
disk quota
Capacity quotas are at the core of disk management. The Linux file management system offers the following kinds of quota:
- User quota: usually applied to /home; each user's home directory typically gets a fixed share of the capacity.
- Subvolume quota: limits the capacity available to units called subvolumes.
- Directory quota: limits the capacity available to a specific directory, for example a shared directory; supported by ext4 and xfs.
Besides these quotas, you must also reserve capacity for the system itself to run normally. In other words, handing out 100% of the disk is a dangerous move; keeping disk usage below about 80% is the more common practice.
Accidental file system recovery
The most common problem in data management is inconsistency, for example a sudden power failure before a write completes. This is not particularly rare, and Linux provides the following two ways of dealing with an inconsistent state after a power failure:
- Journal : Usually found on ext4 and xfs filesystems.
- Copy-on-write : the approach used by btrfs.
Journal mode:
Journaling is the more widely used approach, because the journal is somewhat readable and convenient for recovery. It works in two main steps:
- Before the data is modified, the atomic operations are written to the journal.
- When recovering from a crash, the file state is restored according to the journal contents.
If the failure happens before the journal entry is written, the partially written journal can simply be discarded and rolled back, since the file state has not changed. If the failure happens after the atomic operation has been journaled, the operation can be replayed from the journal record.
In modern file systems, data inconsistencies are mostly caused by bugs in the system itself. SSDs are now the norm and write very quickly, so inconsistencies caused by interrupted writes rarely occur.
Copy-on-write method :
Different file systems implement copy-on-write differently. To introduce it, we first need to look at traditional file systems such as ext4 and xfs.
In those file systems, a file is fixed to a certain location on disk once it is created; even when the content is deleted or updated, the operation happens in the original space.
The copy-on-write management scheme of btrfs is quite different: after a file is created, every update places the new data in a different location.
Copy-on-write means that an update is really a "copy" operation: the new data is written elsewhere, then the reference is switched over, and the original content can still be found as long as it has not been overwritten by newer data.
What if the power suddenly goes out in the middle of a write?
The new data is being written in a different place, so a half-finished write does not affect the old data. If the failure happens after the write but before the reference has been switched, you only need to update the reference. In short, the original data is never affected.
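An in-memory toy model of this idea (not btrfs code): the update is written to a new block, and only then does the "current" reference flip over.

```c
/* Toy copy-on-write: write the copy first, switch the reference last.
 * If power is lost before the switch, the old block is still intact. */
#include <stdio.h>
#include <string.h>

#define BLOCKS 8
#define BLKSZ  32

static char disk[BLOCKS][BLKSZ];   /* pretend block device */
static int  current_block = 0;     /* reference to the live copy */

static void cow_update(const char *newdata) {
    int target = (current_block + 1) % BLOCKS;   /* pick a different location */
    strncpy(disk[target], newdata, BLKSZ - 1);   /* 1. write the new copy */
    current_block = target;                      /* 2. atomically flip the reference */
}

int main(void) {
    strncpy(disk[0], "version 1", BLKSZ - 1);
    cow_update("version 2");
    printf("live data : %s\n", disk[current_block]);
    printf("old block : %s (still readable until overwritten)\n", disk[0]);
    return 0;
}
```

The crash-safety argument is entirely in the ordering: the reference only ever points at a fully written block.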
⚠️Note: In fact, a disk has no real concept of deletion. What the computer calls deleting simply means that user processes can no longer reach the deleted file's address through normal operations; with some special handling, the original file can often still be recovered from its fragments.
Unrecoverable accidents
If the accident is caused by a bug in the file system and cannot be recovered automatically, the handling differs from file system to file system.
Almost all file systems provide the common `fsck` command for recovery, but this command only promises that it may be able to restore a consistent data state.
Here is an introduction to the command:
fsck command
The Linux fsck command (short for "file system check") is used to check and repair Linux file systems; it can check one or more file systems at the same time.
Syntax
fsck [-sACVRP] [-t fstype] [--] [fsck-options] filesys [...]
Parameters :
- filesys : device name (eg./dev/sda1), mount point (eg. / or /usr)
- -t : The type of the given file system, if it is already defined in /etc/fstab or supported by the kernel itself, this parameter does not need to be added
- -s : Execute fsck instructions one by one to check
- -A : check all partitions listed in /etc/fstab
- -C : show full check progress
- -d : print the debug results of e2fsck
- -p : When the -A condition is present at the same time, multiple fsck checks are executed at the same time
- -R : when -A is also given, skip checking the root file system
- -V : verbose display mode
- -a : auto-fix if check is wrong
- -r : If the check is wrong, the user will answer whether to fix it
Example:
Check whether the msdos file system on /dev/hda5 is intact, and repair it automatically if anything is wrong:
fsck -t msdos -a /dev/hda5
Problems with the fsck command
However powerful this command looks, it has some serious problems:
- It walks the entire file system, checking consistency and repairing inconsistencies; on a very large file system, recovery can take hours or even days.
- If the recovery fails midway, the result is worse than a simple crash.
- Even a successful repair does not necessarily restore the state you wanted; inconsistent data and metadata are simply deleted.
Brief Analysis of Computer Storage Hierarchy
Introduction to Storage Components
First, let's take a look at the introduction of different storage levels, including the above-mentioned registers, cache, memory and the relationship between them.
Let's take a look at the storage hierarchy diagram as a whole:
Note ⚠️: the small-print parameters in the figure are fairly dated now; it is enough to understand that speed decreases from left to right, from fastest to slowest.
cache:
Cache is a small but very fast memory located between the CPU and main memory .
Data read from memory does not go straight into a register; it is first stored in the cache, which is usually divided into three levels. How much is read at a time depends on the cache block (line) size, and how fast it is read depends on the level and capacity of the cache involved.
The execution steps of the cache are as follows:
- Read the data into the register according to the instruction.
- The registers perform the calculation.
- Transfer the result of the operation to memory.
Of these three steps, transfers between registers and cache cost essentially nothing, but transfers to and from memory are much slower, so the bottleneck of the whole operation is the memory transfer speed. That is exactly why the cache exists: to bridge the huge gap between registers and memory.
The cache is divided into L1, L2, and L3. Before describing the theoretical knowledge, here is an example to facilitate understanding:
- L1 cache: like keeping a tool on your belt; the steps to grab it are the simplest and fastest.
- L2 cache: like keeping the tools you need in a toolbox. To use one, you first open the toolbox and hang the tool on your belt. Why not just take it out of the toolbox and put it back each time? Think how tiring that would be if you needed it constantly. The toolbox is larger than your belt, but not enormously so; likewise, L2 is not much larger than L1.
- L3 cache: much larger than L1 and L2, like a warehouse. To get data you first go to the warehouse to fetch the toolbox and put it next to you, then proceed as above. The warehouse has a large capacity, but requires the most steps and the largest time overhead.
Below the L1 cache sits L2, and below L2 sits L3. As the analogy suggests, L2 and L3 have the same issues as L1: they need locking and synchronization, L2 is slower than L1, and L3 is slower than L2.
Here we give an example to briefly describe the internal operation of the cache:
Suppose the cache block is 10 bytes, the cache is 50 bytes, and the registers R0 and R1 total 20 bytes. When R1 needs to read data at some address for the first time, 10 bytes are first loaded into the cache and then transferred from the cache into the register, so the register now holds 10 bytes. The next time 10 bytes are needed, if the cache finds it already holds the same data, the 10 bytes are read straight from the cache into R1.
What if R0's data is then overwritten? The CPU first rewrites the register's value, and after that also rewrites the corresponding value in the cache. Cache blocks that came from memory and have been modified are first marked in the cache, and at some later point the cache synchronizes the rewritten data back into memory.
What happens if the system runs out of cache?
First, the cache evicts the least recently used entries according to its eviction policy. But if cache lines become "dirty" quickly and the cache is constantly short of space, memory is repeatedly written into the cache and cache lines are constantly replaced, and perceptible system thrashing can occur.
Note⚠️: The discussion in this part assumes write-back. Rewriting strategies are divided into write-through and write-back: with write-back there is some delay, as dirty data is accumulated and periodically flushed to keep memory in sync; with write-through, memory is updated the moment the cache changes.
How do we characterize access locality?
Almost all programs exhibit the following two kinds of locality:
- Temporal locality: data accessed once is likely to be accessed again shortly afterwards; the common case is repeatedly fetching a value inside a loop.
- Spatial locality: when one piece of data is accessed, the data around it tends to be accessed too, which is somewhat similar to a disk's read-ahead mechanism.
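A small C experiment makes spatial locality visible: a sequential pass reuses each loaded cache line, while a large stride wastes most of every line it touches. The array size and stride are arbitrary.

```c
/* Spatial locality demo: both passes touch all N bytes, but the strided pass
 * jumps 4096 bytes between accesses and misses the cache far more often. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64L * 1024 * 1024)

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    char *a = malloc(N);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1;

    long sum = 0;
    double t0 = now_sec();
    for (long i = 0; i < N; i++) sum += a[i];            /* sequential */
    printf("sequential: %.3f s (sum=%ld)\n", now_sec() - t0, sum);

    sum = 0;
    t0 = now_sec();
    for (long s = 0; s < 4096; s++)                       /* stride of 4096 bytes */
        for (long i = s; i < N; i += 4096) sum += a[i];
    printf("strided:    %.3f s (sum=%ld)\n", now_sec() - t0, sum);

    free(a);
    return 0;
}
```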
If a program is written with these two points in mind, it can basically be considered a well-behaved program; reality, however, is often otherwise.
Summary:
- Caching is a design whose benefits far outweigh its costs.
- Its main issues are data inconsistency and the performance cost of keeping data synchronized.
- Once the cache fills up, the system's processing speed suffers a certain delay.
register:
The registers are part of the central processing unit and include the instruction register (IR), the program counter (PC) and the accumulator (for arithmetic).
ARM uses a reduced instruction set, while x86 uses a complex instruction set. Even if x86 may be nearing a dead end, it still dominates the market.
The complex instruction set will contain a lot of registers to complete complex operations, such as the following registers:
- general purpose register
- flag register
- instruction register
If you are interested, you can use the register as an entry into the X86 architecture.
RAM:
Memory here is not just the RAM sticks we normally think of; in the broad sense it also includes read-only storage, random access storage and cache storage.
A natural question: why is main memory used the most, yet slower than registers and caches?
Because memory must talk not only to the CPU but also to other controllers and hardware. The more it has to serve, the lower its efficiency, and when memory is busy the CPU has to wait for the transfer. This also explains why high-speed caches and registers are needed in the first place.
Beyond that, there is a more fundamental reason: the motherboard's bus bandwidth is limited and has to be shared among many consumers, such as the south bridge and other external devices, and the bus must be arbitrated for rather than simply partitioned.
other supplements
translation lookaside buffer
The following content is from the explanation of Wikipedia:
A translation lookaside buffer (TLB), also commonly called the page-table cache or translation cache, is a cache inside the CPU used by the memory management unit to speed up the translation of virtual addresses into physical addresses.
All current desktop and server processors (e.g. x86) use a TLB. The TLB has a fixed number of slots holding page table entries that map virtual addresses to physical addresses, and it is typically implemented as content-addressable memory (CAM).
The search key is the virtual address and the search result is the physical address. If the requested virtual address is present in the TLB, the CAM returns a match very quickly and the resulting physical address can be used to access memory. If the requested virtual address is not in the TLB, the page table must be walked to translate it, and walking the page table is much slower than hitting the TLB.
Some systems allow the page table itself to be swapped out to secondary storage, in which case a virtual-to-physical translation can take a very long time.
When a process wants to access some data through a logical address, the slow path looks like this:
- The virtual address is translated to a physical address by looking it up in the page table.
- The data is then fetched by accessing that physical address.
Note ⚠️: this is like following a two-level pointer. For a cache to help, the lookup must behave like a single-level pointer; caching a two-level lookup as-is would be pointless.
Put bluntly, the translation lookaside buffer is a dedicated piece of storage that accelerates virtual-to-physical address translation, precisely to speed up the multi-level, nested page-table lookup.
page cache
Note that what was just described is the page table cache; here we discuss the page cache.
What is the page cache for? External storage is by far the slowest tier, so applications usually operate on disk data by loading it into memory first. But the data is not copied straight from disk into process memory: an extra layer, the page cache, sits between memory and the external storage device.
The steps to read the page cache are as follows:
- A process reads file data from disk; the kernel finds the relevant data and loads it into the page cache.
- The contents of the page cache are copied into the process's memory, so the data on disk, in memory and in the page cache are consistent.
- If the file data is to be rewritten, the page cache is notified first and marks the affected pages as "dirty pages".
- If memory runs short, clean page cache pages are freed for memory to use.
- If both the page cache and memory are short, the dirty pages must be flushed to free space so memory can keep being used.
- Under normal circumstances, the page cache periodically flushes its dirty pages back to disk to keep the data in sync.
Also note that as long as processes leave memory unused, the page cache will keep "expanding" to fill it; and if the page cache and memory stay chronically short, dirty pages will be written back continuously and performance jitter will appear.
buffer cache
The buffer cache is easy to confuse with the page cache. It is enough to understand it as temporary storage for raw disk blocks, that is, a cache for disk data accessed through device files: direct reads and writes of external storage devices such as USB drives and external disks go through the buffer cache.
Note that the buffer cache is usually not very large (around 20 MB), just enough for the kernel to gather scattered writes and optimize them into unified disk writes, for example merging several small writes into one large write.
Tuning parameters in Linux
After understanding the content and details of the above components, let's look at a few simple Linux tuning parameters.
write back cycle
The write-back period can be adjusted with the sysctl parameter `vm.dirty_writeback_centisecs`. Note the unusual unit, centiseconds; the default value is 500, i.e. a write-back every 5 seconds.
A centisecond (symbol cs) is 1/100 of a second.
Of course, do not set this value to 0 except as an experiment.
Besides this parameter there is a percentage parameter: when the proportion of dirty pages exceeds it, a background write-back is triggered to prevent severe performance jitter. The 10 below means 10%.
The parameter looks like this:
vm.dirty_background_ratio = 10
If you prefer to control this threshold in bytes, use `vm.dirty_background_bytes`; a value of 0 means this setting is disabled.
Dirty pages are not allowed to pile up forever. When they reach the ratio set by `vm.dirty_ratio`, the kernel blocks user processes and writes all dirty pages back.
Besides the percentage, a byte-based limit is also available through `vm.dirty_bytes`.
Besides these less commonly used parameters, there are some more specialized tuning knobs.
For example, to drop all page caches, write 3 to `/proc/sys/vm/drop_caches`; why the value is 3 is left for the reader to find out.
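A small C helper can print the current values of the parameters mentioned above by reading them from /proc/sys/vm (reading does not require root); which files exist may vary slightly by kernel version:

```c
/* Print current write-back tuning parameters from /proc/sys/vm. */
#include <stdio.h>

static void show(const char *name) {
    char path[128], value[64];
    snprintf(path, sizeof path, "/proc/sys/vm/%s", name);
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    if (fgets(value, sizeof value, f))
        printf("%-28s = %s", name, value);
    fclose(f);
}

int main(void) {
    show("dirty_writeback_centisecs");
    show("dirty_background_ratio");
    show("dirty_background_bytes");
    show("dirty_ratio");
    show("dirty_bytes");
    return 0;
}
```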
hyperthreading
Hyper-Threading (HT) is a technology developed by Intel and released in 2002. It lets one physical core present itself as two logical cores, so even a single-core CPU can enjoy something like dual-core behavior. Hyper-threading is not pure gain, though: it brings the obvious downsides of contention between the two hardware threads and the overhead of their context, and even in the most ideal case it improves throughput by only about 20%-30%. Still, given the technical constraints of the time, that optimization and performance gain were very significant.
<s>And from then on, the toothpaste factory set off down the road of incremental squeezes and never looked back.</s>
summary
This part reads like a slightly deeper introduction to several core components of a home computer. Learning it is necessary not only for understanding computers in depth, but also for making sense of the specifications quoted when we shop for one.
The supplementary section introduced three caches: the translation lookaside buffer, the page cache and the buffer cache. Their names are similar, but their inner workings differ considerably. After introducing the three, it covered some Linux tuning parameters.
For going deeper into the x86 architecture, understanding the core working mechanism of each register is the key; going forward, the reduced instruction set used by ARM looks better suited to the ecosystem as a whole.
Linux memory management
A brief introduction
Let's briefly introduce Linux memory management. In Linux, memory can be roughly divided into three parts:
- memory used by the kernel
- memory used by processes
- free memory
Apart from the memory the kernel needs to keep the system running, which cannot be freed, the rest can be managed freely by the operating system. On Linux, the `free` command shows memory usage in detail; its output looks something like this:
/opt/app/tdev1$free
total used free shared buffers cached
Mem: 8175320 6159248 2016072 0 310208 5243680
-/+ buffers/cache: 605360 7569960
Swap: 6881272 16196 6865076
The meaning of each column is as follows:
- total: The total amount of physical memory on the system, such as 8G above.
- free: memory that is free on the surface.
- buff/cache: the buffer cache and the page cache. As discussed in the analysis of the computer storage hierarchy, when memory runs short these caches can be released to make room.
- available: the memory that can actually be used; the calculation is simply the total usable memory outside the kernel minus (free plus the maximum amount that buff/cache can release).
Besides the columns there is also a swap row; its meaning is introduced later.
Besides `free`, Linux also has `sar -r`, which lets you specify a sampling interval; for example `sar -r 1` samples once per second.
The computer I currently use is a Mac. Although it is a Unix-like system, it has no `free` command; the following commands are rough substitutes, though nowhere near as convenient.
On macOS, `top -l 1 | head -n 10` shows the overall state of the system.
MacBook-Pro ~ % top -l 1 | head -n 10
Processes: 604 total, 2 running, 602 sleeping, 3387 threads
2022/04/15 17:29:57
Load Avg: 2.84, 3.27, 5.68
CPU usage: 6.8% user, 14.18% sys, 79.72% idle
SharedLibs: 491M resident, 96M data, 48M linkedit.
MemRegions: 168374 total, 5515M resident, 235M private, 2390M shared.
PhysMem: 15G used (1852M wired), 246M unused.
VM: 221T vsize, 3823M framework vsize, 0(0) swapins, 0(0) swapouts.
Networks: packets: 312659/297M in, 230345/153M out.
Disks: 788193/14G read, 161767/3167M written.
In addition, on macOS you can use `diskutil list`:
~ > diskutil list
/dev/disk0 (internal):
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme 500.3 GB disk0
1: Apple_APFS_ISC 524.3 MB disk0s1
2: Apple_APFS Container disk3 494.4 GB disk0s2
3: Apple_APFS_Recovery 5.4 GB disk0s3
/dev/disk3 (synthesized):
#: TYPE NAME SIZE IDENTIFIER
0: APFS Container Scheme - +494.4 GB disk3
Physical Store disk0s2
1: APFS Volume mysystem 15.2 GB disk3s1
2: APFS Snapshot com.apple.os.update-... 15.2 GB disk3s1s1
3: APFS Volume Preboot 529.6 MB disk3s2
4: APFS Volume Recovery 798.6 MB disk3s3
5: APFS Volume Data 455.3 GB disk3s5
6: APFS Volume VM 24.6 KB disk3s6
/dev/disk6 (external, physical):
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *512.1 GB disk6
1: Microsoft Basic Data 512.1 GB disk6s1
/dev/disk7 (external, physical):
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *1.0 TB disk7
1: Microsoft Basic Data Extreme SSD 1.0 TB disk7s1
The correspondence between the output of `free` and the fields of `sar` is as follows:
- total: no corresponding field
- free: kbmemfree
- buff/cache: kbbuffers + kbcached
- available: no corresponding field
If memory usage gets too high, the system may forcibly `kill` some process to free memory. The victim is effectively random and the action cannot be monitored, which is very dangerous on commercial machines, so some commercial machines are configured to force the whole system down as soon as an OOM occurs.
Memory allocation methods and their problems
The kernel allocates memory at roughly two points in time:
1. When a process is created.
2. When the process requests dynamic memory after it has been created.
After a process is created, if it needs more memory, it can send an allocation request to the kernel; on receiving it, the kernel carves out available memory and gives the process the start and end addresses to use.
But this give-a-little-when-asked approach has several common problems:
- It is difficult to run multiple tasks.
- Memory intended for other purposes can be accessed.
- Memory fragmentation.
Note ⚠️: memory must communicate not only with the CPU but also with other controllers and hardware; allocating memory to processes is just one of many jobs.
Difficulty running multiple tasks
This can be understood as processes needing to request memory very frequently: the kernel has to keep performing allocations for them, and the whole workload is effectively dragged down by a single process.
In addition, if several tasks happen to want memory in the same region, one process understandably has to wait while the allocation for the other completes.
Memory fragmentation
The cause is that every time a process obtains memory it has to know in advance how large a region it needs, otherwise the memory cannot be handed out.
The other major problem of fragmentation is that plenty of memory may be free, yet no single contiguous block is large enough to give to a process, leading to constant reclaiming and reallocating.
Accessing memory intended for other purposes
This kind of access by a process triggers what is called a page-fault interrupt, which is introduced later.
Virtual addresses and physical addresses
To solve the problems above, the operating system manages memory through virtual memory and physical memory.
We need to understand three concepts: address space, virtual address, physical address.
Address space: the range that can be reached through addresses is collectively called the address space.
Virtual address: a process cannot directly access real physical memory addresses; it accesses virtual addresses that are mapped onto the actual memory addresses, in order to protect the system and hardware.
Physical address: the actual address corresponding to real memory.
A simple example: if the kernel gives a process a virtual address of 100, that virtual address may actually point to physical address 600.
Page tables
The mapping from virtual to physical addresses relies on the page table. In virtual memory, all memory is divided into pages, and the entry describing one page is called a page table entry; it records the mapping between physical and virtual addresses.
On the x86-64 architecture a page is 4 KB. A process occupies a fixed range of addresses in memory, so what happens if a page access falls outside it, that is, accesses space with no virtual-to-physical mapping?
In that case the CPU raises a page fault, aborts the instruction of the process that caused it, and the kernel's interrupt handler takes over.
Note ⚠️: this corresponds to the earlier problem of accessing memory intended for other purposes.
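A toy single-level page table illustrates the translation and the page-fault case; real x86-64 uses multi-level tables, and all numbers here are made up:

```c
/* Toy single-level page table: translate a virtual address, or report a
 * "page fault" when there is no mapping. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 16

struct pte {
    int      present;     /* is a physical page mapped? */
    uint64_t phys_page;   /* physical page number */
};

static struct pte page_table[NUM_PAGES];

static int translate(uint64_t vaddr, uint64_t *paddr) {
    uint64_t vpage  = vaddr / PAGE_SIZE;
    uint64_t offset = vaddr % PAGE_SIZE;
    if (vpage >= NUM_PAGES || !page_table[vpage].present)
        return -1;                                   /* page fault */
    *paddr = page_table[vpage].phys_page * PAGE_SIZE + offset;
    return 0;
}

int main(void) {
    page_table[0] = (struct pte){ .present = 1, .phys_page = 600 };  /* vpage 0 -> ppage 600 */

    uint64_t paddr;
    if (translate(100, &paddr) == 0)
        printf("virtual 100 -> physical %llu\n", (unsigned long long)paddr);
    if (translate(5 * PAGE_SIZE, &paddr) != 0)
        printf("virtual %d -> page fault, the kernel's handler takes over\n", 5 * PAGE_SIZE);
    return 0;
}
```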
Virtual memory allocation
The allocation of virtual memory can be understood as a few core steps:
- The kernel locates physical memory and works out the physical address space that is required.
- It builds the process's page table, mapping the physical addresses to virtual addresses.
- If the process needs dynamic memory, the kernel allocates new page table entries and new usable memory for the process, backed by the corresponding physical memory.
Physical pages are handed out by demand paging, and the allocation procedure is quite involved.
Upper-level allocation
In the C language the function for allocating memory is `malloc`, while the Linux facility for allocating memory is `mmap`; the biggest difference is that `mmap` allocates by page, whereas `malloc` allocates by byte.
`glibc` uses the `mmap` system call to request a large chunk of memory as a pool, and programs call `malloc` to have concrete pieces of that pool handed out for the process to use; when the pool is exhausted, glibc calls `mmap` again and repeats the allocation.
Higher-level languages do much the same: memory is first requested from the kernel through `glibc` to perform the virtual memory allocation, then `malloc` carves out the concrete memory to use. Higher-level languages merely hide this behind an interpreter or runtime; after layer upon layer of translation, the final operations are still the ones described above.
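A small example shows the two levels side by side: `mmap` asks the kernel for whole pages, while `malloc` hands out byte-sized pieces from the pool glibc manages behind the scenes.

```c
/* Page-granularity allocation via mmap vs. byte-granularity via malloc. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* whole pages straight from the kernel */
    size_t len = 4096;
    char *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }
    strcpy(page, "from mmap");

    /* byte-sized chunk through glibc's allocator */
    char *bytes = malloc(20);
    if (!bytes) return 1;
    strcpy(bytes, "from malloc");

    printf("%s / %s\n", page, bytes);

    free(bytes);
    munmap(page, len);
    return 0;
}
```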
How does virtual memory solve the problems of simple allocation?
Let's bring back the three problems above and explain how virtual memory deals with each:
- Difficulty running multiple tasks: each process has its own virtual address space, so programs can be written against a dedicated address space and no longer block and wait on each other.
- Accessing memory intended for other purposes: the virtual address space is private to the process, and so is its page table; the page table also prevents the current process from reaching other processes' page tables and address spaces.
- Memory fragmentation: allocation goes through the page table, and because the page table records the mapping between physical and virtual addresses, the kernel always knows exactly how the unused space is being used.
Other uses of virtual memory:
- File mapping
- Demand paging
- Fast process creation via copy-on-write
- Multi-level page tables
- Huge pages
Summary
This part gave an introductory view of Linux memory management: the simple allocation schemes, how Linux uses page tables to map physical addresses to virtual addresses, and finally how the operating system and a programming language, that is, a process, negotiate memory allocation, including the concrete steps and interaction logic.
Linux memory management optimization
File mapping
From the earlier material we know that file mapping is implemented by mapping virtual memory: what the process accesses is really the virtual memory address of a copy corresponding to the file. Since the file can be modified through the mapped virtual address, modifying the content by going to the physical memory, that is, the actual memory, is also feasible.
If you know a file's concrete address, you can even locate the memory address directly and overwrite its content; the book contains an interesting C verification program for this.
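Below is a small sketch in the spirit of the book's experiment (the book's exact program is not reproduced here): map a file into the process address space and change it by writing to memory.

```c
/* Map a throwaway file and modify it through the mapping. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void) {
    int fd = open("mapped.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    const char *init = "hello file mapping\n";
    write(fd, init, strlen(init));

    char *p = mmap(NULL, strlen(init), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(p, "HELLO", 5);            /* writing to memory modifies the file */
    msync(p, strlen(init), MS_SYNC);  /* flush the change back to the file */

    munmap(p, strlen(init));
    close(fd);
    return 0;
}
```

After running it, `cat mapped.txt` should show the first word in capitals, which is the whole point: the file was changed without a single `write` to its new contents.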
Demand paging
A process obtains memory from the kernel through demand paging. As mentioned earlier, requesting memory through `mmap` is convenient, but it has problems.
The usual allocation approaches are the following two:
- Direct request and allocation of physical memory, which is efficient.
- Handle-style allocation, i.e. the page table maps virtual memory to real memory before handing it to the process.
Both approaches share two fairly obvious problems: memory that is requested but never used is wasted in bulk, and a single glibc request may exceed what the process needs, while the process may well never use it, and even …