Multithreaded competition

In the last article we finally got multithreading up and running, and built the first version of the task scheduler, so this kernel is finally starting to look like an operating system. With multithreading in place, the next steps are to enter user mode to run threads, establish processes, and load user executable programs.

Before that, however, an important and dangerous problem arrives together with multithreading: thread competition and synchronization. You have probably written multithreaded programs in user mode, so you should understand the races and synchronization problems between threads, why they happen, and the concept and use of locks. This article examines and discusses the implementation of locks in the kernel. Note that locking is a huge and complex subject, and locks in the kernel differ in many ways from locks in user mode, in both implementation and usage (though they share a great deal in common); this article is only my own superficial understanding and implementation, and discussion and corrections are welcome.

Lock

I don't think I need to explain the data corruption caused by competition between threads. After bringing threads up and running in the previous two articles, we should clearly realize that an interrupt may happen at any time, at any instruction. Any non-atomic operation on shared data can therefore cause a race between threads.

For our current kernel, there are actually many places where a lock is needed to protect access to shared data structures, for example:

  • the page fault handler, which allocates physical frames from a bitmap that obviously needs protection;
  • kheap , which every thread is digging memory out of;
  • the various task queues in the scheduler , such as ready_tasks ;
  • ......

Most programming languages that support multithreading come with lock-related concepts and tools. As a bare-bones kernel project, we need to implement them ourselves.

Locking is a complicated subject. Beyond safety first, the quality of a lock's design, implementation, and usage greatly affects system performance. Bad lock design and usage can lead to unreasonable thread scheduling and large amounts of wasted CPU time, reducing system throughput.

Next we start from the underlying principle of locks, then discuss several common lock types, their implementations, and their usage scenarios.

Atomic instruction operation

Logically speaking, a lock is very simple:

if (lock_hold == 0) {
  // Not locked, I get the lock!
  lock_hold = 1;
} else {
  // Already locked :(
}

Here lock_hold stores whether the lock is currently held; its value is true / false . Whoever tries to take the lock first checks whether it is 0. If it is 0, the lock is not held by anyone else, so the caller can take the lock and set it to 1, marking it as locked so that nobody else can take it afterwards.

However, the above implementation is wrong because the if check and the assignment lock_hold = 1 are not atomic. Two threads may both see lock_hold as 0 before either has had a chance to set it to 1, so both pass the if check and take the lock together.

The core problem is that these lock operations are not atomic: they are not completed by a single instruction, so the interleaved execution of two threads inside them can cause a data race.

Therefore, for any lock, the lowest-level implementation must be an atomic instruction: a single instruction that completes both the check and the change , ensuring that only one thread can pass through it while the others are blocked. For example:

uint32 compare_and_exchange(volatile uint32* dst,
                            uint32 src);

It must be implemented in assembly:

compare_and_exchange:
  mov edx, [esp + 4]
  mov ecx, [esp + 8]
  mov eax, 0
  lock cmpxchg [edx], ecx
  ret

The cmpxchg instruction is a compare-and-exchange instruction: it compares the first operand with the value of eax :

  • If they are equal, it loads the second operand into the first operand;
  • If they are different, it loads the value of the first operand into eax .

( cmpxchg is prefixed with lock so that when the instruction executes on a multi-core CPU, its memory access is guaranteed to be exclusive and visible to the other cores. This touches on multi-core cache coherence, which you can skip for now; for the single-core CPU our project experiments run on, the lock prefix is not necessary.)

In fact, this instruction merges the check and the modify into a single atomic operation. We use it to implement the lock: the operand dst marks whether the lock is held, and is compared against eax = 0 :

  • If they are equal, it is the first case: 0 means unlocked, so 1 is assigned to dst , locking it, and the return value is eax = 0 ;
  • If they are not equal, dst must be 1 and the lock is already held by someone else; this is the second case, so the value dst = 1 is loaded into eax , and the return value eax has been modified to 1.

Usage:
uint32 lock_hold = 0;

void acquire_lock() {
    if (compare_and_exchange(&lock_hold, 1) == 0) {
        // Get lock!
    } else {
        // NOT get lock.
    }
}
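For reference, the same check-and-change can be expressed with C11 atomics instead of hand-written assembly. This is only a sketch, assuming a hosted compiler with stdatomic.h (which our freestanding kernel would not use); the return convention matches the assembly version: 0 means the lock was acquired.

```c
#include <stdatomic.h>
#include <stdint.h>

// C11 sketch of the assembly routine: compare *dst with 0;
// if equal, store src into *dst. Returns the value observed
// in *dst (0 = lock acquired, nonzero = already held).
static uint32_t compare_and_exchange(_Atomic uint32_t* dst, uint32_t src) {
  uint32_t expected = 0;
  atomic_compare_exchange_strong(dst, &expected, src);
  return expected;  // stays 0 on success; old *dst value on failure
}

static _Atomic uint32_t lock_hold = 0;
```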

Besides the cmpxchg instruction, another way to implement this is the xchg instruction, which I personally find easier to understand:

atomic_exchange:
  mov ecx, [esp + 4]
  mov eax, [esp + 8]
  xchg [ecx], eax
  ret

The xchg instruction takes two operands and exchanges their values. The atomic_exchange function then returns the post-exchange value of the second operand, which is in fact the old value of the first operand before the exchange.

How do we use atomic_exchange to implement a lock? With the same code as before:

uint32 lock_hold = 0;

void acquire_lock() {
    if (atomic_exchange(&lock_hold, 1) == 0) {
        // Get lock!
    } else {
        // NOT get lock.
    }
}

Whoever tries to take the lock always exchanges the value 1 (locked) into lock_hold , so atomic_exchange always returns the old value of lock_hold while simultaneously setting it to 1. Only when the old value of lock_hold was 0 does the check above pass, meaning the lock was not held by anyone before and has now been successfully acquired.

Note that only one instruction is used here to both check and change lock_hold . The interesting thing is that it changes first and checks afterwards, but this does not affect correctness in the slightest.
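The "change first, check later" pattern can also be written with the C11 atomic_exchange primitive. Again a user-mode sketch (the function names try_acquire and release are illustrative, not from the kernel):

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t lock_hold = 0;

// Unconditionally write 1 (locked), then look at the value
// that was there before: 0 means we just acquired the lock.
static uint32_t try_acquire(void) {
  return atomic_exchange(&lock_hold, 1);  // returns the old value
}

static void release(void) {
  atomic_store(&lock_hold, 0);
}
```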

Spinlock

The discussion above covered the low-level implementation of acquiring a lock, but this is just the tip of the iceberg. The real complexity of locks lies in what happens after an acquire fails. This is also a very important way to classify locks, and it greatly affects a lock's performance and usage scenarios.

The first lock to discuss is the spin lock ( spinlock ), which keeps retrying after a failed acquire until it succeeds:

#define LOCKED_YES  1
#define LOCKED_NO   0

void spin_lock() {
  while (atomic_exchange(&lock_hold, LOCKED_YES) != LOCKED_NO) {}
}

This is busy waiting: the current thread keeps the CPU and keeps retrying the acquire, simple and crude.

First of all, be clear that a spinlock cannot be used on a single-core CPU. On a single core, only one thread executes at any moment; if the lock cannot be acquired, spinning in place is meaningless, because the thread holding the lock cannot possibly release it during the spin.

On a multi-core CPU, however, a spinlock is useful. If the acquire fails, retrying for a while may catch the moment the lock holder releases the lock, because the holder is very likely running on another core at that instant, inside its critical section .

This works when the critical section is very small and lock contention is not fierce, because then a spin wait is unlikely to last long. If instead the current thread blocks on the lock, it may pay a higher price, which will be discussed in detail later.

However, if the critical section is relatively large or lock contention is fierce, a spinlock is inappropriate even on multi-core CPUs: it is unwise to keep spinning and wasting CPU time.
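A user-mode demonstration of a spinlock protecting a tiny critical section, sketched with the C11 atomic_flag primitive and POSIX threads (the names lock_flag , worker , etc. are illustrative, not from the kernel code):

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static long counter = 0;

static void spin_lock(void) {
  // test_and_set is an atomic exchange that returns the old value:
  // keep spinning until we observe "was not set".
  while (atomic_flag_test_and_set(&lock_flag)) { /* spin */ }
}

static void spin_unlock(void) {
  atomic_flag_clear(&lock_flag);
}

static void* worker(void* arg) {
  (void)arg;
  for (int i = 0; i < 100000; i++) {
    spin_lock();
    counter++;          // critical section kept as small as possible
    spin_unlock();
  }
  return NULL;
}
```

Without the lock, the four threads below would lose updates to counter ; with it, the final count is exact.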

yield lock

As mentioned above, a spinlock is useless on a single-core CPU, and our kernel happens to run on a single-core CPU simulator, so we need a lightweight lock similar to a spinlock. I will call it a yield lock .

As the name implies, a yield lock actively gives up the CPU after a failed acquire. In other words: since I can't get the lock right now, I'll take a rest and let the other threads run first; after they have all had a turn, I'll come back and try again.

Its behavior is essentially still spinning, but unlike spinning in place it does not waste any CPU time: it immediately hands the CPU to others, so the thread holding the lock gets a chance to run, and by the time the current thread's next time slice comes around, the lock has likely been released:

void yield_lock() {
  while (atomic_exchange(&lock_hold, LOCKED_YES) != LOCKED_NO) {
    schedule_thread_yield();
  }
}

Note that schedule_thread_yield must be placed inside the while loop: even if the lock holder releases the lock, that does not mean the current thread will get it next, because there may be other competitors. So after yield returns, the thread must compete for the lock again.

Like the spinlock, the yield lock also suits situations where the critical section is relatively small and contention is not intense; otherwise many threads will wait in vain again and again, wasting CPU resources.
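In user mode, the same idea can be sketched with sched_yield() playing the role of schedule_thread_yield (an assumed analogy; the kernel version yields to its own scheduler instead):

```c
#include <stdatomic.h>
#include <sched.h>   // sched_yield(): user-mode analogue of schedule_thread_yield

#define LOCKED_YES 1
#define LOCKED_NO  0

static _Atomic int lock_hold = LOCKED_NO;

static void yield_lock(void) {
  // Retry inside the loop: getting the CPU back does not
  // guarantee winning the lock over other competitors.
  while (atomic_exchange(&lock_hold, LOCKED_YES) != LOCKED_NO) {
    sched_yield();   // give up the CPU instead of spinning in place
  }
}

static void yield_unlock(void) {
  atomic_store(&lock_hold, LOCKED_NO);
}
```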

Blocking lock

Both locks above are non-blocking locks : a thread does not block when an acquire fails, but keeps retrying, either immediately or after a delay. When the critical section is relatively large or contention on the lock is fierce, constant retrying is likely futile and wastes CPU resources.

To solve this problem there is the blocking lock , which maintains a queue internally. A thread that cannot get the lock adds itself to the queue, goes to sleep, and gives up the CPU; while sleeping it will not be scheduled to run, i.e. it enters the blocked state. When the thread holding the lock releases it, one thread is taken out of the queue and woken up again.

For example, we define the following blocking lock, named mutex :

struct mutex {
  volatile uint32 hold;
  linked_list_t waiting_task_queue;
  yieldlock_t ydlock;
};

The lock implementation:

void mutex_lock(mutex_t* mp) {
  yieldlock_lock(&mp->ydlock);
  while (atomic_exchange(&mp->hold, LOCKED_YES) != LOCKED_NO) {
    // Add current thread to wait queue.
    thread_node_t* thread_node = get_crt_thread_node();
    linked_list_append(&mp->waiting_task_queue, thread_node);
    
    // Mark this task status TASK_WAITING so that
    // it will not be put into ready_tasks queue
    // by scheduler.
    schedule_mark_thread_block();
    
    yieldlock_unlock(&mp->ydlock);
    schedule_thread_yield();

    // Waken up, and try acquire lock again.
    yieldlock_lock(&mp->ydlock);
  }
  yieldlock_unlock(&mp->ydlock);
}

The implementation of lock here is noticeably more complicated. It is actually the standard implementation of conditional wait : waiting, in a blocking manner, for an expected condition to become true. The condition waited on here is: the lock has been released, so I can try to acquire it.

After a failed attempt to acquire the lock, the current thread adds itself to the mutex 's waiting_task_queue , marks itself TASK_WAITING , and then gives up the CPU. Giving up the CPU here uses the same schedule_thread_yield function as in the yield lock, but there is an essential difference:

  • in the yield lock, the thread that yields is still placed in the ready_tasks queue, and the scheduler will still schedule it later;
  • here the thread first marks itself TASK_WAITING , so schedule_thread_yield will not add it to the ready_tasks queue; it truly enters the blocked state, and it will not be scheduled again until the thread holding the lock next calls unlock , takes it out of the mutex 's waiting_task_queue to wake it up, and puts it back into the ready_tasks queue.

The unlock implementation:
void mutex_unlock(mutex_t* mp) {
  yieldlock_lock(&mp->ydlock);
  mp->hold = LOCKED_NO;

  if (mp->waiting_task_queue.size > 0) {
    // Wake up a waiting thread from queue.
    thread_node_t* head = mp->waiting_task_queue.head;
    linked_list_remove(&mp->waiting_task_queue, head);
    // Put waken up thread back to ready_tasks queue.
    add_thread_node_to_schedule(head);
  }
  yieldlock_unlock(&mp->ydlock);
}
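The same conditional-wait pattern exists in user mode with POSIX primitives. In this sketch (an assumed analogy, with illustrative names, not the kernel code) the pthread mutex guard plays the role of the internal yieldlock, and the condition variable plays the role of waiting_task_queue :

```c
#include <pthread.h>

typedef struct {
  int hold;                  // 1 = locked, 0 = free
  pthread_mutex_t guard;     // protects `hold`, like the internal yieldlock
  pthread_cond_t released;   // waiters sleep here, like waiting_task_queue
} block_lock_t;

static void block_lock_init(block_lock_t* l) {
  l->hold = 0;
  pthread_mutex_init(&l->guard, NULL);
  pthread_cond_init(&l->released, NULL);
}

static void block_lock(block_lock_t* l) {
  pthread_mutex_lock(&l->guard);
  while (l->hold) {
    // Sleep until someone releases the lock; re-check on wakeup,
    // since another waiter may have grabbed it first.
    pthread_cond_wait(&l->released, &l->guard);
  }
  l->hold = 1;
  pthread_mutex_unlock(&l->guard);
}

static void block_unlock(block_lock_t* l) {
  pthread_mutex_lock(&l->guard);
  l->hold = 0;
  pthread_cond_signal(&l->released);  // wake one waiter
  pthread_mutex_unlock(&l->guard);
}
```

Note the while loop around pthread_cond_wait , mirroring the kernel version: a woken thread must compete for the lock again.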

A key element in the lock and unlock code above is the yieldlock defined inside mutex . This looks strange: mutex is itself a lock, yet its internal data needs another lock to protect it. Isn't this locks all the way down?

In implementation terms, mutex is already a rather complex lock: it maintains an internal waiting queue, and that queue obviously needs protection, hence the apparent paradox above. The key point is that the two layers of locks are essentially different in type and purpose:

  • mutex is a heavyweight lock provided for external use; its purpose and the object it protects are not fixed, and generally its critical section is relatively large and contention is fierce;
  • the internal yield lock is a lightweight lock whose purpose and protected object are fixed: it protects the mutex 's internal operations. This critical section can be kept very small, so introducing this lock is both necessary and reasonable.

The price of the internal yield lock is that it introduces new contention on top of the contention threads already have for the mutex itself. But this extra cost is unavoidable given the design and use case of mutex , and in a sense it is affordable and negligible: the critical section a mutex protects externally is usually considered large compared to the region protected by its internal yield lock.

Kernel and user mode locks

What has been discussed above are the principles and implementations of several locks and the differences in their usage scenarios. A very important distinguishing principle is the size of the critical section and the intensity of contention, which essentially reflect how easy (or how probable) it is for each thread to obtain the lock. On this basis, usage can be divided into two categories:

  • if the critical section is small and contention is not fierce, use a spin-type lock (including spinlock and yieldlock ), which is non-blocking;
  • if the critical section is very large, or contention is fierce, use a blocking lock.

In practice, choosing which lock to use in kernel mode is far from this simple. It also involves where the lock is used, e.g. interrupt context versus thread context, and many other considerations that bring restrictions and differences. I will try to write a separate article about these issues when I have time; consider this a placeholder.

In user mode, lock selection differs greatly from the kernel. The most discussed choice is spinlock versus blocking lock. As mentioned above, a blocking lock is usually chosen when the critical section is very large or contention is high, since it avoids massive CPU spinning and so seems to save CPU resources; but a user-mode thread must trap into the kernel to enter a blocking sleep, which is itself very expensive, and may not be as cost-effective as spinning in place (provided the CPU is multi-core). So the considerations here are very different from lock usage in kernel mode.

Summary

This article discussed the principles and implementation of locks. Limited by my own level, it is only my own superficial understanding; I hope it helps readers, and discussion and comments are welcome. In this scroll project, performance is not a consideration for the time being; for simplicity and safety, I used yieldlock extensively as the main lock in the kernel.
