Writing an OS kernel from scratch - system calls

System call

Building on the earlier articles in this series, this article starts to actually create processes with the familiar fork system call. First, though, we need to build the system call framework.

System calls need little introduction. They are the function interface that the kernel exposes to user programs, and the main way for a user program to actively request kernel functionality. Because the call crosses from user mode into kernel mode, it has to be triggered by an interrupt. int 0x80 is the classic approach on 32-bit Linux, and we will use the same software interrupt as our syscall entry.

Since syscalls exist to serve user programs, the implementation has two parts:

User part: a unified function interface that, underneath, triggers an interrupt through int 0x80;
Kernel part: handling that is similar to ordinary interrupt processing;

User interface

First, look at the user part. Note that this code is compiled and linked into the user program, not the kernel. It will be packaged in a form similar to a standard library and linked in when we write user programs later.

The code in this section lives mainly in the following files, listed from the top of the call chain to the bottom:

syscall.h and syscall.c, the user-level interface;
syscall_trigger.S, which triggers the interrupt and passes the parameters;

Start with the user-level interface in syscall.c. These are the syscall functions that user code calls directly, and they look much like the ones we are used to on Linux:

int32 fork();
int32 exec(char* path, uint32 argc, char* argv[]);

Underneath, they call the trigger functions provided by syscall_trigger.S, which is where the syscall interrupt is actually raised and the parameters are actually passed:

int32 fork() {
  return trigger_syscall_fork();
}

int32 exec(char* path, uint32 argc, char* argv[]) {
  return trigger_syscall_exec(path, argc, argv);
}

The trigger_syscall_xxx functions are defined in syscall_trigger.S.

All syscalls are triggered through the same int 0x80 interrupt, so each syscall is identified by a number, for example:

SYSCALL_FORK_NUM   equ  1
SYSCALL_EXEC_NUM   equ  2

In addition, a syscall differs from an ordinary interrupt in that parameters have to be passed. So in syscall_trigger.S we define one macro per parameter count. For a syscall without parameters, for example:

%macro DEFINE_SYSCALL_TRIGGER_0_PARAM 2
  [GLOBAL trigger_syscall_%1]
  trigger_syscall_%1:
    mov eax, %2
    int 0x80
    ret
%endmacro
        
DEFINE_SYSCALL_TRIGGER_0_PARAM   fork,   SYSCALL_FORK_NUM

Expanding the macro gives us the underlying trigger function for fork:

[GLOBAL trigger_syscall_fork]
trigger_syscall_fork:
  mov eax, SYSCALL_FORK_NUM
  int 0x80
  ret
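For readers more at home in C than in NASM, the same idea of stamping out one trigger per syscall can be sketched with the C preprocessor. This is only a host-side model, not the kernel's code: `fake_int80` is a hypothetical stand-in for the `int 0x80` instruction, and `sim_regs_t` is a type invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SYSCALL_FORK_NUM 1
#define SYSCALL_EXEC_NUM 2

/* Hypothetical register file used only by this simulation. */
typedef struct {
  uint32_t eax, ecx, edx, ebx;
} sim_regs_t;

/* Hypothetical stand-in for `int 0x80`. A real kernel would dispatch on
 * eax here; this model simply echoes the syscall number back. */
static int32_t fake_int80(const sim_regs_t* regs) {
  assert(regs != NULL);
  return (int32_t)regs->eax;
}

/* C analogue of DEFINE_SYSCALL_TRIGGER_0_PARAM: each use generates one
 * trigger_syscall_<name> function. */
#define DEFINE_SYSCALL_TRIGGER_0_PARAM(name, num) \
  int32_t trigger_syscall_##name(void) {          \
    sim_regs_t regs = {0};                        \
    regs.eax = (num);         /* mov eax, num */  \
    return fake_int80(&regs); /* int 0x80 */      \
  }

DEFINE_SYSCALL_TRIGGER_0_PARAM(fork, SYSCALL_FORK_NUM)
```

With this model, calling `trigger_syscall_fork()` returns the syscall number that the simulated kernel saw in eax.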

A syscall always carries at least one value: eax holds the syscall number. If the syscall itself takes parameters, further registers are used, such as ecx, edx, and ebx. Which register carries which parameter is simply a convention that we choose ourselves.

For example, exec has three parameters:

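[GLOBAL trigger_syscall_exec]
trigger_syscall_exec:
  push ebx                    ; ebx is callee-saved, preserve it

  mov eax, SYSCALL_EXEC_NUM   ; syscall number
  mov ecx, [esp + 8]          ; arg 1: path ([esp + 4] is the return address)
  mov edx, [esp + 12]         ; arg 2: argc
  mov ebx, [esp + 16]         ; arg 3: argv

  int 0x80

  pop ebx
  ret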

We use ecx, edx, and ebx, in that order, to pass the three parameters of trigger_syscall_exec. Note that ebx is pushed first and popped afterwards: under the x86 calling convention, ebx is a callee-saved register, so a function that modifies it must save and restore it itself.

Once the registers are loaded with the parameters, the trigger function raises the interrupt with int 0x80. This interrupt is the unified entry point for all system calls, and from here execution enters the kernel.
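The save, load, trap, and restore sequence can be modelled in plain C to make the role of ebx easier to see. This is a sketch under assumptions: `sim_cpu_t`, `fake_int80`, and `sim_trigger_exec` are hypothetical names, the arguments are modelled as 32-bit values as they would be on i386, and the simulated kernel does nothing except record what it saw.

```c
#include <assert.h>
#include <stdint.h>

#define SYSCALL_EXEC_NUM 2

/* Hypothetical model of the four registers the trigger touches. */
typedef struct {
  uint32_t eax, ecx, edx, ebx;
} sim_cpu_t;

static sim_cpu_t cpu;        /* the "live" registers */
static sim_cpu_t kernel_saw; /* snapshot taken at the simulated int 0x80 */

/* Stand-in for int 0x80: record the registers, leave a result in eax. */
static void fake_int80(void) {
  assert(cpu.eax != 0); /* syscall numbers in the article start at 1 */
  kernel_saw = cpu;
  cpu.eax = 0;          /* pretend the syscall succeeded */
}

int32_t sim_trigger_exec(uint32_t path, uint32_t argc, uint32_t argv) {
  uint32_t saved_ebx = cpu.ebx; /* push ebx */
  cpu.eax = SYSCALL_EXEC_NUM;   /* syscall number */
  cpu.ecx = path;               /* arg 1 */
  cpu.edx = argc;               /* arg 2 */
  cpu.ebx = argv;               /* arg 3 */
  fake_int80();                 /* int 0x80 */
  cpu.ebx = saved_ebx;          /* pop ebx */
  return (int32_t)cpu.eax;      /* return value comes back in eax */
}
```

After `sim_trigger_exec` returns, `cpu.ebx` holds whatever the caller had in it before the call, while `kernel_saw.ebx` holds the third argument.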

Kernel handles syscall

The main code for this section is in the following files:

syscall_wrapper.S, the unified entry point for syscall handling;
syscall_impl.h and syscall_impl.c, the actual syscall implementations;

Before any of that, though, a syscall is an interrupt, so we must first register a handler for interrupt 0x80 in src/interrupt/interrupt.c. The entry point is the syscall_entry function:

set_idt_gate(SYSCALL_INT_NUM,
             (uint32)syscall_entry,
             SELECTOR_K_CODE,
             IDT_GATE_ATTR_DPL3);

Note the IDT_GATE_ATTR_DPL3 attribute: the gate's privilege level must be 3, otherwise executing int 0x80 in user mode raises a general protection fault.

Now look at syscall_entry. It is essentially the same as the entry function of an ordinary interrupt, and it also has two halves.

The first half saves the user's context, including the general-purpose registers and the data segment register, and then calls syscall_handler to do the actual syscall dispatch.

syscall_entry:
  ; push dummy to match struct isr_params_t
  push byte 0
  push byte 0
  ; save common registers
  pusha
  ; save original data segment
  mov cx, ds
  push ecx
  ; load the kernel data segment descriptor
  mov cx, 0x10
  mov ds, cx
  mov es, cx
  mov fs, cx
  mov gs, cx

  sti  ; allow interrupt during syscall
  call syscall_handler

The second half is the return path. It resembles an ordinary interrupt return and restores the registers saved above. One thing to note is that eax must not be popped: a syscall has a return value, and at this point eax holds the return value of syscall_handler:

syscall_exit:
  ; recover the original data segment.
  ; Do NOT use eax because it's the syscall ret value!
  pop ecx
  mov ds, cx
  mov es, cx
  mov fs, cx
  mov gs, cx

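  pop  edi
  pop  esi
  pop  ebp
  ; skip the saved esp, as popa does: popping it into esp would
  ; move the stack pointer back to its value before pusha
  add  esp, 4
  pop  ebx
  pop  edx
  pop  ecx
  ; skip eax because it is used as return value
  ; for syscall_handler
  add  esp, 4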

  ; pop dummy values
  add esp, 8

  ; pop cs, eip, eflags, user_ss, and user_esp by processor
  iret

syscall_handler is the actual dispatch function. It reads the syscall number from eax and calls the matching syscall implementation:

int32 syscall_handler(isr_params_t isr_params) {
  // syscall num saved in eax.
  // args list: ecx, edx, ebx, esi, edi
  uint32 syscall_num = isr_params.eax;

  switch (syscall_num) {
    case SYSCALL_FORK_NUM:
      return syscall_fork_impl();
    case SYSCALL_EXEC_NUM:
      return syscall_exec_impl((char*)isr_params.ecx,
                               isr_params.edx,
                               (char**)isr_params.ebx);
    default: PANIC();
  }
}
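As the number of syscalls grows, a switch like the one above is often replaced by a lookup table. The following is a sketch of that alternative, not the article's implementation: `sim_params_t` carries only the argument registers, the two `sim_*_impl` functions are placeholders, and unknown numbers return -1 rather than calling PANIC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SYSCALL_FORK_NUM = 1, SYSCALL_EXEC_NUM = 2 };

/* Only the registers used for the number and the arguments. This is an
 * assumption: the real isr_params_t carries the whole interrupt frame. */
typedef struct {
  uint32_t eax, ecx, edx, ebx;
} sim_params_t;

typedef int32_t (*syscall_fn)(const sim_params_t*);

/* Placeholder implementations so that the sketch is self-contained. */
static int32_t sim_fork_impl(const sim_params_t* p) { (void)p; return 7; }
static int32_t sim_exec_impl(const sim_params_t* p) { return (int32_t)p->edx; }

/* Indexed by syscall number; slots that are not listed stay NULL. */
static const syscall_fn syscall_table[] = {
  [SYSCALL_FORK_NUM] = sim_fork_impl,
  [SYSCALL_EXEC_NUM] = sim_exec_impl,
};

int32_t sim_syscall_handler(const sim_params_t* p) {
  assert(p != NULL);
  size_t count = sizeof(syscall_table) / sizeof(syscall_table[0]);
  if (p->eax >= count || syscall_table[p->eax] == NULL) {
    return -1; /* unknown syscall: report an error instead of panicking */
  }
  return syscall_table[p->eax](p);
}
```

Whether an unknown syscall number should panic the kernel or just fail the call is a design choice. Returning an error keeps a buggy user program from taking the kernel down with it.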

Note that syscall_handler, like an ordinary interrupt handler, takes everything that was pushed onto the interrupt stack as a single isr_params structure parameter.

For an ordinary interrupt, the general-purpose register values saved on the stack exist only to save and restore the context from before the interrupt. In a syscall their role changes: some of them carry the syscall's parameters, which syscall_handler above reads and hands to the individual syscall implementations.
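To make the mapping between pushes and fields concrete, here is one layout the structure could have, inferred from the push order in syscall_entry. It is an assumption for illustration: the real isr_params_t may name or order its fields differently, and `dummy_0` and `dummy_1` are invented names for the two placeholder pushes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout. The stack grows downward, so the value pushed last
 * (ds) sits at the lowest address and therefore comes first. */
typedef struct {
  uint32_t ds;                  /* pushed by syscall_entry after pusha */
  uint32_t edi, esi, ebp, esp;  /* pusha block, lowest address first */
  uint32_t ebx, edx, ecx, eax;
  uint32_t dummy_1, dummy_0;    /* the two `push byte 0` placeholders */
  uint32_t eip, cs, eflags;     /* pushed by the CPU on int 0x80 */
  uint32_t user_esp, user_ss;   /* pushed by the CPU on a ring 3 to 0 switch */
} sim_isr_params_t;

_Static_assert(offsetof(sim_isr_params_t, eax) == 32,
               "eax follows ds and seven other registers");
_Static_assert(sizeof(sim_isr_params_t) == 64, "16 slots of 4 bytes each");

/* Returns syscall argument `index` (0-based), following the register
 * order noted in syscall_handler: ecx, edx, ebx, esi, edi. */
uint32_t sim_syscall_arg(const sim_isr_params_t* p, int index) {
  assert(p != NULL && index >= 0 && index < 5);
  const uint32_t args[5] = { p->ecx, p->edx, p->ebx, p->esi, p->edi };
  return args[index];
}
```

Under this layout the handler never touches the user's registers directly. It reads the copies that the entry stub pushed onto the kernel stack.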

Recall where the register values used for parameter passing were set: in the user-side trigger_syscall_xxx function that raises the syscall, where the arguments the user passed to the syscall are loaded into the registers:

trigger_syscall_exec:
  push ebx

  mov eax, SYSCALL_EXEC_NUM
  mov ecx, [esp + 8]
  mov edx, [esp + 12]
  mov ebx, [esp + 16]
  
  int 0x80

  pop ebx
  ret

To make the whole parameter-passing chain of a syscall explicit:

In the user-side trigger function, the parameters are loaded into general-purpose registers;
The interrupt is raised and execution switches to the kernel stack, where the values of these registers are pushed onto the interrupt stack, wrapped up as the isr_params structure, and handed to syscall_handler;

We also noticed that when a callee-saved register is used to pass a parameter, its value is first saved on the user stack, as with ebx above. In other words, part of the user context is saved and restored by trigger_syscall_xxx on the user stack rather than after entering the interrupt. The copy of such a register that lands on the interrupt stack already holds the syscall parameter, not the caller's original value, so the original has to be saved on the user stack beforehand. This is another difference between a syscall and an ordinary interrupt.

The underlying reason is that a syscall is initiated deliberately rather than arriving unpredictably like an ordinary interrupt, so it behaves much more like an ordinary function call. As long as the caller (the user side) follows the x86 calling convention and first saves the callee-saved registers on its own stack, it can freely use those registers to pass parameters, then execute int 0x80 and enter the kernel stack.

Implementation of fork

Everything above is the syscall framework. Now we implement the first syscall: fork.

In syscall_handler, fork is dispatched to the syscall_fork_impl function. The actual work is done by the process_fork function in src/task/process.cff.

You are probably already familiar with how fork is used:

int pid = fork();
if (pid > 0) {
  // parent process
} else if (pid == 0) {
  // child process
} else {
  // fork failed
}

Unfortunately, our first system call, fork, is a fairly complicated one. fork creates a new child process that is a copy of the parent. Both processes return from fork and continue executing; the only difference is the return value. In the parent, fork returns the pid of the newly created child; in the child, it returns 0.
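The two return paths can be observed on any POSIX host before we implement them ourselves. The sketch below uses the host's fork, waitpid, and _exit from the C library, not this kernel's syscalls, and `fork_and_collect` is a helper name made up for the example.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks once. The child exits immediately with `code`; the parent waits
 * for it and returns the child's exit status, or -1 on failure. */
int fork_and_collect(int code) {
  assert(code >= 0 && code <= 255);
  pid_t pid = fork();
  if (pid < 0) {
    return -1;    /* fork failed, no child was created */
  }
  if (pid == 0) {
    _exit(code);  /* child: fork returned 0 */
  }
  int status = 0; /* parent: fork returned the child's pid */
  if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
    return -1;
  }
  return WEXITSTATUS(status);
}
```

The child takes the `pid == 0` branch and the parent takes the other one, even though both continue from the same call site.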

First, create_process is called to create a brand new process structure and initialize its fields. Note that the child's page directory is cloned from the parent's, so both start out with the same view of virtual memory:

pcb_t* create_process(char* name, uint8 is_kernel_process) {
  pcb_t* process = (pcb_t*)kmalloc(sizeof(pcb_t));
  memset(process, 0, sizeof(pcb_t));
  //...
  // Clone the current (parent) page directory for the new process.
  process->page_dir = clone_crt_page_dir();
  return process;
}

Next comes the most critical function, fork_crt_thread, which copies the current thread. Its main job is to copy the current kernel stack and then arrange the copy to look like the stack of a newly created thread, so that when the child thread is scheduled later it can start up like any new thread. Although the child is running for the first time, it appears to be returning from fork just like the parent.

Recall what the stack looks like when a kernel thread starts.

The stack starts at kernel_esp. The general-purpose registers are popped first, and then execution jumps to start_eip as the entry point (the thread_entry_eip field in the code below). Here we set the child thread's start_eip to syscall_fork_exit:

thread->kernel_esp = kernel_stack + KERNEL_STACK_SIZE
    - (sizeof(interrupt_stack_t) + sizeof(switch_stack_t));

switch_stack_t* switch_stack = (switch_stack_t*)thread->kernel_esp;
switch_stack->thread_entry_eip = (uint32)syscall_fork_exit;
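The stack arithmetic can be checked on a host machine with ordinary byte arrays standing in for kernel stacks. This is a sketch under assumptions: the stack size and both struct definitions are invented for the example, the entry address is modelled as a 32-bit value, and only the interrupt frame at the top of the stack is copied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KERNEL_STACK_SIZE 4096u /* assumed size for this sketch */

/* Assumed shapes. The real interrupt_stack_t and switch_stack_t may differ. */
typedef struct { uint32_t slots[16]; } sim_interrupt_stack_t;
typedef struct {
  uint32_t edi, esi, ebp, ebx;
  uint32_t thread_entry_eip;
} sim_switch_stack_t;

/* Prepares `child` from `parent` and returns the child's initial
 * kernel_esp. Both arguments point at the lowest address of a stack. */
uint8_t* sim_fork_kernel_stack(uint8_t* child, const uint8_t* parent,
                               uint32_t entry_eip) {
  assert(child != NULL && parent != NULL);
  size_t frame_off = KERNEL_STACK_SIZE - sizeof(sim_interrupt_stack_t);

  /* The child's interrupt frame is a byte-for-byte copy of the parent's. */
  memcpy(child + frame_off, parent + frame_off,
         sizeof(sim_interrupt_stack_t));

  /* The switch stack sits directly below the interrupt frame. */
  sim_switch_stack_t sw = {0};
  sw.thread_entry_eip = entry_eip;
  uint8_t* kernel_esp = child + frame_off - sizeof(sim_switch_stack_t);
  memcpy(kernel_esp, &sw, sizeof(sw));
  return kernel_esp;
}
```

The returned pointer corresponds to kernel_esp in the snippet above: the top of the stack minus the sizes of the two structures.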

Strictly speaking, syscall_fork_child_exit would be a better name for syscall_fork_exit. It is used by the child process to return after the fork completes, and it differs from the normal syscall return in how the general-purpose registers are restored:

  pop edi
  pop esi
  pop ebp
  ; Do NOT pop old esp!
  ; Child process is its own stack, not parent's.
  add esp, 4
  pop ebx
  pop edx
  pop ecx
  ; child process returns 0.
  mov eax, 0
  add esp, 4

esp and eax get special treatment:

esp: the value saved on the stack is the parent's esp, and the child has its own stack, so it is skipped;
eax: this is fork's return value, which must be 0 in the child;
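The effect of those pops on the child's registers can be modelled in a few lines of C. This is a host-side sketch with invented names (`sim_pusha_t`, `sim_fork_child_restore`), not code from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The eight slots written by pusha, lowest address first. */
typedef struct {
  uint32_t edi, esi, ebp, esp, ebx, edx, ecx, eax;
} sim_pusha_t;

/* Models the pops in syscall_fork_exit. `saved` is the frame copied
 * from the parent; `cpu` is the child's register state being rebuilt. */
void sim_fork_child_restore(const sim_pusha_t* saved, sim_pusha_t* cpu) {
  assert(saved != NULL && cpu != NULL);
  cpu->edi = saved->edi;
  cpu->esi = saved->esi;
  cpu->ebp = saved->ebp;
  /* saved->esp is skipped: it points into the parent's kernel stack. */
  cpu->ebx = saved->ebx;
  cpu->edx = saved->edx;
  cpu->ecx = saved->ecx;
  cpu->eax = 0; /* fork returns 0 in the child */
}
```

Every register except esp and eax ends up identical to the parent's saved value, which is what lets the child resume as if it had made the fork call itself.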

Execution then reaches iret and the interrupt returns. At this point the CPU restores the state the user thread was in before the syscall.

That state is the user thread's code and stack information:

code: saved in cs + eip;
stack: saved in user_ss + user_esp;

This information is identical to what is on the parent's stack, because the child's kernel stack was copied from the parent's. That is why, after returning to user mode, the child can continue from the fork call just as the parent does, as if the parent had mirrored itself into a second task. Their virtual memory spaces are nevertheless isolated, using the copy-on-write mechanism described in the previous article.

Once fork_crt_thread returns, the parent finishes the remaining work of creating the child process and then returns. Its return value is the pid of the child process that was just created:

// Create new process and fork current thread.
pcb_t* process = create_process(nullptr, /* is_kernel_process = */ false);
tcb_t* thread = fork_crt_thread();
if (thread == nullptr) {
  return -1;
}

// Bind child thread to child process.
add_process_thread(process, thread);

// Add child thread to scheduler to run.
add_thread_to_schedule(thread);

// Parent should return the new pid.
return process->id;

As you can see, the parent returns from the syscall normally, while the child's kernel stack has been modified by us so that it runs like a thread starting for the first time. Two points need attention:

Its interrupt stack must match the parent's, so that on interrupt return it restores the same user thread environment as the parent. After returning to user mode, the child therefore looks like a task identical to the parent that simply keeps running, which is exactly what fork is meant to do;
Its return value must be 0;

Summary

This article covered quite a lot. First we implemented the syscall framework. It is important to distinguish the responsibilities of the user side and the kernel side, and to understand the similarities and differences between a syscall and an ordinary interrupt. The most important thing here is how data flows through the registers and the stacks. On that foundation we implemented fork, the most challenging of the syscalls. Hopefully it helps you understand the syscall entry and return mechanism in depth.

navi
