"In-depth understanding of computer systems-anomalies"

Modern operating systems react to unexpected situations (data arriving from a disk read or write, a hardware timer firing, and so on) by making abrupt changes in the control flow. These abrupt changes are collectively called exceptional control flow (ECF). ECF occurs at every level of a computer system. At the hardware level, events detected by the hardware trigger an abrupt transfer of control to an exception handler. At the operating system level, the kernel transfers control from one user process to another through context switches. At the application level, a process can send a signal to another process, and the recipient abruptly transfers control to a signal handler. An individual program can also react to errors by sidestepping the usual stack discipline and making nonlocal jumps to arbitrary locations in other functions.

Why do I need to understand ECF?

Helps to understand important system concepts
Helps to understand how the application interacts with the operating system
Helps to understand concurrency
Helps to understand how software exceptions work (such as the C++/Java try-catch-throw mechanism)

Exceptions

Figure: exceptional control flow (异常控制流.png)

An exception is a form of exceptional control flow that is implemented partly by the hardware and partly by the operating system.

An exception is an abrupt change in the control flow in response to some change in the processor's state. In the figure above, the processor is executing some current instruction Icur when a significant change in the processor's state occurs. The state is encoded in various bits and signals inside the processor, and a change of state is called an event. The event may be directly related to the execution of the current instruction, such as a virtual memory page fault, an arithmetic overflow, or a division by zero. It may also be unrelated to the current instruction, such as a system timer going off or an I/O request completing.

In any case, when the processor detects that an event has occurred, it makes an indirect procedure call through a jump table called the exception table to an operating system subroutine (the exception handler) designed specifically to handle that kind of event. When the exception handler finishes processing, one of the following happens, depending on the type of event that caused the exception:

Re-execute Icur (if page fault occurs)
Continue to execute I_next (such as receiving I/O device signal)
Terminate the program (such as receiving a kill signal)

Exception handling

Figure: exception table (异常表.jpeg)

Each type of possible exception in the system is assigned a unique non-negative integer exception number. Some of these numbers are assigned by the designers of the processor, and the others by the designers of the operating system kernel (the memory-resident part of the operating system). Examples of the former include division by zero, page faults, memory access violations (such as segmentation faults), breakpoints, and arithmetic overflows. Examples of the latter include system calls and signals from external I/O devices. At system startup (when the computer is reset or powered on), the operating system allocates and initializes a jump table called the exception table, in which entry k contains the address of the handler for exception k. The starting address of the exception table is kept in a special CPU register called the exception table base register.
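The table lookup described above can be modelled in user-level C as an array of function pointers. This is only a sketch of the idea, not how any real kernel lays out its table: the table size, the handler names, and the return codes are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_EXCEPTIONS 4            /* hypothetical table size */

typedef int (*handler_t)(void);     /* hypothetical handler signature */

/* Hypothetical handlers; each returns a code so callers can tell them apart. */
static int divide_error_handler(void) { return 100; }
static int page_fault_handler(void)   { return 101; }
static int default_handler(void)      { return -1; }

/* The "exception table": entry k holds the address of the handler for exception k. */
static handler_t exception_table[NUM_EXCEPTIONS];

/* Done once at "boot": fill every entry, then install the specific handlers. */
void init_exception_table(void)
{
    for (size_t k = 0; k < NUM_EXCEPTIONS; k++)
        exception_table[k] = default_handler;
    exception_table[0] = divide_error_handler;
    exception_table[1] = page_fault_handler;
}

/* What the processor does on an event: index the table and call indirectly. */
int dispatch_exception(unsigned k)
{
    assert(k < NUM_EXCEPTIONS);
    return exception_table[k]();
}
```

After calling init_exception_table(), dispatch_exception(1) runs the page fault stand-in, while any number without a specific handler falls through to default_handler.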

Exceptions are similar to procedure calls, but there are still important differences:

As with a procedure call, the processor pushes a return address onto the stack before jumping to the handler. Depending on the type of exception, however, the return address is either the current instruction or the next instruction
The processor also pushes some additional processor state onto the stack; this state is needed to restart the interrupted program when the handler returns
If control is transferred from a user program to the kernel, all of these items are pushed onto the kernel stack rather than the user stack
The exception handler runs in kernel mode, which means that it has full access to all system resources (Q: What about user-specified exception handlers?)

When the exception handler has finished, it optionally returns to the interrupted program by executing a special "return from interrupt" instruction, which pops the saved state back into the processor's control and data registers. If the exception interrupted a user program, the processor is restored to user mode and control returns to the interrupted program.

Classes of exceptions

Exceptions are divided into 4 categories: interrupts, traps, faults, and aborts:

Interrupt: Occurs asynchronously as the result of a signal from an I/O device external to the processor (for example, a disk finishing a read). Typically the device signals a pin on the processor chip and places the exception number, which identifies the device that caused the interrupt, on the system bus. After the current instruction completes, the processor notices that the interrupt pin has gone high, reads the exception number from the system bus, and calls the appropriate interrupt handler. When the handler returns, control passes to the next instruction I_next.
Trap and system call: An intentional exception that occurs as the result of executing an instruction. Traps are how user programs request kernel services such as reading and writing files, fork, execve, and exit; library functions such as malloc use them indirectly when they need more memory from the kernel. The processor provides a special "syscall n" instruction for this purpose, where n is the system call number. The operating system keeps a system call table in which entry i holds the address of the handler for system call i. When the trap handler finishes, the processor switches back to user mode and executes the next instruction I_next. An ordinary function runs in user mode and can only access the same stack as its caller, whereas a system call runs in kernel mode, which allows it to execute privileged instructions and to access a stack defined in the kernel.
Fault: Results from an error condition that a fault handler might be able to correct. When a fault occurs, the processor transfers control to the fault handler. If the handler can correct the error, it returns control to the faulting instruction, which is then re-executed. Otherwise, the handler returns to an abort routine in the kernel, which terminates the application that caused the fault. The classic example of a fault is the page fault.
Abort: The result of an unrecoverable fatal error, typically a hardware error such as a parity error caused by corrupted DRAM or SRAM bits. The abort handler never returns control to the application; it returns to an abort routine in the kernel.

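A Linux system call wrapper first writes the system call number into register %rax, then writes the arguments (for example, the number of bytes requested) into registers such as %rdi, and finally executes the "syscall" instruction to invoke the system call.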
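The register setup is normally hidden inside the C library, but glibc's generic syscall() function lets a program pick the system call number itself. The sketch below assumes Linux with glibc; raw_getpid and raw_write are illustrative names, not part of any library.

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

/* Ask the kernel for our PID through the generic syscall() entry point.
   SYS_getpid is the system call number that ends up in %rax on x86-64. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* Write a buffer to a file descriptor without going through write().
   fd, buf, and len are passed on as the first three syscall arguments. */
long raw_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Calling raw_getpid() should give the same answer as the ordinary getpid() wrapper, since both end up in the same kernel handler.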

Processes

The classic definition of a process is an instance of an executing program. Every program in the system runs in the context of a certain process. The context consists of the state required for the correct operation of the program. This state includes the code and data of the program stored in the memory, its stack, the contents of general purpose registers, the program counter, environment variables, a collection of opened file descriptors, and so on.

Logical control flow

Figure: logical control flows (逻辑控制流.png)

Processes take turns using the processor. Each process executes a portion of its flow, is preempted, and then waits while other processes take their turns. To a program running in the context of one of these processes, it appears to have exclusive use of the processor.

Concurrent flows

A logical flow whose execution overlaps in time with another flow is called a concurrent flow, and the two flows are said to run concurrently. The general phenomenon of multiple flows executing concurrently is known as concurrency. The notion of a process taking turns with other processes is known as multitasking. Each time period during which a process executes a portion of its flow is called a time slice.

Private address space

Figure: private address space of a process (进程私有地址空间.png)

A process also gives each program the illusion that it has exclusive use of the system's address space. Each process provides its program with its own private address space. In general, the bytes of memory associated with a particular address in this space (the virtual address space) cannot be read or written by any other process.

Although the contents of the memory associated with each private address space differ in general, every such space has the same overall layout. The bottom portion of the address space is reserved for the user program, with the usual code, data, heap, and stack segments. On x86-64 Linux the code segment of a non-position-independent executable starts at address 0x400000. The top portion of the address space is reserved for the kernel (the memory-resident part of the operating system). It contains the code, data, and stack that the kernel uses when it executes instructions on behalf of the process (for example, when the application makes a system call).

User mode and kernel mode

To restrict the instructions that an application can execute and the portions of the address space that it can access, the processor uses a mode bit in a control register that describes the privileges the process currently enjoys. When the mode bit is set, the process is running in kernel mode: it can execute any instruction in the instruction set and access any memory location in the system. When the mode bit is not set, the process is running in user mode: it is not allowed to execute privileged instructions (such as halting the processor, changing the mode bit, or initiating an I/O operation), nor to directly reference code or data in the kernel area of the address space. User programs must access kernel code and data indirectly through system calls.

The only way for a process to change from user mode to kernel mode is through an exception such as an interrupt, a fault, or a trapping system call. When the exception occurs, control passes to the exception handler and the processor changes the mode from user mode to kernel mode. When the handler returns to the application code, the processor changes the mode back to user mode.

The /proc file system in Linux allows user-mode processes to access the contents of the kernel structure. The /proc file system outputs the contents of many kernel data structures as a hierarchical structure of text files that can be read by user programs.

/proc/cpuinfo
/proc/$pid/maps, etc.
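Because /proc entries look like ordinary text files, a user program can read them with plain stdio calls. The sketch below assumes a Linux system with /proc mounted; the helper name resident_pages is hypothetical. It reads /proc/self/statm, whose second field is the resident set size in pages.

```c
#include <stdio.h>

/* Return this process's resident set size in pages, or -1 on failure.
   /proc/self/statm holds whitespace-separated page counts: the first field
   is the total program size and the second is the resident set size. */
long resident_pages(void)
{
    FILE *fp = fopen("/proc/self/statm", "r");
    if (fp == NULL)
        return -1;

    long total = 0, resident = 0;
    int fields = fscanf(fp, "%ld %ld", &total, &resident);
    fclose(fp);
    return fields == 2 ? resident : -1;
}
```

No disk access is involved here: the kernel generates the text at the moment the file is read.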

Thinking:
Is /proc stored on disk? If not, how is it implemented?
Write a program that imitates /proc by writing the current process's ID and memory usage to a file.

Context switch

The kernel maintains a context for each process. The context is the state that the kernel needs to restart a preempted process. It consists of the values of objects such as the general-purpose registers, the floating-point registers, the program counter, the user stack, the status registers, the kernel stack, and various kernel data structures, such as the page table that describes the address space, the process table that contains information about the current process, and the file table that contains information about the files the process has opened.

System calls may cause context switches, for example when a process blocks on an I/O read or write. Interrupts may also cause context switches. For example, all systems have some mechanism for generating periodic timer interrupts, typically every 1 ms or 10 ms. Each time a timer interrupt occurs, the kernel can decide that the current process has run long enough and switch to a new process.

System call error handling

When Unix system-level functions encounter an error, they typically return -1 and set the global variable errno to indicate what went wrong. Programs should always check for errors.

if ((pid = fork()) < 0) {
    // strerror returns a text string that describes the error associated with a particular errno value.
    fprintf(stderr, "fork error: %s\n", strerror(errno));
    exit(1);  // a nonzero exit status reports the failure
}

Process control

Unix provides a large number of system calls for manipulating processes from C programs.

pid_t getpid(void);
pid_t getppid(void);
pid_t fork(void);
void exit(int status);

The newly created child process is almost, but not quite, identical to the parent. The child gets an identical (but separate) copy of the parent's user-level virtual address space, including the code and data segments, heap, shared libraries, and user stack. The child also gets identical copies of any of the parent's open file descriptors, which means the child can read and write any file that was open in the parent when it called fork. Because the parent and the child each have their own private address space, any subsequent changes that either of them makes are private and are not reflected in the memory of the other process.
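The separation of the two address spaces can be checked with a small experiment. In this sketch (the function name is hypothetical) the child overwrites a local variable and exits, and the parent then looks at its own copy.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that modifies its copy of x; return the parent's x afterwards.
   Returns -1 if fork or waitpid fails. */
int parent_value_after_child_write(void)
{
    int x = 1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {          /* child: fork returned 0 */
        x = 42;              /* changes only the child's private copy */
        _exit(0);
    }
    /* parent: fork returned the child's PID */
    if (waitpid(pid, NULL, 0) < 0)
        return -1;
    return x;                /* the parent's copy is untouched */
}
```

The parent still sees x equal to 1, because the child's assignment landed in a different address space.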

The fork function is only called once, but it will return twice. Once in the parent process and once in the newly created child process. In a program with multiple fork instances, this can easily be confusing. How many hellos are output in the following example?

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    fork();
    fork();
    printf("hello\n");
    exit(0);
}

When a process terminates for any reason, the kernel does not remove it from the system immediately. Instead, the process is kept around in a terminated state until it is reaped by its parent (that is, until the parent collects the child's exit status). When the parent reaps the terminated child, the kernel passes the child's exit status to the parent and then discards the terminated process, at which point it ceases to exist. A terminated process that has not yet been reaped is called a zombie. (Zombies still consume system memory resources, so we should always take care to reap the children we create.)

If a parent process terminates, the kernel arranges for the init process to become the adoptive parent of any orphaned children. The init process has PID 1, is created by the kernel during system startup, never terminates, and is the ancestor of every process. If a parent process terminates without reaping its zombie children, the kernel arranges for the init process to reap them.
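Reaping is done with waitpid. As a minimal sketch (reap_child is a hypothetical helper), the parent below creates a child that exits immediately, reaps it, and decodes the exit status with the standard WIFEXITED and WEXITSTATUS macros.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a child that exits with `code`, reap it, and return the exit status
   the kernel hands back. Returns -1 on any failure. */
int reap_child(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                      /* child is a zombie until reaped */

    int status;
    if (waitpid(pid, &status, 0) != pid)  /* reap: the kernel discards the zombie */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Only the low 8 bits of the exit code survive the trip, so the example sticks to small values such as reap_child(7).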

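// On success, returns the PID of the child. pid = -1 means wait for any child process. statusp receives the child's exit status.
// On error, returns -1 and sets errno (ECHILD if the caller has no children).
pid_t waitpid(pid_t pid, int *statusp, int options);
// Simplified version of waitpid, equivalent to waitpid(-1, &status, 0)
pid_t wait(int *statusp);

// Suspends the process for the specified amount of time
unsigned int sleep(unsigned int secs);
// Puts the process to sleep until it receives a signal
int pause(void);

// Loads and runs the executable object file filename; argv is the argument list and envp is the environment variable list
int execve(const char *filename, char *const argv[], char *const envp[]);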

execve loads and runs a new program in the context of the current process. It overwrites the address space of the current process but does not create a new process, and the new program inherits all file descriptors that were open when execve was called. See the "Linking" section.
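Since execve replaces the calling program, it is usually paired with fork so that the original program survives. The sketch below assumes a POSIX shell at /bin/sh; run_shell is a hypothetical helper that runs one shell command in a child and returns its exit status.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `/bin/sh -c command` in a child and return its exit status, or -1. */
int run_shell(const char *command)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        char *const argv[] = { "sh", "-c", (char *)command, NULL };
        char *const envp[] = { NULL };       /* empty environment */
        execve("/bin/sh", argv, envp);       /* returns only on error */
        _exit(127);                          /* reached only if execve failed */
    }
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child keeps its PID across the execve call, which is why the parent can wait on the same pid that fork returned.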

Signals

A Linux signal is a higher-level software form of exception that allows processes and the kernel to interrupt other processes.

Figure: Linux signals (Linux信号.png)

The figure above shows the signals supported on Linux systems. The first 30 are the most common in practice. Each signal type corresponds to some kind of system event. Low-level hardware exceptions are processed by the kernel's exception handlers and are normally invisible to user processes. Signals provide a mechanism for notifying user processes that such exceptions have occurred. For example, if a process attempts to divide by zero, the kernel sends it a SIGFPE signal; typing Ctrl-C sends SIGINT; typing Ctrl-Z sends SIGTSTP; SIGKILL forcibly terminates a process (it can be neither caught nor ignored, so its handler cannot be overridden); and SIGCHLD is sent to a parent when one of its children terminates or stops.

The process of transmitting a signal to the destination consists of two different steps.

Sending a signal: The kernel sends (delivers) a signal to a destination process by updating some state in the context of that process (its signal bit vector). A signal is sent for one of two reasons: the kernel has detected a system event, such as a divide-by-zero error or the termination of a child process, or a process has called kill to explicitly ask the kernel to send a signal to the destination process. A process can send a signal to itself.
Receiving a signal: The destination process receives a signal when the kernel forces it to react in some way to the delivery of the signal. The process can ignore the signal, terminate, or catch the signal by executing a user-level function called a signal handler.

A signal that has been sent but not yet received is called a pending signal. At any point in time there can be at most one pending signal of a particular type. So if a process already has a pending signal of type k, any further signals of type k sent to it before that one is received are simply discarded; they are not queued.

For each process, the kernel maintains the set of pending signals in the pending bit vector and the set of blocked signals in the blocked bit vector. Sending signal k therefore amounts to the kernel setting bit k of pending, and receiving it amounts to clearing that bit.
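The one-bit-per-signal behaviour can be observed from user code. The following sketch (the function and handler names are hypothetical, and it assumes a single-threaded process with SIGUSR1 not otherwise in use) blocks SIGUSR1, raises it twice, and then unblocks it to see how many times the handler runs.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t deliveries = 0;

static void count_handler(int sig)
{
    (void)sig;
    deliveries++;
}

/* Block SIGUSR1, send it twice, then unblock. Returns how many times the
   handler ran, or -1 on failure. */
int count_coalesced_deliveries(void)
{
    struct sigaction sa;
    sigset_t block, pending;

    sa.sa_handler = count_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) < 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, NULL) < 0)   /* set the blocked bit */
        return -1;

    deliveries = 0;
    raise(SIGUSR1);                 /* sets the pending bit */
    raise(SIGUSR1);                 /* bit already set: this one is discarded */

    if (sigpending(&pending) < 0 || !sigismember(&pending, SIGUSR1))
        return -1;

    sigprocmask(SIG_UNBLOCK, &block, NULL);  /* pending signal is delivered here */
    return deliveries;
}
```

Two signals were sent, but the handler runs only once, which matches the rule that a standard signal type has a single pending bit.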

Sending signals

// Send signal sig to process pid
// From the shell:
kill -$sig $pid
// From C:
int kill(pid_t pid, int sig);
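As a small illustration of the kill function, the sketch below (kill_child_with is a hypothetical helper) forks a child that does nothing but wait, sends it a signal, and then asks waitpid which signal ended it. It assumes the signal's default action is to terminate the process and that the signal is not blocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that sleeps, send it `sig`, and return the number of the
   signal that terminated it (or -1 on failure or normal exit). */
int kill_child_with(int sig)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();                 /* sleep until a signal arrives */
    }
    if (kill(pid, sig) < 0)          /* ask the kernel to send sig to the child */
        return -1;

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

The WIFSIGNALED and WTERMSIG macros let the parent distinguish a child that was killed by a signal from one that exited normally.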

When we start a job in the shell (such as ls | sort), two processes are created, and both belong to the same process group. When we type Ctrl-C, the kernel sends SIGINT to every process in the foreground process group.

Receiving signals

When the kernel switches a process p from kernel mode to user mode (for example, when returning from a system call or completing a context switch), it checks the set of unblocked pending signals for p. If this set is empty, the kernel passes control to the next instruction I_next in the logical control flow of p. If the set is not empty, the kernel chooses some signal k in the set (typically the one with the smallest number) and forces p to receive signal k. Receipt of the signal triggers some action by the process (for example, running a signal handler). Once the process completes the action, control passes back to the next instruction I_next in the logical control flow of p. Each signal type has a predefined default action. A user program can override the default action for most signals, but not for SIGSTOP and SIGKILL. The default action is one of the following:

The process terminates (for example, on SIGKILL)
The process terminates and dumps core
The process stops until it is restarted by a SIGCONT signal
The process ignores the signal (for example, SIGCHLD)
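The selection step described above (pending and not blocked, smallest number first) comes down to a few bit operations. The following is a toy model, not kernel code: the 32-bit vector type and the function name are hypothetical, and bit k-1 stands for signal k.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: bit k-1 of each vector stands for signal k (1..32). */
typedef uint32_t sigvec_t;

/* Return the lowest-numbered pending, unblocked signal and clear its pending
   bit, or return 0 if there is nothing to receive. */
int next_signal(sigvec_t *pending, sigvec_t blocked)
{
    assert(pending != NULL);
    sigvec_t ready = *pending & ~blocked;
    if (ready == 0)
        return 0;

    int k = 1;
    while ((ready & 1u) == 0) {     /* scan upward for the first set bit */
        ready >>= 1;
        k++;
    }
    *pending &= ~((sigvec_t)1 << (k - 1));   /* receiving clears the bit */
    return k;
}
```

A blocked signal stays in pending untouched, so it is picked up by a later call once it has been unblocked.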

A signal handler can itself be interrupted by another signal handler.

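Blocking and unblocking signals

// how: SIG_BLOCK adds the signals in set to the blocked set; SIG_UNBLOCK removes them so that they can be received again
// set: the set of signals to operate on
// oldset: if non-NULL, the previous value of the blocked bit vector is stored in oldset
int sigprocmask(int how, const sigset_t *set, sigset_t *oldset);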

Guidelines for writing signal handlers

Signal handling is tricky: handlers run concurrently with the main program and share the same global variables, so they can interfere with each other; different systems have different signal-handling semantics; and a handler can itself be interrupted by other signals. We should therefore follow these guidelines when writing signal handlers:

Keep handlers as simple as possible
Call only async-signal-safe functions in handlers (functions that are either reentrant or cannot be interrupted by a signal handler). printf, malloc, and exit are not async-signal-safe
Save and restore errno. Many async-signal-safe functions set errno when they return with an error, so calling them in a handler can interfere with other parts of the program that rely on errno
Protect accesses to shared global data structures by blocking all signals. If a handler shares a global data structure with the main program, both should block all signals while accessing it
Declare global variables with volatile
Declare flags with sig_atomic_t
Receiving a signal only tells us that at least one event of that type has occurred, not how many (because repeated pending signals are discarded)
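Several of these guidelines fit into one very small handler. The sketch below uses plain POSIX calls rather than the csapp.h wrappers; the function names and the message text are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

/* Flag shared with the main program: volatile so reads are not cached in a
   register, sig_atomic_t so a single read or write cannot be interrupted. */
static volatile sig_atomic_t got_signal = 0;

static void flag_handler(int sig)
{
    int saved_errno = errno;            /* write() below may overwrite errno */
    static const char msg[] = "signal caught\n";

    (void)sig;
    got_signal = 1;
    (void)write(STDERR_FILENO, msg, sizeof msg - 1);   /* async-signal-safe */
    errno = saved_errno;
}

/* Install the handler for signum; returns 0 on success, -1 on failure. */
int install_flag_handler(int signum)
{
    struct sigaction sa;
    sa.sa_handler = flag_handler;
    sigfillset(&sa.sa_mask);            /* block all signals while the handler runs */
    sa.sa_flags = SA_RESTART;
    return sigaction(signum, &sa, NULL);
}

/* Report whether the flag was set, clearing it for the next round. */
int consume_flag(void)
{
    int seen = got_signal;
    got_signal = 0;
    return seen;
}
```

The handler only records that something happened; the main program does the real work later, when it calls consume_flag() at a point of its own choosing.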

In the following example, the job list is a global data structure shared by the main program and the handler, and in a real application the operations on it are generally not atomic.

#include "csapp.h"

void initjobs()
{
}

void addjob(int pid)
{
}

void deletejob(int pid)
{
}

/* $begin procmask2 */
void handler(int sig)
{
    int olderrno = errno;
    sigset_t mask_all, prev_all;
    pid_t pid;

    Sigfillset(&mask_all);
    while ((pid = waitpid(-1, NULL, 0)) > 0) { /* Reap a zombie child */
        // Block all other signals so the job list cannot be modified concurrently
        Sigprocmask(SIG_BLOCK, &mask_all, &prev_all);
        deletejob(pid); /* Delete the child from the job list */
        Sigprocmask(SIG_SETMASK, &prev_all, NULL);
    }
    if (errno != ECHILD)
        Sio_error("waitpid error");
    errno = olderrno;
}
    
int main(int argc, char **argv)
{
    int pid;
    sigset_t mask_all, mask_one, prev_one;

    Sigfillset(&mask_all);
    Sigemptyset(&mask_one);
    Sigaddset(&mask_one, SIGCHLD);
    Signal(SIGCHLD, handler);
    initjobs(); /* Initialize the job list */

    while (1) {
        Sigprocmask(SIG_BLOCK, &mask_one, &prev_one); /* Block SIGCHLD */
        if ((pid = Fork()) == 0) { /* Child process */
            Sigprocmask(SIG_SETMASK, &prev_one, NULL); /* Unblock SIGCHLD */
            Execve("/bin/date", argv, NULL);
        }
        Sigprocmask(SIG_BLOCK, &mask_all, NULL); /* Parent process */  
        addjob(pid);  /* Add the child to the job list */
        Sigprocmask(SIG_SETMASK, &prev_one, NULL);  /* Unblock SIGCHLD */
    }
    exit(0);
}
/* $end procmask2 */

Non-local jump

C provides a form of user-level exceptional control flow, called a nonlocal jump, that transfers control directly from one function to another currently executing function without going through the normal call-and-return sequence. Nonlocal jumps are provided by the setjmp and longjmp functions.

int setjmp(jmp_buf env);
void longjmp(jmp_buf env, int retval);
// sigsetjmp and siglongjmp are versions of setjmp and longjmp that can be used by signal handlers
int sigsetjmp(sigjmp_buf env, int savesigs);
void siglongjmp(sigjmp_buf env, int retval);

The setjmp function saves the current calling environment in the env buffer for later use by longjmp, and returns 0. The calling environment includes the program counter, the stack pointer, and the general-purpose registers. Note that the value returned by setjmp should not be assigned to a variable (think about why), although it can safely be tested in a switch or conditional statement.

rc = setjmp(env);       // Wrong

The longjmp function restores the calling environment from the env buffer and then triggers a return from the most recent setjmp call that initialized env. That setjmp call then returns with the nonzero return value retval.

The setjmp function is called once but returns multiple times: once when setjmp is first called and the calling environment is saved in the env buffer, and once for each corresponding longjmp call. The longjmp function, on the other hand, is called once but never returns. An important application of nonlocal jumps is to permit an immediate return from a deeply nested function call, usually as a result of detecting some error condition. Instead of laboriously unwinding the call stack, we can use a nonlocal jump to return directly to a common, localized error handler.

/* $begin setjmp */
#include "csapp.h"

jmp_buf buf;

int error1 = 0; 
int error2 = 1;

void foo(void), bar(void);

int main() 
{
    switch(setjmp(buf)) {
    case 0:
        foo();
        break;
    case 1:
        printf("Detected an error1 condition in foo\n");
        break;
    case 2:
        printf("Detected an error2 condition in foo\n");
        break;
    default:
        printf("Unknown error condition in foo\n");
    }
    exit(0);
}

/* Deeply nested function foo */
void foo(void) 
{
    if (error1)
        longjmp(buf, 1);
    bar();
}

void bar(void)
{
    if (error2)
        longjmp(buf, 2);
}
/* $end setjmp */

Because longjmp skips all intermediate calls, it can have serious unintended consequences. If resources (memory, network connections, and so on) were allocated in the intermediate function calls with the intention of releasing them at the end of the function, the cleanup code is skipped and the resources leak. Another important application of nonlocal jumps is to let a signal handler branch to a specific code location rather than return to the instruction that was interrupted by the arrival of the signal. For example, we can use sigsetjmp and siglongjmp to implement a soft restart.

/* $begin restart */
#include "csapp.h"

sigjmp_buf buf;

void handler(int sig) 
{
    siglongjmp(buf, 1);
}

int main() 
{
    // The first call returns 0; after a jump back here, it returns nonzero
    if (!sigsetjmp(buf, 1)) {
        Signal(SIGINT, handler);
        Sio_puts("starting\n");
    }
    else
        Sio_puts("restarting\n");

    while(1) {
        Sleep(1);
        Sio_puts("processing...\n");
    }
    exit(0); /* Control never reaches here */
}
/* $end restart */

The exception mechanisms provided by C++ and Java are higher-level, more structured versions of the C setjmp and longjmp functions. You can think of a catch clause inside a try statement as being akin to setjmp. Similarly, a throw statement is akin to longjmp.

The following is an example of try-catch-throw. The program will always print "KeyboardInterrupt".

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

jmp_buf ex_buf__;

#define TRY do{ if(!setjmp(ex_buf__)) {
#define CATCH } else {
#define ETRY } } while(0)
#define THROW longjmp(ex_buf__, 1)

void sigint_handler(int sig) {
  THROW;
}

int main(void) {
  if (signal(SIGINT, sigint_handler) == SIG_ERR) {
    return 0;
  }

  TRY {
    // raise(sig) has the same effect as kill(getpid(), sig)
    raise(SIGINT); 
  } CATCH {
    printf("KeyboardInterrupt");
  }
  ETRY;
  return 0;
}

After the macro is expanded, the code is as follows:

jmp_buf ex_buf__;

void sigint_handler(int sig) {
  longjmp(ex_buf__, 1);
}

int main(void) {
  if (signal(SIGINT, sigint_handler) == ((_crt_signal_t)-1)) {
    return 0;
  }

  do{ if(!_setjmp(ex_buf__)) { {
        raise(SIGINT);
      } } else { {
        printf("KeyboardInterrupt");
      }
    } } while(0);

  return 0;
}

Tools for manipulating processes

The Linux system provides a large number of useful tools for monitoring and operating processes.

strace: Print the trace of each system call called by a running program and its child processes. Such as: strace cat /dev/null
ps: List the processes currently in the system (including zombies)
top: Print information about the resource usage of current processes
pmap: Display the memory map of the process
/proc: A virtual file system that outputs the contents of a large number of kernel data structures in ASCII text format, which can be read by user programs. For example, "cat /proc/loadavg" can see the average load of the current system

Every ordinary anomaly we experience may be a continuous miracle.