Lead
The previous article, "Mixed Deployment: On CPU Isolation in Cloud Native Resource Isolation Technology (1)", introduced the core technology behind CPU resource isolation in cloud native co-location scenarios: the kernel scheduler. This series, "Linux Kernel Scheduler Source Code Analysis", analyzes the specific principles and implementation of kernel scheduling from the source code perspective.
The scheduler subsystem is one of the core subsystems of the kernel. It is responsible for allocating CPU resources in the system sensibly: it has to handle the scheduling needs of complex and diverse types of tasks, cope with all kinds of intricate concurrency, and at the same time balance overall throughput against real-time responsiveness (two inherently conflicting goals), which makes its design and implementation extremely challenging.
To understand the design and implementation of the Linux scheduler, we take Linux kernel version 5.4 (the default kernel version of TencentOS Server3) as the subject, start from the initialization code of the scheduler subsystem, and analyze the design and implementation of the Linux kernel scheduler.
This article (and the series) analyzes the design and implementation of the Linux scheduler (focusing on CFS), with the aim of helping readers understand:
- Basic concepts of the scheduler
- Initialization of the scheduler (including the parts related to the scheduling domain)
- How processes are created, executed and destroyed
- Principle and implementation of process switching
- The CFS scheduling policy (single core)
- How the scheduler keeps CPU resources reasonably utilized across the whole system
- How to balance CPU cache affinity against CPU load
- Analysis of some special scheduler features
Basic concepts of scheduler
Before analyzing the scheduler code, we need to understand the core data structures involved in the scheduler and their roles.
Run queue (rq)
The kernel creates a run queue for each CPU. Ready tasks (tasks in the TASK_RUNNING state) in the system are organized on these per-CPU run queues, and the scheduler then picks tasks from the run queue and dispatches them to the CPU according to the corresponding policy.
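For orientation, here is a heavily abridged sketch of the per-CPU run queue structure (the real definition in kernel/sched/sched.h of 5.4 has many more fields; only a few representative ones are kept here):

/*
 * Heavily abridged sketch of struct rq (see kernel/sched/sched.h in 5.4
 * for the full definition).
 */
struct rq {
        raw_spinlock_t        lock;        /* protects this run queue */
        unsigned int          nr_running;  /* number of runnable tasks on this CPU */

        struct cfs_rq         cfs;         /* per-class sub run queues */
        struct rt_rq          rt;
        struct dl_rq          dl;

        struct task_struct    *curr;       /* task currently running on this CPU */
        struct task_struct    *idle;       /* this CPU's idle (swapper) task */
        struct task_struct    *stop;       /* this CPU's migration/stop task */

        u64                   clock;       /* run queue clock */

        struct root_domain    *rd;         /* root domain this rq is attached to */
        struct sched_domain   *sd;         /* lowest-level sched domain of this CPU */
        int                   cpu;
};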
Scheduling class (sched_class)
The kernel abstracts the scheduling policy into a scheduling class (sched_class). Through scheduling classes, the common scheduler code (the mechanism) is fully decoupled from the policies provided by the specific scheduling classes, a typical OO (object-oriented) design. This makes the kernel scheduler highly extensible: developers can add a new scheduling class with very little code (basically without touching the common code) and thereby implement a brand new scheduler (class). For example, when the deadline scheduling class was added in 3.x, at the code level it only required implementing the functions of the dl_sched_class structure, which made it easy to add a new real-time scheduling type.
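An abridged view of the sched_class interface (hooks and exact signatures per kernel/sched/sched.h in 5.4; most hooks are omitted here). Each scheduling class fills in these callbacks, and the core scheduler only calls through this interface:

struct sched_class {
        const struct sched_class *next;   /* next lower-priority class (5.4 still links classes via ->next) */

        void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
        void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
        void (*yield_task)(struct rq *rq);

        void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

        struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev,
                                              struct rq_flags *rf);

        void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
        /* ... many more hooks: select_task_rq, put_prev_task, task_fork, switched_to, ... */
};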
The current 5.4 kernel has 5 scheduling classes, with their priorities ordered as follows:
stop_sched_class:
The scheduling class with the highest priority. Like idle_sched_class, it is a dedicated scheduling class (apart from the migration thread, no task can or should be set to the stop scheduling class). It is used to implement "urgent" work that depends on the migration thread, such as active balance or stop machine.
dl_sched_class:
The priority of the deadline scheduling class is second only to the stop scheduling class. It is a real-time scheduler (scheduling policy) based on the EDF (Earliest Deadline First) algorithm.
rt_sched_class:
The priority of the rt scheduling class is lower than that of the dl scheduling class. It is a priority-based real-time scheduler.
fair_sched_class:
The priority of the CFS scheduler is lower than that of the three classes above. It is a scheduling class designed around the idea of completely fair scheduling and is the default scheduling class of the Linux kernel.
idle_sched_class:
The idle scheduling class serves the swapper (idle) thread. Its main purpose is to let the swapper thread take over the CPU and put the CPU into a power-saving state through frameworks such as cpuidle/nohz.
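In 5.4 this priority order is still encoded as a simple linked list of classes, which pick_next_task() walks from the highest class downwards; a condensed sketch from kernel/sched/sched.h (SMP configuration assumed):

/*
 * Highest-priority class first; each class's ->next points at the next lower
 * one: stop -> dl -> rt -> fair -> idle (without CONFIG_SMP the list starts
 * at dl_sched_class, since the stop class only exists on SMP).
 */
#define sched_class_highest     (&stop_sched_class)

#define for_each_class(class) \
        for (class = sched_class_highest; class; class = class->next)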
Scheduling domain (sched_domain)
The scheduling domain was introduced into the kernel in 2.6. With multi-level scheduling domains, the scheduler can better adapt to the physical characteristics of the hardware (multi-level CPU caches and the challenges that NUMA brings to load balancing) and achieve better scheduling performance (sched_domain is a mechanism developed for CFS load balancing).
Scheduling group (sched_group)
The scheduling group was introduced into the kernel together with the scheduling domain. It cooperates with the scheduling domain to help the CFS scheduler perform load balancing across cores.
Root domain (root_domain)
The root domain is a data structure designed mainly for load balancing of the real-time scheduling classes (dl and rt); it helps the dl and rt scheduling classes place real-time tasks sensibly across CPUs. When the scheduling domains are not modified through isolcpus or cpuset cgroups, all CPUs are in the same root domain by default.
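An abridged sketch of struct root_domain (kernel/sched/sched.h, 5.4; many fields omitted); the fields hint at its role of letting the dl/rt schedulers make global push/pull decisions across all CPUs in the domain:

struct root_domain {
        atomic_t              refcount;
        cpumask_var_t         span;        /* all CPUs covered by this root domain */
        cpumask_var_t         online;

        /* deadline scheduling: global bandwidth and "which CPU has the earliest deadline" */
        cpumask_var_t         dlo_mask;
        struct dl_bw          dl_bw;
        struct cpudl          cpudl;

        /* rt scheduling: overloaded CPUs and per-CPU highest rt priority */
        cpumask_var_t         rto_mask;
        struct cpupri         cpupri;

        unsigned long         max_cpu_capacity;
        /* ... */
};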
Group scheduling (group_sched)
To control system resources at a finer granularity, the kernel introduces the cgroup mechanism, and group_sched is the underlying implementation of the cpu cgroup. Through the cpu cgroup we can put a set of processes into a cgroup and configure the corresponding bandwidth, shares and other parameters through the cpu cgroup control interface, so that CPU resources can be controlled finely on a per-group basis.
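For orientation, an abridged sketch of struct task_group (kernel/sched/sched.h, 5.4; config guards and many fields omitted), the structure behind a cpu cgroup. Note the per-CPU arrays of scheduling entities and cfs_rq pointers, which sched_init() below links to each CPU's rq:

struct task_group {
        struct cgroup_subsys_state css;

        /* CFS group scheduling */
        struct sched_entity     **se;            /* one sched_entity per CPU, enqueued in the parent's cfs_rq */
        struct cfs_rq           **cfs_rq;        /* one cfs_rq per CPU, holding this group's tasks */
        unsigned long           shares;          /* cpu.shares */
        struct cfs_bandwidth    cfs_bandwidth;   /* cpu.cfs_quota_us / cpu.cfs_period_us */

        /* RT group scheduling (RT_GROUP_SCHED) */
        struct sched_rt_entity  **rt_se;
        struct rt_rq            **rt_rq;
        struct rt_bandwidth     rt_bandwidth;

        struct task_group       *parent;         /* cgroup hierarchy: root_task_group is the ancestor of all */
        /* ... */
};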
Scheduler initialization (sched_init)
Now let's get to the main topic and start analyzing the initialization of the kernel scheduler. Hopefully, after this analysis, everyone will understand:
1. How is the run queue initialized?
2. How is group scheduling associated with the rq? (Only after this association can group scheduling take effect through group_sched.)
3. How is the CFS soft interrupt SCHED_SOFTIRQ registered?
Scheduling initialization (sched_init)
start_kernel
|----setup_arch
|----build_all_zonelists
|----mm_init
|----sched_init Scheduling initialization
Scheduling initialization happens relatively late in start_kernel; by then memory initialization is already done, so memory allocation functions such as kzalloc can already be called in sched_init.
sched_init initializes, for each CPU, its run queue (rq), the global default dl/rt bandwidth, the per-class run queues on the rq, and registers the CFS soft interrupt.
Next we look at the specific implementation of sched_init (part of the code is omitted):
void __init sched_init(void)
{
unsigned long ptr = 0;
int i;
/*
 * Initialize the global default rt and dl CPU bandwidth control structures
 *
 * def_rt_bandwidth and def_dl_bandwidth here limit the global CPU bandwidth used
 * by DL and RT tasks, preventing real-time tasks from using too much CPU and
 * starving ordinary CFS tasks
*/
init_rt_bandwidth(&def_rt_bandwidth, global_rt_period(), global_rt_runtime());
init_dl_bandwidth(&def_dl_bandwidth, global_rt_period(), global_rt_runtime());
#ifdef CONFIG_SMP
/*
 * Initialize the default root domain
 *
 * The root domain is the key data structure used by the dl/rt real-time classes
 * for global balancing. Taking rt as an example, root_domain->cpupri records the
 * highest priority of the RT tasks running on each CPU within this root domain,
 * as well as how tasks of each priority are distributed across the CPUs. With
 * the cpupri data, at rt enqueue/dequeue time the rt scheduler can use this
 * distribution to make sure higher-priority tasks get to run first
*/
init_defrootdomain();
#endif
#ifdef CONFIG_RT_GROUP_SCHED
/*
 * If the kernel supports rt group scheduling (RT_GROUP_SCHED), the CPU bandwidth
 * of RT tasks can be controlled at cgroup granularity, i.e. per group
 *
 * RT_GROUP_SCHED lets rt tasks be bandwidth-controlled as a whole per cpu cgroup,
 * which gives RT bandwidth control much more flexibility (without RT_GROUP_SCHED
 * only the global RT bandwidth can be limited; the bandwidth of a particular
 * group of RT tasks cannot be controlled individually)
*/
init_rt_bandwidth(&root_task_group.rt_bandwidth,
global_rt_period(), global_rt_runtime());
#endif /* CONFIG_RT_GROUP_SCHED */
/* Initialize the run queue of every CPU */
for_each_possible_cpu(i) {
struct rq *rq;
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
/*
 * Initialize the cfs/rt/dl run queues on this rq
 * Each scheduling class has its own run queue inside the rq, and each class
 * manages its own tasks. In pick_next_task(), the kernel picks a task by going
 * through the scheduling classes in priority order, from high to low, which
 * guarantees that tasks of higher-priority classes get to run first
 *
 * stop and idle are special scheduling classes designed for dedicated purposes;
 * users are not allowed to create tasks of these types, so the kernel does not
 * put corresponding run queues in the rq for them either
*/
init_cfs_rq(&rq->cfs);
init_rt_rq(&rq->rt);
init_dl_rq(&rq->dl);
#ifdef CONFIG_FAIR_GROUP_SCHED
/*
 * CFS group scheduling (group_sched): CFS can be controlled through the cpu cgroup.
 * cpu.shares provides proportional CPU control between groups (different cgroups
 * share the CPU according to their weights), and cpu.cfs_quota_us sets a quota
 * (similar to the RT bandwidth control). CFS group_sched bandwidth control is one
 * of the fundamental underlying technologies of containers
 *
 * root_task_group is the default root task_group; every other cpu cgroup has it
 * as parent or ancestor. The initialization here links root_task_group with the
 * rq's cfs run queue, and it does so in an interesting way: it directly sets
 * root_task_group->cfs_rq[cpu] = &rq->cfs, so that the sched_entities of tasks
 * or child cgroup tgs under the cpu cgroup root are enqueued straight into the
 * rq->cfs queue, saving one level of lookup overhead.
*/
root_task_group.shares = ROOT_TASK_GROUP_LOAD;
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
#endif /* CONFIG_FAIR_GROUP_SCHED */
rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
#ifdef CONFIG_RT_GROUP_SCHED
/* Initialize the rt run queue on the rq, similar to the CFS group scheduling initialization above */
init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
#endif
#ifdef CONFIG_SMP
/*
 * Attach the rq to the default def_root_domain. On an SMP system, the kernel
 * will later create a new root_domain in sched_init_smp() and replace the
 * def_root_domain attached here
*/
rq_attach_root(rq, &def_root_domain);
#endif /* CONFIG_SMP */
}
/*
 * Register the SCHED_SOFTIRQ soft interrupt service function for CFS
 * This softirq mainly serves periodic load balancing and nohz idle load balancing
*/
init_sched_fair_class();
scheduler_running = 1;
}
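For reference, the CFS soft interrupt registration mentioned above is essentially the following (abridged from kernel/sched/fair.c in 5.4):

__init void init_sched_fair_class(void)
{
#ifdef CONFIG_SMP
        /* periodic / nohz-idle load balancing runs in this softirq */
        open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);

#ifdef CONFIG_NO_HZ_COMMON
        nohz.next_balance = jiffies;
        nohz.next_blocked = jiffies;
        zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
#endif
#endif /* SMP */
}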
Multi-core scheduling initialization (sched_init_smp)
start_kernel
|----rest_init
|----kernel_init
|----kernel_init_freeable
|----smp_init
|----sched_init_smp
|---- sched_init_numa
|---- sched_init_domains
|---- build_sched_domains
Multi-core scheduling initialization mainly completes the initialization of the scheduling domains and scheduling groups (the root domain is set up here as well, but its initialization is relatively simple).
Linux is an operating system that runs on many chip architectures and memory architectures (UMA/NUMA), so it has to adapt to many different physical structures, which makes the design and implementation of its scheduling domains relatively complex.
Principles of Scheduling Domain Implementation
Before walking through the scheduling domain initialization code, we need to understand the relationship between the scheduling domain and the physical topology, because the design of the scheduling domain closely follows the physical topology; without understanding the physical topology, there is no way to truly understand the implementation of the scheduling domain.
The physical topology of the CPU
Assume a computer system like the following (similar to an Intel chip, with the number of CPU cores reduced for convenience):
This is a dual-socket system in which each socket has 2 cores and 4 threads, so the machine is a 4-core, 8-thread NUMA system (the above is just Intel's physical topology; the AMD ZEN architecture uses a chiplet design and has an extra DIE domain between the MC and NUMA domains).
The first layer (SMT domain):
As with CORE0 in the figure above, the 2 hyper-threads form an SMT domain. On Intel CPUs, hyper-threads share the L1 and L2 caches (and even, to some extent, the store buffer), so migrating tasks between the CPUs of an SMT domain loses no cache heat.
The second layer (MC domain):
As shown above, CORE0 and CORE1 are in the same SOCKET and belong to the MC domain. Intel CPUs generally share the LLC (usually L3) at this level, so although migrating a process within this domain loses the L1 and L2 cache heat, the L3 cache heat is preserved.
The third layer (NUMA domain):
As shown in the figure above, migrating a process between SOCKET0 and SOCKET1 loses all cache heat and carries a large overhead, so migration across the NUMA domain needs to be relatively cautious.
It is precisely because of these hardware characteristics (cache heat at different levels, NUMA access latency and other hardware factors) that the kernel abstracts sched_domain and sched_group to represent them. When load balancing, the kernel applies different policies according to the characteristics of each scheduling domain (such as the balancing frequency, the imbalance factor and the wake-up core selection logic), striking a better balance between CPU load and cache affinity.
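An abridged look at the two structures that encode this topology (kernel/sched/sched.h, 5.4; most fields omitted). Each CPU has a chain of sched_domains from the lowest (SMT/MC) level up to NUMA, and each domain carries a circular list of sched_groups it balances between:

struct sched_domain {
        struct sched_domain   *parent;        /* upper-level domain (e.g. MC -> NUMA), NULL at the top */
        struct sched_domain   *child;         /* lower-level domain, NULL at the bottom */
        struct sched_group    *groups;        /* circular list of groups to balance between */

        unsigned long         min_interval;   /* balancing interval bounds (ms) */
        unsigned long         max_interval;
        unsigned int          busy_factor;    /* balance less frequently when busy */
        unsigned int          imbalance_pct;  /* tolerated imbalance before migrating */
        int                   flags;          /* SD_SHARE_PKG_RESOURCES, SD_NUMA, ... */
        int                   level;

        unsigned long         span[];         /* CPUs covered by this domain */
};

struct sched_group {
        struct sched_group            *next;         /* circular list inside the parent domain */
        unsigned int                  group_weight;  /* number of CPUs in this group */
        struct sched_group_capacity   *sgc;          /* CFS capacity available to this group */
        unsigned long                 cpumask[];     /* CPUs in this group */
};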
Implementation of scheduling domain
Next, we can see how the kernel builds the scheduling domain and scheduling group on the above physical topology.
The kernel builds a scheduling domain for each level of the physical topology and then builds the corresponding scheduling groups within each level. When a scheduling domain does load balancing, it finds the busiest sg (sched_group) with the heaviest load in that domain and then checks whether the load of the busiest sg and the local sg (the scheduling group containing the current CPU) is unbalanced. If it is, the busiest CPU is picked from the busiest sg and load is balanced between the two CPUs.
The SMT domain is the lowest-level scheduling domain; each hyper-thread pair forms an SMT domain. An SMT domain has 2 sched_groups, each containing a single CPU, so load balancing in the SMT domain is simply process migration between hyper-threads; this balancing has the shortest period and the loosest conditions.
On architectures without hyper-threading (or where hyper-threading is disabled in the chip), the lowest-level domain is the MC domain (in that case there are only two levels, MC and NUMA). Each CORE in the MC domain then forms its own sched_group, and the kernel adapts to such a scenario just as well.
The MC domain consists of all the CPUs on a socket, and each sg consists of all the CPUs of one child SMT domain. So in the figure above, an sg of the MC domain contains 2 CPUs. The kernel designs the MC domain this way so that the CFS scheduling class balances between the sgs of the MC domain during wake-up load balancing and idle load balancing.
This design matters a lot for hyper-threading, and we have observed its effect in real workloads. For example, we once had a codec service that showed good benchmark numbers on some virtual machines but poor numbers on others. Analysis showed that the difference came from whether the hyper-threading information was passed through to the virtual machine. When the hyper-threading information is passed through, the virtual machine builds a two-level scheduling domain (SMT and MC), and during wake-up load balancing CFS tends to schedule the service onto an idle sg (that is, an idle physical CORE, not merely an idle CPU). As long as CPU utilization is not too high (below roughly 40%), the service can fully exploit the performance of the physical COREs (the old story again: when both hyper-threads of a physical CORE run CPU-bound work at the same time, the combined gain is only about 1.2 times that of a single thread) and therefore achieves a better performance gain. If the hyper-threading information is not passed through, the virtual machine sees only a single level of physical topology (the MC domain), and the service is likely to be scheduled onto both hyper-threads of the same physical CORE, so the system cannot fully use the physical COREs and the performance of the service suffers.
The NUMA domain consists of all CPUs in the system, and all CPUs on one SOCKET form one sg, so the NUMA domain in the figure above has 2 sgs. Migrating a process across NUMA requires a large imbalance between the NUMA sgs (and the imbalance here is at sg granularity, i.e. the sum of all CPU loads in one sg must be unbalanced against the other sg), because cross-NUMA migration loses all L1, L2 and L3 cache heat and may cause more cross-NUMA memory accesses, so it has to be done carefully.
As the description above shows, by combining sched_domain and sched_group the kernel can adapt to all kinds of physical topologies (with or without hyper-threading, with or without NUMA) and use CPU resources efficiently.
smp_init
/*
* Called by boot processor to activate the rest.
*
 * On an SMP system, the BSP (boot processor) needs to bring up all the other non-boot CPUs
*/
void __init smp_init(void)
{
int num_nodes, num_cpus;
unsigned int cpu;
/* Create an idle thread for every CPU */
idle_threads_init();
/* Register the cpuhp (CPU hotplug) threads with the kernel */
cpuhp_threads_init();
pr_info("Bringing up secondary CPUs ...\n");
/*
* FIXME: This should be done in userspace --RR
*
 * If a CPU is not yet online, bring it up with cpu_up
*/
for_each_present_cpu(cpu) {
if (num_online_cpus() >= setup_max_cpus)
break;
if (!cpu_online(cpu))
cpu_up(cpu);
}
.............
}
Before sched_init_smp actually starts initializing the scheduling domains, all non-boot CPUs need to be brought up to make sure they are ready; only then can the multi-core scheduling domain initialization begin.
sched_init_smp
Now let's look at the implementation of the multi-core scheduling initialization (if CONFIG_SMP is not set, the code discussed here is not executed).
sched_init_numa
sched_init_numa() detects whether the system is a NUMA system; if it is, NUMA domains need to be added dynamically.
/*
* Topology list, bottom-up.
*
 * Linux's default physical topology
 *
 * Only three levels of physical topology are listed here; the NUMA domain is
 * detected automatically in sched_init_numa(), and if NUMA domains exist the
 * corresponding NUMA scheduling domains are added
 *
 * Note: the default_topology here may have some issues, e.g. on platforms with
 * no real DIE domain (such as Intel) the LLC and DIE domains may overlap. For
 * that reason, after the scheduling domains are built the kernel scans all of
 * them in cpu_attach_domain(), and if overlapping domains exist it removes the
 * redundant ones with destroy_sched_domain()
*/
static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
{ NULL, },
};
Linux default physical topology
/*
 * NUMA scheduling domain initialization (builds a new sched_domain_topology
 * physical topology based on the hardware information)
 *
 * By default the kernel does not add a NUMA topology on its own; it depends on
 * the configuration (whether NUMA is enabled). If NUMA is enabled, the hardware
 * topology information is used here to decide whether extra
 * sched_domain_topology_level entries need to be added (only after these levels
 * are added will the kernel create NUMA domains when it later initializes the
 * sched_domains)
*/
void sched_init_numa(void)
{
...................
/*
 * Based on the NUMA distances, check whether NUMA domains exist (possibly even
 * multiple NUMA levels) and, if so, update the physical topology accordingly.
 * When the scheduling domains are built later, this new physical topology is
 * used to create the new domains
*/
for (j = 1; j < level; i++, j++) {
tl[i] = (struct sched_domain_topology_level){
.mask = sd_numa_mask,
.sd_flags = cpu_numa_flags,
.flags = SDTL_OVERLAP,
.numa_level = j,
SD_INIT_NAME(NUMA)
};
}
sched_domain_topology = tl;
sched_domains_numa_levels = level;
sched_max_numa_distance = sched_domains_numa_distance[level - 1];
init_numa_topology_type();
}
Check the physical topology of the system; if NUMA domains exist they are added to sched_domain_topology, and the scheduling domains are later built according to this sched_domain_topology.
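As an illustration (hypothetical; the exact number of NUMA levels depends on the distance table reported by the firmware), the topology array after sched_init_numa() on a two-socket machine would conceptually look like this:

/*
 * Conceptual sched_domain_topology after sched_init_numa() on a
 * 2-socket NUMA machine (illustrative only):
 *
 *   SMT  : cpu_smt_mask        - hyper-thread siblings
 *   MC   : cpu_coregroup_mask  - cores sharing the LLC
 *   DIE  : cpu_cpu_mask        - all CPUs in the package
 *   NUMA : sd_numa_mask        - one level per NUMA distance (numa_level = 1..n)
 */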
sched_init_domains
Next, let's analyze sched_init_domains, the function that builds the scheduling domains.
/*
* Set up scheduler domains and groups. For now this just excludes isolated
* CPUs, but could be used to exclude other special cases in the future.
*/
int sched_init_domains(const struct cpumask *cpu_map)
{
int err;
zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
arch_update_cpu_topology();
ndoms_cur = 1;
doms_cur = alloc_sched_domains(ndoms_cur);
if (!doms_cur)
doms_cur = &fallback_doms;
/*
 * doms_cur[0] is the cpumask the scheduling domains need to cover
 *
 * If some CPUs are isolated with isolcpus=, they are not added to any
 * scheduling domain, i.e. they do not take part in load balancing (this
 * includes DL/RT as well as CFS). Here cpu_map & housekeeping_cpumask(HK_FLAG_DOMAIN)
 * is used to strip out the isolated CPUs, guaranteeing that the scheduling
 * domains built below contain no isolated CPU
*/
cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_FLAG_DOMAIN));
/* The function that actually builds the scheduling domains */
err = build_sched_domains(doms_cur[0], NULL);
register_sched_domain_sysctl();
return err;
}
/*
* Build sched domains for a given set of CPUs and attach the sched domains
* to the individual CPUs
*/
static int
build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
{
enum s_alloc alloc_state = sa_none;
struct sched_domain *sd;
struct s_data d;
struct rq *rq = NULL;
int i, ret = -ENOMEM;
struct sched_domain_topology_level *tl_asym;
bool has_asym = false;
if (WARN_ON(cpumask_empty(cpu_map)))
goto error;
/*
 * The vast majority of tasks in Linux use the CFS scheduling class, so the
 * sched_domains used by CFS are accessed and modified very frequently (e.g.
 * nohz_idle and the various statistics inside sched_domain). Efficiency is
 * therefore a primary concern in the sched_domain design, so the kernel uses
 * percpu allocations: each level of sd is a separately allocated percpu
 * variable, which resolves the concurrency between CPUs by construction
 * (1. no lock protection needed  2. no cacheline false sharing)
*/
alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
if (alloc_state != sa_rootdomain)
goto error;
tl_asym = asym_cpu_capacity_level(cpu_map);
/*
* Set up domains for CPUs specified by the cpu_map:
*
 * Iterate over all CPUs in cpu_map and build, for each of them, the multi-level
 * scheduling domains corresponding to the physical topology (for_each_sd_topology).
 *
 * While a domain level is being built, tl->mask(cpu) yields the span of the CPU
 * at that level (i.e. which CPUs together with this one make up the domain).
 * The sds of CPUs within the same domain are initially initialized identically
 * (including parameters such as sd->span, sd->imbalance_pct and sd->flags).
*/
for_each_cpu(i, cpu_map) {
struct sched_domain_topology_level *tl;
sd = NULL;
for_each_sd_topology(tl) {
int dflags = 0;
if (tl == tl_asym) {
dflags |= SD_ASYM_CPUCAPACITY;
has_asym = true;
}
sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
if (tl == sched_domain_topology)
*per_cpu_ptr(d.sd, i) = sd;
if (tl->flags & SDTL_OVERLAP)
sd->flags |= SD_OVERLAP;
if (cpumask_equal(cpu_map, sched_domain_span(sd)))
break;
}
}
/*
* Build the groups for the domains
*
 * Create the scheduling groups
 *
 * The role of sched_group can be seen from the implementation of two domains:
 * 1. the NUMA domain  2. the LLC (MC) domain
 *
 * The NUMA sched_domain->span contains all CPUs of the NUMA domain. When
 * balancing is needed, the NUMA domain should not balance in units of CPUs but
 * in units of sockets, i.e. CPUs are only migrated between socket1 and socket2
 * when the two are extremely unbalanced. Implementing this abstraction with
 * sched_domain alone would not be flexible enough (as the MC domain below
 * shows), so the kernel uses a sched_group to represent a set of CPUs, with
 * each socket belonging to one sched_group; migration is only allowed when
 * these two sched_groups become unbalanced
 *
 * The MC domain is similar. CPUs may be hyper-threads, and a hyper-thread does
 * not perform like a physical core: a hyper-thread pair delivers only roughly
 * 1.2 times the performance of a single physical core. So when scheduling we
 * need to consider the balance between hyper-thread pairs, i.e. balance across
 * physical cores first, and only then balance between the hyper-threads within
 * a core. Here a sched_group represents one physical core (2 hyper-threads),
 * and the LLC domain guarantees balance across cores, avoiding the extreme
 * case where the hyper-threads are balanced but the physical cores are not. It
 * also guarantees that, when selecting a CPU, the kernel prefers idle physical
 * cores and only falls back to the sibling hyper-thread once the physical
 * cores are used up, letting the system make fuller use of the CPU capacity
*/
for_each_cpu(i, cpu_map) {
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
sd->span_weight = cpumask_weight(sched_domain_span(sd));
if (sd->flags & SD_OVERLAP) {
if (build_overlap_sched_groups(sd, i))
goto error;
} else {
if (build_sched_groups(sd, i))
goto error;
}
}
}
/*
* Calculate CPU capacity for physical packages and nodes
*
 * sched_group_capacity represents the CPU capacity available to an sg
 *
 * sched_group_capacity is the capacity left for the CFS tasks of the sg after
 * accounting for the fact that CPUs differ in raw capacity (different maximum
 * frequencies, ARM big.LITTLE, etc.) and after removing the capacity consumed
 * by DL/RT tasks on the CPU (the sg is built for CFS, so the capacity used by
 * DL/RT tasks must be subtracted). When load balancing, not only the load on
 * the CPUs but also the CFS capacity available on the sg should be considered:
 * if an sg has few tasks but its sched_group_capacity is also small, tasks
 * should not be migrated to it either
*/
for (i = nr_cpumask_bits-1; i >= 0; i--) {
if (!cpumask_test_cpu(i, cpu_map))
continue;
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
claim_allocations(i, sd);
init_sched_groups_capacity(i, sd);
}
}
/* Attach the domains */
rcu_read_lock();
/*
 * Attach each CPU's rq to the rd (root_domain), and check whether any sd levels
 * overlap; overlapping levels are removed with destroy_sched_domain() (this is
 * why Intel servers end up with only 3 levels of scheduling domains: the DIE
 * domain actually overlaps with the LLC domain and is removed here)
*/
for_each_cpu(i, cpu_map) {
rq = cpu_rq(i);
sd = *per_cpu_ptr(d.sd, i);
/* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
cpu_attach_domain(sd, d.rd, i);
}
rcu_read_unlock();
if (has_asym)
static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
if (rq && sched_debug_enabled) {
pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
}
ret = 0;
error:
__free_domain_allocs(&d, alloc_state, cpu_map);
return ret;
}
At this point we have built the kernel's scheduling domains, and CFS can use the sched_domains to perform load balancing across cores.
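As a closing illustration, here is a minimal, hypothetical helper (not kernel code, just a sketch assuming an SMP kernel) that walks a CPU's domain chain built above from the lowest level upwards; the real consumers are the CFS balancing paths such as rebalance_domains() and select_task_rq_fair() in kernel/sched/fair.c:

/*
 * Hypothetical helper: walk CPU @cpu's sched_domain chain from the lowest
 * level (SMT/MC) up to NUMA, printing how many CPUs each level spans.
 */
static void dump_domains_of_cpu(int cpu)
{
        struct sched_domain *sd;

        rcu_read_lock();
        for_each_domain(cpu, sd)        /* sd = rq->sd, then sd = sd->parent, ... */
                pr_info("cpu%d: domain spans %u CPUs (flags 0x%x)\n",
                        cpu, sd->span_weight, sd->flags);
        rcu_read_unlock();
}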
Conclusion
This article introduced the basic concepts of the kernel scheduler and, by analyzing the scheduler initialization code in the 5.4 kernel, explained how basic concepts such as the scheduling domain and scheduling group are actually implemented. Overall, compared with the 3.x kernels, the 5.4 kernel shows no essential change in the scheduler initialization logic or in the basic design (concepts and key structures) of the scheduler, which indirectly confirms the "stability" and "elegance" of the kernel scheduler design.
Preview: the next article in this series will dive into the basic principles, the overall framework and the related source code of the Linux kernel scheduler; stay tuned.