By Li Xudong (alias: Xiang Xu)

Technical Expert, Ant Group

Works on Ant Group middleware R&D, focusing on the development and optimization of SOFARegistry and its surrounding infrastructure.

This article is 2,098 words, about an 8-minute read.

|Foreword|

MOSN uses the Subset algorithm for label-matching route load balancing. This article introduces how Subset works, the performance bottlenecks that MOSN's Subset implementation ran into in ultra-large-scale clusters, and the algorithm used to optimize it.

First, why optimize Subsets?

In general, performance bottlenecks surface gradually as a cluster grows. In Ant Group's large-scale clusters, the address lists pushed by the registry impose a noticeable overhead on applications.

In one large-scale stress test I took part in, a core application ran on a very large number of machines. Whenever it was released or underwent operations and maintenance, its address list was pushed to every application that calls it.

MOSN receives this address list and rebuilds its own routes. When the address list is very large, the bottleneck in MOSN's cluster update becomes visible: CPU usage spikes, memory grows sharply within a short period, and GC frequency rises significantly.

The flame graphs below show MOSN's pprof output for one application during this stress test:

- Alloc:

[Image: alloc flame graph]

- CPU:

[Image: CPU flame graph]

As pprof shows, the construction of SubsetLoadBalancer accounts for the bulk of both the CPU and the alloc overhead, so optimizing this construction was an urgent priority.

In the end, the optimizations we explored cut CPU overhead by 95% and alloc overhead by 75% during the SubsetLoadBalancer build.

Let's walk through the process and the ideas behind this optimization.

PART. 1--Basic principles of Subset

In a cluster, machines usually carry different labels. How, then, do we route a request to a group of machines with a specified label?

MOSN's approach is to pre-group the machines under a service by their label combinations, forming multiple subsets. At request time, the subset a request should match can be looked up quickly from the metadata in the request.

As shown in the figure below, there are currently 4 nodes:

[Image: the 4 nodes]

The label matching rules route on three fields: zone, mosn_aig, and mosn_version. Combining these three keys in sorted order yields the following matching paths:

[Image: matching paths]

The corresponding matching tree is as follows:

[Image: matching tree]

Suppose a request needs to reach {zone: zone1, mosn_aig: aig1}. After the keys are sorted, the lookup sequence is mosn_aig:aig1 -> zone:zone1, which finds [h1, h2].

The above is the basic principle of Subset.
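The lookup just described can be sketched in a few lines of Go. This is only an illustration, not MOSN's code: the `subsetKey` and `lookup` helpers, the flat map standing in for the matching tree, and the contents of the `zone:zone1` subset are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// subsetKey builds a canonical lookup path from request metadata by sorting
// the keys, so the same metadata always maps to the same path.
// (Hypothetical helper, not a MOSN function.)
func subsetKey(meta map[string]string) string {
	keys := make([]string, 0, len(meta))
	for k := range meta {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+":"+meta[k])
	}
	return strings.Join(parts, " -> ")
}

// lookup returns the pre-grouped hosts for the request metadata, if any.
func lookup(subsets map[string][]string, meta map[string]string) []string {
	return subsets[subsetKey(meta)]
}

func main() {
	// Assumed pre-grouped subsets; a flat map stands in for the matching tree.
	subsets := map[string][]string{
		"zone:zone1":                  {"h1", "h2", "h3"},
		"mosn_aig:aig1 -> zone:zone1": {"h1", "h2"},
	}
	meta := map[string]string{"zone": "zone1", "mosn_aig": "aig1"}
	fmt.Println(subsetKey(meta))       // mosn_aig:aig1 -> zone:zone1
	fmt.Println(lookup(subsets, meta)) // [h1 h2]
}
```

Because all grouping happens ahead of time, the per-request cost is a sort of a handful of keys plus one lookup, independent of the number of hosts.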

PART. 2--MOSN's construction of Subset

The construction takes two inputs:

- A list of labeled machines, hosts, e.g. [h1, h2, h3, h4];

- The subSetKeys used for matching, as shown below:

[Image: subSetKeys]

Let's go over the idea first, then read the source code to see how MOSN's SubsetLoadBalancer builds this tree.

The core idea is as follows:

- Recursively traverse each host's labels together with subSetKeys to create a tree;

- For each node of the tree, traverse the hosts list once, filter out the subHosts that match the node's kvs, and create a child load balancer for the node.

Let's look at the source code:

[Image: source code]

The overall construction complexity is O(M * N * K), where M is the number of Subset tree nodes, N is the number of hosts, and K is the number of matching keys.
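To make the M * N * K cost concrete, here is a minimal sketch of this style of construction. It is a simplification under stated assumptions, not the MOSN source: the `Host` type, `hostMatches`, `nodeKVs`, `buildNaive`, and the sample labels are hypothetical, and a flat map keyed by path replaces the real tree.

```go
package main

import (
	"fmt"
	"strings"
)

// Host is a hypothetical, simplified stand-in for a MOSN host.
type Host struct {
	Addr   string
	Labels map[string]string
}

// hostMatches reports whether h carries every key-value pair in kvs (K checks).
func hostMatches(h Host, kvs map[string]string) bool {
	for k, v := range kvs {
		if got, ok := h.Labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// nodeKVs extracts h's values for keys; ok is false if any key is missing.
func nodeKVs(h Host, keys []string) (kvs map[string]string, path string, ok bool) {
	kvs = make(map[string]string, len(keys))
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		v, found := h.Labels[k]
		if !found {
			return nil, "", false
		}
		kvs[k] = v
		parts = append(parts, k+":"+v)
	}
	return kvs, strings.Join(parts, " -> "), true
}

// buildNaive returns the subsets keyed by path and the number of hostMatches calls.
func buildNaive(hosts []Host, subsetKeys [][]string) (map[string][]Host, int) {
	subsets := map[string][]Host{}
	calls := 0
	for _, keys := range subsetKeys {
		for _, h := range hosts {
			kvs, path, ok := nodeKVs(h, keys)
			if !ok {
				continue
			}
			if _, built := subsets[path]; built {
				continue
			}
			var sub []Host
			for _, cand := range hosts { // full scan of all N hosts for every node
				calls++
				if hostMatches(cand, kvs) {
					sub = append(sub, cand)
				}
			}
			subsets[path] = sub
		}
	}
	return subsets, calls
}

// sampleHosts returns four hosts with assumed labels.
func sampleHosts() []Host {
	return []Host{
		{"h1", map[string]string{"zone": "zone1", "mosn_aig": "aig1"}},
		{"h2", map[string]string{"zone": "zone1", "mosn_aig": "aig1"}},
		{"h3", map[string]string{"zone": "zone1", "mosn_aig": "aig2"}},
		{"h4", map[string]string{"zone": "zone2", "mosn_aig": "aig2"}},
	}
}

func main() {
	subsets, calls := buildNaive(sampleHosts(), [][]string{{"zone"}, {"mosn_aig", "zone"}})
	fmt.Println(len(subsets), calls) // 5 20: five nodes, each scanning four hosts
}
```

Even in this toy case, every node triggers a scan of the whole host list, which is the term that grows painfully when N reaches the thousands.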

PART. 3--Construction performance bottleneck analysis

Analyzing the production profile, we found that SubsetLoadBalancer's createSubsets takes up a large share of both the CPU and the alloc flame graphs. So we wrote a benchmark to optimize this part.

Our input parameters are:

- subSetKeys:

[Image: subSetKeys]

- 8000 hosts (each host has 4 labels, and each label takes one of 3 values):

[Image: hosts]

Next, let's look at the CPU and alloc_space usage.

- CPU:

[Image: CPU flame graph]

- alloc_space:

[Image: alloc_space flame graph]

From the two flame graphs above, we can see that HostMatches and setFinalHost account for a large share of the CPU time and alloc_space. Let's look at HostMatches first:

[Image]

[Image]

It checks whether a host fully matches a given set of key-value pairs, that is, whether the host belongs to a given matching-tree node.

Its overhead comes mainly from the sheer number of calls, treeNodes * len(hosts), so as the cluster grows, the cost here rises significantly.

Next, let's look at setFinalHost:

[Image]

[Image]

Its main logic is to deduplicate hosts by IP, which also involves a copy. If we deduplicate once at the top level of SubsetLoadBalancer, none of its subsets need to be deduplicated again, so the deduplication here can be dropped.
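The reasoning can be illustrated with a small sketch, assuming a hypothetical `Host` type and a `dedupByAddr` helper (neither is MOSN's actual code): once the top-level list has been deduplicated, any subset obtained by filtering that list cannot contain duplicates either.

```go
package main

import "fmt"

// Host is a hypothetical, simplified stand-in for a MOSN host.
type Host struct {
	Addr   string
	Labels map[string]string
}

// dedupByAddr keeps the first host seen for each address and preserves order.
// Running it once on the full list makes per-subset deduplication unnecessary.
func dedupByAddr(hosts []Host) []Host {
	seen := make(map[string]struct{}, len(hosts))
	out := make([]Host, 0, len(hosts))
	for _, h := range hosts {
		if _, dup := seen[h.Addr]; dup {
			continue
		}
		seen[h.Addr] = struct{}{}
		out = append(out, h)
	}
	return out
}

func main() {
	hosts := []Host{
		{Addr: "10.0.0.1:80"},
		{Addr: "10.0.0.2:80"},
		{Addr: "10.0.0.1:80"}, // duplicate address
	}
	unique := dedupByAddr(hosts)
	fmt.Println(len(unique)) // 2
}
```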

PART. 4--Optimizing construction with an inverted index

Among the many HostMatches calls, a lot of work is repeated: the equality check for a given kv in a host's labels, for example, is performed again and again during construction.

The optimization idea is therefore to avoid this repeated work by building an inverted index up front. The steps are as follows:

1. Take two inputs:

- subSetKeys:

[Image: subSetKeys]

- hosts:

[Image: hosts]

2. Traverse the hosts once and build an inverted index that maps each kv to a bitmap of the hosts carrying it:

[Image: inverted index]

3. Build the matching tree from subSetKeys and the kvs in the inverted index. Because the kvs in the index are already deduplicated, this step does not depend on the number of hosts and is very cheap;

4. For each node of the tree, intersect the bitmaps in the inverted index to quickly obtain the index bitmap of the hosts that match all of the node's kvs;

5. Use the indexes stored in the bitmap to pick the corresponding subHosts out of hosts and build a child load balancer. Note that setFinalHosts is no longer needed for deduplication here.
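Steps 2, 4, and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the pull request: the `Host` type, the hand-rolled `bitmap`, `buildIndex`, `subHosts`, and the sample labels are all hypothetical, and the tree construction of step 3 is left out.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Host is a hypothetical, simplified stand-in for a MOSN host.
type Host struct {
	Addr   string
	Labels map[string]string
}

// bitmap is a minimal fixed-size bitset; bit i stands for hosts[i].
type bitmap []uint64

func newBitmap(n int) bitmap { return make(bitmap, (n+63)/64) }

func (b bitmap) set(i int) { b[i/64] |= uint64(1) << (uint(i) % 64) }

// and returns the intersection of two bitmaps of equal length.
func (b bitmap) and(o bitmap) bitmap {
	out := make(bitmap, len(b))
	for i := range b {
		out[i] = b[i] & o[i]
	}
	return out
}

// indexes lists the positions of the set bits in ascending order.
func (b bitmap) indexes() []int {
	var out []int
	for w, word := range b {
		for word != 0 {
			out = append(out, w*64+bits.TrailingZeros64(word))
			word &= word - 1 // clear the lowest set bit
		}
	}
	return out
}

// buildIndex traverses hosts once and records, for every "key:value" pair,
// which host positions carry it (step 2).
func buildIndex(hosts []Host) map[string]bitmap {
	index := map[string]bitmap{}
	for i, h := range hosts {
		for k, v := range h.Labels {
			kv := k + ":" + v
			if index[kv] == nil {
				index[kv] = newBitmap(len(hosts))
			}
			index[kv].set(i)
		}
	}
	return index
}

// subHosts intersects the bitmaps of the given kvs (step 4) and picks the
// matching hosts out of the list (step 5).
func subHosts(hosts []Host, index map[string]bitmap, kvs ...string) []Host {
	if len(kvs) == 0 {
		return nil
	}
	acc, ok := index[kvs[0]]
	if !ok {
		return nil
	}
	for _, kv := range kvs[1:] {
		next, ok := index[kv]
		if !ok {
			return nil
		}
		acc = acc.and(next)
	}
	var out []Host
	for _, i := range acc.indexes() {
		out = append(out, hosts[i])
	}
	return out
}

func main() {
	hosts := []Host{ // assumed labels
		{"h1", map[string]string{"zone": "zone1", "mosn_aig": "aig1"}},
		{"h2", map[string]string{"zone": "zone1", "mosn_aig": "aig1"}},
		{"h3", map[string]string{"zone": "zone1", "mosn_aig": "aig2"}},
		{"h4", map[string]string{"zone": "zone2", "mosn_aig": "aig2"}},
	}
	index := buildIndex(hosts)
	for _, h := range subHosts(hosts, index, "mosn_aig:aig1", "zone:zone1") {
		fmt.Println(h.Addr) // prints h1, then h2
	}
}
```

The label comparisons happen only once, while the index is built. After that, each tree node costs a few word-wise AND operations over the bitmaps instead of a full scan of the hosts with a per-host label check.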

Following this line of thinking, we developed a new Subset preIndex construction algorithm. The details are in the corresponding MOSN pull request:

https://github.com/mosn/mosn/pull/2010

The benchmark added for testing is here:

https://github.com/mosn/mosn/blob/b0da8a69137cea3a60cdc6dfc0784d29c4c2e25a/pkg/upstream/cluster/subset_loadbalancer_test.go#L891

[Image]

[Image]

The new construction is 20 times faster than the previous one, and alloc_space drops by 75%. The number of allocs rises slightly, because an extra inverted index has to be built.

Now let's look at GC:

[Image]

Compared with the previous construction method, memory usage during the run is smaller and less memory has to be reclaimed. The duration of GC's parallel scan increases slightly, while the STW time becomes shorter.

Finally, we tested the improvement under different numbers of hosts. When the number of hosts is large (>100), the new construction algorithm is significantly better than the old one.

[Image]

PART. 5--Summary

In large-scale production environments, performance bottlenecks that previously went unnoticed are often exposed, but stress testing lets us find and fix them in advance.

This build algorithm has now been merged into MOSN master as MOSN's default SubsetLoadBalancer build method.

In this optimization we used some common techniques, such as inverted indexes and bitmaps. Common as they are, they delivered excellent results, and I hope they prove helpful to you.

Learn more

MOSN Star ✨:

https://github.com/mosn/mosn


