
Author | Xiang Sheng

The 2nd Cloud Native Programming Challenge (2021) is now open for registration. The competition is jointly organized by Alibaba Cloud and Intel and co-hosted by Alibaba Cloud Cloud Native and Alibaba Cloud Tianchi. Since 2015 the competition has been held six times; it was upgraded to the Cloud Native Programming Challenge in 2020 and has attracted more than 23,000 teams from over 10 countries and regions.

This competition continues to explore three popular technical fields in depth: RocketMQ, Dubbo3, and Serverless, providing a stage for young people who love technology to take on world-class technical problems. We hope contestants will use technology to create greater value for society as a whole. For the preliminary round, we have prepared three tracks to choose from. Are you ready?


This article focuses on the second track, implementing a flexible cluster scheduling mechanism, in the hope of offering contestants some ideas.

A total of 600,000 in cash prizes will be shared; choose any of the three tracks.

There are also bonus tasks offering additional ways to win prizes. Click to sign up! 👇

https://tianchi.aliyun.com/competition/entrance/531923/introduction?spm=5176.12281925.0.0.58987137KRXtxf

1. Background of the contest question

Cloud native has brought a major shift toward technology standardization. Making applications easier to create and run on the cloud, with the ability to scale elastically, is the core goal of every cloud native infrastructure component. With the elasticity that cloud native technology provides, an application can scale out to a large number of machines in a very short time to support business needs.

For example, to handle midnight flash-sale scenarios or sudden traffic spikes, an application often needs thousands or even tens of thousands of machines to meet user demand. But such expansion also brings the problems of large-scale deployment in cloud native scenarios: with an extremely large number of cluster nodes, node failures occur frequently, and service capacity is affected by many objective factors, so the service capabilities of individual nodes become unequal.

Dubbo aims to solve these problems with a flexible cluster scheduling mechanism. The mechanism addresses two problems: first, when nodes fail, the distributed service should remain stable and avoid avalanches and similar issues; second, large-scale applications should run in their best state and deliver higher throughput and performance.

From the perspective of a single service, Dubbo's goal is to present an unbreakable service to the outside world: when the request volume is particularly high, it can selectively reject some requests to guarantee the correctness and timeliness of the overall business.

From a distributed perspective, the overall performance degradation caused by complex topologies and varying node performance must be minimized. A flexible scheduling mechanism can distribute traffic dynamically in an optimal way, so that a heterogeneous system allocates requests reasonably according to each node's actual runtime capacity and achieves the best overall performance.

2. Problem analysis

Flexible cluster scheduling means that Dubbo can allocate requests sensibly from a global perspective to adapt to the cluster. Concretely, consumers should quickly perceive random changes in the performance of server-side nodes and adjust the proportion of requests sent to each node accordingly, so that even under the problems caused by large-scale cluster deployment, Dubbo can still deliver its best performance.

The flexible scheduling mechanism mainly addresses the following scenarios:

  • Multiple data centers deployed across regions, with severe network packet loss.

As the business grows, it reaches more and more users, and the computing capacity required on the server side keeps increasing. In addition, as applications grow, the number of upstream dependencies in a system split under a microservice architecture also keeps rising. A single data center can provide only limited machine capacity, so whether to accommodate large numbers of machines or to guarantee high availability, multi-site active-active deployment across regions is needed. For the business side, deploying multiple data centers in different locations is becoming increasingly common.

Multi-region deployment raises network issues. For data centers in the same city, leased dark fiber and similar methods can keep the inter-datacenter network stable. But once data centers are deployed in different cities or even different countries, packet loss becomes an increasingly serious problem.

This problem simulates that situation by having the server randomly drop a portion of requests. We hope to see a mechanism that, by learning the expected latency, either returns a failure to the caller promptly or initiates a retry, thereby improving the overall availability of the service.

  • Server processing capacity is limited: the higher the concurrency, the slower the processing.

In typical business scenarios, a single service rarely operates alone; more often it must connect to third-party components such as databases. Many of these components impose an upper limit on total concurrency: once concurrent requests reach a certain level, the remaining requests queue up, increasing overall processing latency. Even from a single-machine point of view, since the number of CPU cores is far smaller than the number of concurrent threads, very high concurrency means more resources are spent on context switching. Moreover, excessive concurrency easily overheats a single service node; once that node is overwhelmed, a service avalanche can spread across the whole cluster.

The figure below compares total service throughput when an exponential function (do not rely on this model in the evaluation) is used to simulate the relationship between concurrency and latency.

[Figure: total throughput vs. concurrency]

As the figure shows, the number of requests handled per unit time does not keep rising with concurrency. In this competition we hope to see a mechanism that automatically finds the optimal concurrency for an upstream service and thereby maximizes the number of successfully served requests per unit time.
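To make the tradeoff concrete, here is a small sketch with purely made-up numbers (the contest explicitly warns that the evaluation does not follow this model): it plugs a hypothetical exponential latency curve into throughput = concurrency / latency and scans for the peak.

```java
// Illustrative only: parameters and the exponential model are assumptions,
// not the contest's evaluation model.
public class ThroughputModel {
    // Assumed latency model: latency(c) = base * exp(k * c), in milliseconds.
    static double latencyMs(int concurrency) {
        double baseMs = 5.0;   // hypothetical base latency
        double k = 0.02;       // hypothetical growth factor
        return baseMs * Math.exp(k * concurrency);
    }

    // Requests per second achievable at a given concurrency level.
    static double throughput(int concurrency) {
        return concurrency / latencyMs(concurrency) * 1000.0;
    }

    // Scan for the concurrency that maximizes throughput.
    static int bestConcurrency(int max) {
        int best = 1;
        for (int c = 2; c <= max; c++) {
            if (throughput(c) > throughput(best)) best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("optimal concurrency = " + bestConcurrency(500));
    }
}
```

Under these assumed parameters the peak sits at a moderate concurrency; beyond it, every additional in-flight request costs more in latency than it adds in throughput.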

  • Host resources are over-committed, so server performance is unstable.

With the advent of the cloud-native era, more and more applications are deployed in containers. Both containers themselves and IaaS facilities such as Alibaba Cloud ECS run in virtualized environments, where the host's resources (including CPU cache and memory bandwidth) are shared. If one application rapidly exhausts the L3 cache, or consumes a large share of the system's memory bandwidth, it interferes with the other "neighbors" running on the same host. It is hard to predict at deployment time which machines will suffer from this, so we expect the RPC framework to adapt to these fluctuations actively, dynamically adjusting the proportion of calls sent to each machine to achieve the highest possible service capacity.
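One way to express "dynamically adjust the proportion of calls between machines" is capacity-weighted random selection. The sketch below is illustrative only (the class and method names are invented, not Dubbo APIs); it assumes each node carries a continuously updated capacity estimate:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical weighted-random selection: route traffic to nodes in
// proportion to their currently estimated capacity, so a node whose
// noisy neighbor is eating its cache automatically receives less load.
public class WeightedPicker {
    // capacities: estimated requests/second each node can sustain right now.
    // Returns the index of the chosen node.
    public static int pick(double[] capacities) {
        double total = 0;
        for (double c : capacities) total += c;
        double r = ThreadLocalRandom.current().nextDouble(total);
        for (int i = 0; i < capacities.length; i++) {
            r -= capacities[i];
            if (r < 0) return i;
        }
        return capacities.length - 1; // floating-point rounding fallback
    }
}
```

Because the capacity estimates are refreshed at runtime, the traffic split tracks each node's real condition instead of its static configured weight.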

3. Problem solving ideas

  • Capacity assessment

The design goal of this problem is an automated, capacity-based scheduling mechanism, so evaluating the optimal service capacity is a prerequisite. The maximum service capacity together with the expected call latency provides a macro data basis for overall traffic scheduling. Cluster capacity is usually determined by actual online load testing; at runtime it can also be computed dynamically from data such as average response time, P999 latency, and error rate. Capacity should be evaluated under as many conditions as possible to avoid getting stuck in a local optimum.

Based on the capacity information of each server, the consumer can control the amount of concurrency it sends to that server, maximizing the total number of successful requests.
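As a sketch of runtime capacity evaluation, the following hypothetical sliding-window collector (names are invented, not part of Dubbo) tracks per-node response times and exposes the mean and an approximate tail quantile such as P999:

```java
import java.util.Arrays;

// Hypothetical per-node response-time window; a consumer would keep one
// of these per server and feed it every completed call.
public class WindowStats {
    private final long[] samples;
    private int count = 0;
    private int next = 0;

    public WindowStats(int windowSize) { samples = new long[windowSize]; }

    // Record one call's response time; old samples are overwritten ring-style.
    public synchronized void record(long rtMillis) {
        samples[next] = rtMillis;
        next = (next + 1) % samples.length;
        if (count < samples.length) count++;
    }

    public synchronized double mean() {
        if (count == 0) return 0;
        long sum = 0;
        for (int i = 0; i < count; i++) sum += samples[i];
        return (double) sum / count;
    }

    // Approximate tail latency from the current window, e.g. quantile(0.999).
    public synchronized long quantile(double q) {
        if (count == 0) return 0;
        long[] copy = Arrays.copyOf(samples, count);
        Arrays.sort(copy);
        int idx = (int) Math.ceil(q * count) - 1;
        idx = Math.min(count - 1, Math.max(idx, 0));
        return copy[idx];
    }
}
```

Mean, tail quantile, and an error counter together give the per-node data the article lists as the basis for capacity estimates.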

  • Fail fast

When a request is dropped by the server or lost in transit, the consumer usually discovers this only after a long time (the configured timeout). For example, if the expected request latency is 10 ms but the consumer waits for a 5000 ms timeout before reporting an error and retrying, a great deal of time is wasted. If fail-fast handling can be applied per interface based on its actual expected latency, that invalid waiting time can be cut dramatically.
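One possible ingredient for fail-fast is an adaptive deadline derived from observed latency rather than a fixed 5000 ms timeout. The sketch below borrows the smoothed-RTT idea from TCP retransmission timers (deadline = EWMA + 4 × smoothed deviation); the class name, constants, and clamping bounds are all illustrative assumptions:

```java
// Hypothetical per-interface adaptive deadline. A request still pending
// past deadlineMs() is failed fast so the caller can retry elsewhere.
public class AdaptiveTimeout {
    private double ewma = -1;      // smoothed latency (ms), -1 = no data yet
    private double ewmaDev = 0;    // smoothed absolute deviation (ms)
    private final double alpha = 0.125, beta = 0.25; // TCP-RTO-style gains
    private final double floorMs, ceilingMs;

    public AdaptiveTimeout(double floorMs, double ceilingMs) {
        this.floorMs = floorMs;
        this.ceilingMs = ceilingMs;
    }

    // Feed the observed response time of every successful call.
    public synchronized void observe(double rtMs) {
        if (ewma < 0) { ewma = rtMs; ewmaDev = rtMs / 2; return; }
        double err = rtMs - ewma;
        ewma += alpha * err;
        ewmaDev += beta * (Math.abs(err) - ewmaDev);
    }

    // Deadline tracks recent latency, clamped to [floorMs, ceilingMs].
    public synchronized double deadlineMs() {
        if (ewma < 0) return ceilingMs;   // no data yet: stay conservative
        double d = ewma + 4 * ewmaDev;
        return Math.min(ceilingMs, Math.max(floorMs, d));
    }
}
```

With a stable 10 ms interface, the deadline converges to a few tens of milliseconds instead of the 5000 ms worst case, so a dropped packet costs milliseconds of waiting rather than seconds.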

  • Automatic detection

Since server performance changes in real time, the concurrency sent to a server cannot be fixed at a single value. It must be probed dynamically within a range, and when a better capacity is found, the call parameters should be adjusted automatically, keeping the system as close to the optimum as possible at all times.
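The dynamic probing described above can be sketched as an AIMD (additive-increase, multiplicative-decrease) adjustment of the per-server concurrency limit: grow the limit while measured latency stays near its best observed value, and back off once latency degrades, which suggests queueing. The class and thresholds below are hypothetical, not a Dubbo API:

```java
// Hypothetical AIMD probe of one server's sweet-spot concurrency.
public class ConcurrencyProbe {
    private int limit = 10;                   // current allowed in-flight requests
    private double bestRt = Double.MAX_VALUE; // best average RT seen so far (ms)
    private final int minLimit = 1, maxLimit = 1000;

    // Called periodically with the average response time of the last interval.
    public synchronized void onSample(double avgRtMs) {
        bestRt = Math.min(bestRt, avgRtMs);
        if (avgRtMs <= bestRt * 1.5) {
            // Latency still healthy: probe upward, one slot at a time.
            limit = Math.min(maxLimit, limit + 1);
        } else {
            // Latency degraded: the server is likely queueing, back off sharply.
            limit = Math.max(minLimit, (int) (limit * 0.8));
        }
    }

    public synchronized int limit() { return limit; }
}
```

The asymmetric step sizes keep the probe gentle on the way up and quick to retreat, so a server whose performance suddenly drops is relieved within a few intervals.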

4. Summary

This article has analyzed the competition problem from the perspectives of its background, problem analysis, and solution ideas, and introduced the basic design ideas of a flexible cluster scheduling algorithm. We hope it helps the students about to compete. We wish all contestants excellent results and advancement to the semi-finals and finals; we will see you at the final defense.

5. Registration method

[Track 1] RocketMQ storage system design for hot and cold read/write scenarios
https://tianchi.aliyun.com/competition/entrance/531922/introduction?spm=5176.12281925.0.0.58987137KRXtxf

[Track 2] Implement a flexible cluster scheduling mechanism
https://tianchi.aliyun.com/competition/entrance/531923/introduction?spm=5176.12281925.0.0.58987137KRXtxf

[Track 3] Less is more: Serverless Innovation Application Competition
https://tianchi.aliyun.com/competition/entrance/531924/introduction?spm=5176.12281925.0.0.58987137KRXtxf

Click 👇 to register now:

https://tianchi.aliyun.com/competition/entrance/531923/introduction?spm=5176.12281925.0.0.58987137KRXtxf

