Author

Chang Yaoguo is an SRE expert at Tencent, currently working in the PCG Big Data Platform Department and responsible for the cloud migration, monitoring, and automation of services handling tens of millions of QPS.

Background

BeaconLogServer is the entry point for data reported by the Beacon SDK. It receives reports from many businesses, including Weishi, QQ, Tencent Video, QQ Browser, and the App Store, and therefore faces high concurrency, large requests, and sudden traffic spikes, with QPS exceeding tens of millions. Keeping the service's capacity at a safe level has traditionally required a lot of manual effort. How to use the cloud to achieve zero-manpower operation and maintenance is the focus of this article.

Hybrid cloud elastic scaling

The overall effect of elastic scaling

First, let's talk about automatic scale-out and scale-in. The figure below shows the overall design of BeaconLogServer's hybrid cloud elastic scaling.

Elastic scaling solution

Resource management

Let's start with resource management. BeaconLogServer currently runs on more than 8,000 nodes, which requires a large amount of resources. Relying on the platform's public resource pool alone, it may not be possible to scale out quickly enough when traffic surges around holidays. After investigating the 123 platform (a PaaS platform) and the computing power platform (a resource platform), we therefore adopted a hybrid cloud approach to solve this problem.
Analyzing the BLS business scenarios, there are two situations in which traffic surges:

  • Daily business load rises slightly and the increase lasts only a short time
  • During the Spring Festival, business load rises significantly and the increase lasts for an extended period

For these scenarios, we use three resource types, as described in the following table:
| Type | Scenario | Set |
| --- | --- | --- |
| Public resource pool | Daily business | bls.sh.1 |
| Computing power platform | Small traffic peaks | bls.sh.2 |
| Dedicated resource pool | Spring Festival | bls.sh.3 |

For daily business, we use the public resource pool plus computing power resources: when business load rises slightly, computing power resources are used to scale out quickly so that the service's capacity level stays below the safety threshold. For the substantial load increase during the Spring Festival, a dedicated resource pool is built to absorb the extra traffic.

Elastic scale-out and scale-in

The previous section covered how resources are managed. For the different resource types, when should scale-out start and when should scale-in start?

Daily traffic is split between 123 platform public resources and the computing power platform at a ratio of 7:3. The current automatic scale-out threshold is 60%: when CPU usage exceeds 60%, the platform scales out automatically. Elastic scaling relies on the scheduling function of the 123 platform, with the specific settings as follows:

| Type | CPU scale-in threshold (%) | CPU scale-out threshold (%) | Minimum replicas | Maximum replicas |
| --- | --- | --- | --- | --- |
| 123 platform public resource pool | 20 | 60 | 300 | 1000 |
| Computing power platform | 40 | 50 | 300 | 1000 |
| 123 platform dedicated resource pool | 20 | 60 | 300 | 1000 |

It can be seen that the computing power platform has a higher scale-in threshold and a lower scale-out threshold than the other pools. The computing power platform exists to absorb sudden traffic increases and its resources are swapped in and out frequently, so its resources are the first to be scaled out and the first to be scaled in. The minimum number of replicas is the baseline the business needs; if the replica count falls below this value, the platform automatically tops it up. The maximum number of replicas is set to 1000 because that is the largest number of RS nodes the IAS platform (gateway platform) supports per city.
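To make the table concrete, here is a minimal Python sketch of the scale-in/scale-out rule it implies. The names (PoolConfig, desired_replicas) are purely illustrative; the actual scheduling is performed by the 123 platform itself.

```python
# Minimal sketch of the threshold rule from the table above; illustrative only,
# the real replica scheduling is done by the 123 platform.
from dataclasses import dataclass

@dataclass
class PoolConfig:
    name: str
    scale_in_cpu: int    # scale in when CPU% falls below this
    scale_out_cpu: int   # scale out when CPU% rises above this
    min_replicas: int    # baseline the platform always tops up to
    max_replicas: int    # capped by the IAS per-city RS limit (1000)

POOLS = [
    PoolConfig("123-public",      20, 60, 300, 1000),
    PoolConfig("computing-power", 40, 50, 300, 1000),
    PoolConfig("123-dedicated",   20, 60, 300, 1000),
]

def desired_replicas(pool: PoolConfig, current: int, cpu_percent: float) -> int:
    """Replica count after applying the pool's thresholds."""
    if cpu_percent > pool.scale_out_cpu:
        target = int(current * cpu_percent / pool.scale_out_cpu)   # grow proportionally
    elif cpu_percent < pool.scale_in_cpu:
        target = int(current * cpu_percent / pool.scale_in_cpu)    # shrink proportionally
    else:
        target = current
    # never drop below the guaranteed minimum or exceed the per-city maximum
    return max(pool.min_replicas, min(pool.max_replicas, target))

# The computing-power pool hits its 50% threshold first, so it absorbs bursts
# before the public pool does:
print(desired_replicas(POOLS[1], 400, 70))  # -> 560
```

Because the computing power pool scales out at 50% and scales in at 40%, it is the first to grow during a burst and the first to shrink afterwards, which matches the priority described above.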

Problems and solutions

In the course of rolling out this plan we ran into quite a few problems. Here are a few worth sharing.

1) First, the access layer. The service previously used TGW, which has the limitation that a single instance cannot exceed 200 RS nodes. BeaconLogServer currently has more than 8,000 nodes, so continuing with TGW would require applying for many domain names, making migration time-consuming and maintenance inconvenient. We investigated the IAS access layer instead: IAS layer 4 supports up to 1,000 nodes per city, which basically meets our needs. Based on this, we designed the following solution:

In general, traffic is separated by a "business + region" model, and when a cluster has more than 500 RS nodes in one city we consider splitting the service, following the rule sketched below. If a shared (public) cluster exceeds the threshold, the business with the largest volume, such as the video service, can be split out into its own cluster. If an independent business cluster exceeds the threshold, we first consider adding a city and shifting part of the traffic there; if adding a city is not possible, we add another IAS cluster and distribute traffic to the different clusters by region on the GSLB.
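The splitting rule can be summed up in a few lines. This is just a restatement of the text above (the 500-node threshold and the order of preference); the function name is hypothetical.

```python
# Hypothetical restatement of the cluster-splitting rule described above.
def split_action(is_shared_cluster: bool, rs_nodes_in_city: int,
                 can_add_city: bool, threshold: int = 500) -> str:
    if rs_nodes_in_city <= threshold:
        return "no action needed"
    if is_shared_cluster:
        return "split the largest business (e.g. video) into its own cluster"
    if can_add_city:
        return "add a city and move part of the traffic there"
    return "add another IAS cluster and split traffic by region on the GSLB"
```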

2) Different resource pools in the same city use different sets, so how does IAS route to the different sets within one city?
Polaris already has a wildcard group function, but IAS did not support wildcard set matching, so we pushed IAS to implement it. For example, bls.sh.% matches bls.sh.1, bls.sh.2, and bls.sh.3. Note that the IAS wildcard differs from the Polaris one: Polaris uses *, but when IAS launched the feature it was found that some users were already using * for literal matching, so IAS uses % as its wildcard instead.
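As a minimal sketch, assuming shell-style glob semantics for the % wildcard (the real matching is implemented inside IAS):

```python
# Sketch of IAS-style set matching, where '%' plays the role Polaris gives to '*'.
from fnmatch import fnmatchcase

def ias_set_matches(pattern: str, set_name: str) -> bool:
    """True if an IAS wildcard pattern (using '%') matches a set name."""
    return fnmatchcase(set_name, pattern.replace("%", "*"))

# bls.sh.% covers every Shanghai set regardless of resource pool:
assert ias_set_matches("bls.sh.%", "bls.sh.1")
assert ias_set_matches("bls.sh.%", "bls.sh.3")
assert not ias_set_matches("bls.sh.%", "bls.gz.1")
```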

3) The difficulty on the resource-management side was that IAS layer 4 nodes could not use computing power resources. After further communication, IAS was connected to computing power resources; the solution relies on SNAT.

Notes for this plan

  • Only IP addresses can be bound; instances cannot be pulled in directly, and an instance is not automatically unbound when it is destroyed, so it must be unbound explicitly through the console or API (the instances are cross-account, which is why they cannot be pulled in directly).
  • For large-scale traffic, evaluate in advance which gateways the traffic passes through, whether their capacity is sufficient, and how the risk will be controlled.

Automatic handling of single-machine failures

Single-machine fault handling effect

The goal of automatic single-machine failure handling is zero-manpower maintenance. The figure below is a screenshot of our automatic processing.

Single-machine fault handling solution

Single-machine failures are considered from two dimensions, the system level and the business level, as listed below:

| Dimension | Alarm item |
| --- | --- |
| System level | CPU |
| System level | Memory |
| System level | Network |
| System level | Disk |
| Business level | ATTA Agent unavailable |
| Business level | Queue too long |
| Business level | Success rate of sending data to ATTA |

For single-machine failures, we use open-source Prometheus together with Polaris (the registry center). Prometheus collects metrics and sends alarms, and our own code then removes the faulty node from Polaris.

As for handling alarm firing and alarm recovery: when an alarm fires, we first check the number of alerting nodes. If fewer than three nodes are alerting, we remove them from Polaris directly; otherwise the issue may have a common cause, so we send an alert that requires manual intervention. When the alarm recovers, we restart the node on the platform and it re-registers itself with Polaris.
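A hedged sketch of this firing/recovery flow, written as a Prometheus Alertmanager webhook receiver. remove_instance_from_polaris, notify_oncall, and restart_node are hypothetical placeholders; the real operations go through the Polaris and platform APIs, which are not shown here.

```python
# Sketch of the alarm-handling flow: Alertmanager posts alerts to this webhook;
# fewer than three alerting nodes are removed from Polaris automatically,
# otherwise a human is paged. All downstream calls are placeholders.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_AUTO_REMOVE = 3  # three or more simultaneous alarms likely share a common cause

def remove_instance_from_polaris(host: str) -> None:
    print(f"[polaris] would remove instance {host}")        # placeholder

def notify_oncall(hosts) -> None:
    print(f"[alert] {len(hosts)} nodes alerting, manual intervention needed: {hosts}")

def restart_node(host: str) -> None:
    print(f"[platform] would restart {host}; it re-registers itself in Polaris")

class AlertWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.loads(body)
        firing = [a for a in payload.get("alerts", []) if a["status"] == "firing"]
        resolved = [a for a in payload.get("alerts", []) if a["status"] == "resolved"]

        hosts = {a["labels"].get("instance", "") for a in firing}
        if 0 < len(hosts) < MAX_AUTO_REMOVE:
            for host in hosts:
                remove_instance_from_polaris(host)
        elif hosts:
            notify_oncall(sorted(hosts))

        for alert in resolved:
            restart_node(alert["labels"].get("instance", ""))

        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # hypothetical listen address/port for the webhook
    HTTPServer(("0.0.0.0", 9001), AlertWebhook).serve_forever()
```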

ATTA Agent exception handling

As shown in the figure, the processing flow has two paths: alarm firing and alarm recovery. When the business metric is abnormal, the current number of abnormal nodes is checked first to ensure that nodes are not removed on a large scale; the node is then removed from Polaris. When the business recovers, the node is simply restarted.

Problems and solutions

The main difficulties were the health check of the Prometheus agent itself and the dynamically changing set of BeaconLogServer nodes. For the first problem, the platform is now responsible for keeping the agent healthy. For the second, we use a timed script that pulls the node list from Polaris, combined with Prometheus's hot-reload capability, as sketched below.
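A minimal sketch of such a timed sync, assuming Prometheus discovers BeaconLogServer targets through file_sd_configs, which re-reads target files automatically without a restart. fetch_polaris_instances and the target file path are hypothetical placeholders for the real Polaris query and deployment layout.

```python
# Sketch of a cron-style sync: pull the current BeaconLogServer nodes from
# Polaris and rewrite a file_sd target file that Prometheus watches.
import json
import os
import tempfile

# hypothetical path referenced by a file_sd_configs entry in prometheus.yml
TARGETS_FILE = "/etc/prometheus/targets/beaconlogserver.json"

def fetch_polaris_instances() -> list[str]:
    """Placeholder: return 'ip:port' for every healthy BeaconLogServer instance."""
    return ["10.0.0.1:9100", "10.0.0.2:9100"]

def write_file_sd(instances: list[str]) -> None:
    doc = [{"targets": instances, "labels": {"service": "BeaconLogServer"}}]
    # write atomically so Prometheus never reads a half-written file
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(TARGETS_FILE))
    with os.fdopen(fd, "w") as f:
        json.dump(doc, f, indent=2)
    os.replace(tmp, TARGETS_FILE)

if __name__ == "__main__":
    write_file_sd(fetch_polaris_instances())
```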

Summary

Migrating to the cloud effectively solved the two problems of automatic capacity expansion and single-machine failure handling. It reduced manual operations, lowered the risk of human error, and improved the stability of the system. From this migration we also draw a few lessons:

  • Migration plan: before moving to the cloud, research the migration plan thoroughly, especially the functions that depend on platform support, to reduce the risk of the migration being blocked by missing platform capabilities.
  • Migration process: set up monitoring indicators in advance; after migrating traffic, watch the indicators closely and roll back promptly if problems occur.

