Author

Chang Yaoguo is an SRE expert at Tencent, currently working in the PCG Big Data Platform Department and responsible for the cloud migration, monitoring, and automation of services handling tens of millions of QPS.

Background

BeaconLogServer is the entry point for data reported by the Beacon SDK. It receives reports from many businesses, including Weishi, QQ, Tencent Video, QQ Browser, and the App Store, and therefore faces high concurrency, large requests, and sudden traffic spikes, with QPS exceeding tens of millions. Keeping the service's capacity at a safe level has traditionally required a lot of manual effort. How to use the cloud to achieve zero-manpower operation and maintenance is the focus of this article.

Hybrid cloud elastic scaling

The overall effect of elastic scaling

First, let's talk about automatic scale-out and scale-in. The figure below shows the overall design of BeaconLogServer's hybrid cloud elastic scaling.

Elastic scaling solution

Resource management

Let's start with resource management. BeaconLogServer currently runs on more than 8,000 nodes, which requires a large amount of resources. Relying on the platform's public resource pool alone, it may not be possible to scale out quickly enough when traffic surges around holidays. After investigating the 123 platform (a PaaS platform) and the computing power platform (a resource platform), we therefore adopted a hybrid cloud approach to solve this problem.
Analyzing the BLS business scenarios, there are two situations in which traffic surges:

  • Daily business load rises slightly and the increase lasts only a short time
  • During the Spring Festival, business load rises significantly and the increase lasts for an extended period

For these scenarios, we use three resource types, as described in the following table:
| Type | Scenario | Set |
| --- | --- | --- |
| Public resource pool | Daily business | bls.sh.1 |
| Computing power platform | Small traffic peaks | bls.sh.2 |
| Dedicated resource pool | Spring Festival | bls.sh.3 |

For daily business, we use the public resource pool plus computing power resources: when business load rises slightly, computing power resources are used to scale out quickly so that the service's capacity level stays below the safety threshold. For the substantial load increase during the Spring Festival, a dedicated resource pool is built to absorb the extra traffic.

Elastic scale-out and scale-in

The previous section covered how resources are managed. For the different resource types, when should scale-out start and when should scale-in start?

Daily traffic is split between 123 platform public resources and the computing power platform at a ratio of 7:3. The current automatic scale-out threshold is 60%: when CPU usage exceeds 60%, the platform scales out automatically. Elastic scaling relies on the scheduling function of the 123 platform, with the specific settings as follows:

| Type | CPU scale-in threshold (%) | CPU scale-out threshold (%) | Minimum replicas | Maximum replicas |
| --- | --- | --- | --- | --- |
| 123 platform public resource pool | 20 | 60 | 300 | 1000 |
| Computing power platform | 40 | 50 | 300 | 1000 |
| 123 platform dedicated resource pool | 20 | 60 | 300 | 1000 |

It can be seen that the computing power platform has a higher scale-in threshold and a lower scale-out threshold than the other pools. The computing power platform exists to absorb sudden traffic increases and its resources are swapped in and out frequently, so its resources are the first to be scaled out and the first to be scaled in. The minimum number of replicas is the baseline the business needs; if the replica count falls below this value, the platform automatically tops it up. The maximum number of replicas is set to 1000 because that is the largest number of RS nodes the IAS platform (gateway platform) supports per city.
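To make the table concrete, here is a minimal Python sketch of the scale-in/scale-out rule it implies. The names (PoolConfig, desired_replicas) are purely illustrative; the actual scheduling is performed by the 123 platform itself.

```python
# Minimal sketch of the threshold rule from the table above; illustrative only,
# the real replica scheduling is done by the 123 platform.
from dataclasses import dataclass

@dataclass
class PoolConfig:
    name: str
    scale_in_cpu: int    # scale in when CPU% falls below this
    scale_out_cpu: int   # scale out when CPU% rises above this
    min_replicas: int    # baseline the platform always tops up to
    max_replicas: int    # capped by the IAS per-city RS limit (1000)

POOLS = [
    PoolConfig("123-public",      20, 60, 300, 1000),
    PoolConfig("computing-power", 40, 50, 300, 1000),
    PoolConfig("123-dedicated",   20, 60, 300, 1000),
]

def desired_replicas(pool: PoolConfig, current: int, cpu_percent: float) -> int:
    """Replica count after applying the pool's thresholds."""
    if cpu_percent > pool.scale_out_cpu:
        target = int(current * cpu_percent / pool.scale_out_cpu)   # grow proportionally
    elif cpu_percent < pool.scale_in_cpu:
        target = int(current * cpu_percent / pool.scale_in_cpu)    # shrink proportionally
    else:
        target = current
    # never drop below the guaranteed minimum or exceed the per-city maximum
    return max(pool.min_replicas, min(pool.max_replicas, target))

# The computing-power pool hits its 50% threshold first, so it absorbs bursts
# before the public pool does:
print(desired_replicas(POOLS[1], 400, 70))  # -> 560
```

Because the computing power pool scales out at 50% and scales in at 40%, it is the first to grow during a burst and the first to shrink afterwards, which matches the priority described above.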

Problems and solutions

In the course of rolling out this plan we ran into quite a few problems. Here are a few worth sharing.

1) First, the access layer. The service previously used TGW, which has the limitation that a single instance cannot exceed 200 RS nodes. BeaconLogServer currently has more than 8,000 nodes, so continuing with TGW would require applying for many domain names, making migration time-consuming and maintenance inconvenient. We investigated the IAS access layer instead: IAS layer 4 supports up to 1,000 nodes per city, which basically meets our needs. Based on this, we designed the following solution:

In general, traffic is separated by a "business + region" model, and when a cluster has more than 500 RS nodes in one city we consider splitting the service, following the rule sketched below. If a shared (public) cluster exceeds the threshold, the business with the largest volume, such as the video service, can be split out into its own cluster. If an independent business cluster exceeds the threshold, we first consider adding a city and shifting part of the traffic there; if adding a city is not possible, we add another IAS cluster and distribute traffic to the different clusters by region on the GSLB.
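The splitting rule can be summed up in a few lines. This is just a restatement of the text above (the 500-node threshold and the order of preference); the function name is hypothetical.

```python
# Hypothetical restatement of the cluster-splitting rule described above.
def split_action(is_shared_cluster: bool, rs_nodes_in_city: int,
                 can_add_city: bool, threshold: int = 500) -> str:
    if rs_nodes_in_city <= threshold:
        return "no action needed"
    if is_shared_cluster:
        return "split the largest business (e.g. video) into its own cluster"
    if can_add_city:
        return "add a city and move part of the traffic there"
    return "add another IAS cluster and split traffic by region on the GSLB"
```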

2) Different resource pools in the same city use different sets, so how does IAS route to the different sets within one city?
Polaris already has a wildcard group function, but IAS did not support wildcard set matching, so we pushed IAS to implement it. For example, bls.sh.% matches bls.sh.1, bls.sh.2, and bls.sh.3. Note that the IAS wildcard differs from the Polaris one: Polaris uses *, but when IAS launched the feature it was found that some users were already using * for literal matching, so IAS uses % as its wildcard instead.
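As a minimal sketch, assuming shell-style glob semantics for the % wildcard (the real matching is implemented inside IAS):

```python
# Sketch of IAS-style set matching, where '%' plays the role Polaris gives to '*'.
from fnmatch import fnmatchcase

def ias_set_matches(pattern: str, set_name: str) -> bool:
    """True if an IAS wildcard pattern (using '%') matches a set name."""
    return fnmatchcase(set_name, pattern.replace("%", "*"))

# bls.sh.% covers every Shanghai set regardless of resource pool:
assert ias_set_matches("bls.sh.%", "bls.sh.1")
assert ias_set_matches("bls.sh.%", "bls.sh.3")
assert not ias_set_matches("bls.sh.%", "bls.gz.1")
```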

3) The difficulty on the resource-management side was that IAS layer 4 nodes could not use computing power resources. After further communication, IAS was connected to computing power resources; the solution relies on SNAT.

Notes for this plan

  • Only IP addresses can be bound; instances cannot be pulled in directly, and an instance is not automatically unbound when it is destroyed, so it must be unbound explicitly through the console or API (the instances are cross-account, which is why they cannot be pulled in directly).
  • For large-scale traffic, evaluate in advance which gateways the traffic passes through, whether their capacity is sufficient, and how the risk will be controlled.

Automatic handling of single-machine failures

Single-machine fault handling effect

The goal of automatic single-machine failure handling is zero-manpower maintenance. The figure below is a screenshot of our automatic processing.

Single-machine fault handling solution

Single-machine failures are considered from two dimensions, the system level and the business level, as listed below:

| Dimension | Alarm item |
| --- | --- |
| System level | CPU |
| System level | Memory |
| System level | Network |
| System level | Disk |
| Business level | ATTA Agent unavailable |
| Business level | Queue too long |
| Business level | Success rate of sending data to ATTA |

For single-machine failures, we use open-source Prometheus together with Polaris (the registry center). Prometheus collects metrics and sends alarms, and our own code then removes the faulty node from Polaris.

As for handling alarm firing and alarm recovery: when an alarm fires, we first check the number of alerting nodes. If fewer than three nodes are alerting, we remove them from Polaris directly; otherwise the issue may have a common cause, so we send an alert that requires manual intervention. When the alarm recovers, we restart the node on the platform and it re-registers itself with Polaris.
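A hedged sketch of this firing/recovery flow, written as a Prometheus Alertmanager webhook receiver. remove_instance_from_polaris, notify_oncall, and restart_node are hypothetical placeholders; the real operations go through the Polaris and platform APIs, which are not shown here.

```python
# Sketch of the alarm-handling flow: Alertmanager posts alerts to this webhook;
# fewer than three alerting nodes are removed from Polaris automatically,
# otherwise a human is paged. All downstream calls are placeholders.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_AUTO_REMOVE = 3  # three or more simultaneous alarms likely share a common cause

def remove_instance_from_polaris(host: str) -> None:
    print(f"[polaris] would remove instance {host}")        # placeholder

def notify_oncall(hosts) -> None:
    print(f"[alert] {len(hosts)} nodes alerting, manual intervention needed: {hosts}")

def restart_node(host: str) -> None:
    print(f"[platform] would restart {host}; it re-registers itself in Polaris")

class AlertWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.loads(body)
        firing = [a for a in payload.get("alerts", []) if a["status"] == "firing"]
        resolved = [a for a in payload.get("alerts", []) if a["status"] == "resolved"]

        hosts = {a["labels"].get("instance", "") for a in firing}
        if 0 < len(hosts) < MAX_AUTO_REMOVE:
            for host in hosts:
                remove_instance_from_polaris(host)
        elif hosts:
            notify_oncall(sorted(hosts))

        for alert in resolved:
            restart_node(alert["labels"].get("instance", ""))

        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # hypothetical listen address/port for the webhook
    HTTPServer(("0.0.0.0", 9001), AlertWebhook).serve_forever()
```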

ATTA Agent exception handling

As shown in the figure, the processing flow has two paths: alarm firing and alarm recovery. When the business metric is abnormal, the current number of abnormal nodes is checked first to ensure that nodes are not removed on a large scale; the node is then removed from Polaris. When the business recovers, the node is simply restarted.

Problems and solutions

The main difficulties were the health check of the Prometheus agent itself and the dynamically changing set of BeaconLogServer nodes. For the first problem, the platform is now responsible for keeping the agent healthy. For the second, we use a timed script that pulls the node list from Polaris, combined with Prometheus's hot-reload capability, as sketched below.
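A minimal sketch of such a timed sync, assuming Prometheus discovers BeaconLogServer targets through file_sd_configs, which re-reads target files automatically without a restart. fetch_polaris_instances and the target file path are hypothetical placeholders for the real Polaris query and deployment layout.

```python
# Sketch of a cron-style sync: pull the current BeaconLogServer nodes from
# Polaris and rewrite a file_sd target file that Prometheus watches.
import json
import os
import tempfile

# hypothetical path referenced by a file_sd_configs entry in prometheus.yml
TARGETS_FILE = "/etc/prometheus/targets/beaconlogserver.json"

def fetch_polaris_instances() -> list[str]:
    """Placeholder: return 'ip:port' for every healthy BeaconLogServer instance."""
    return ["10.0.0.1:9100", "10.0.0.2:9100"]

def write_file_sd(instances: list[str]) -> None:
    doc = [{"targets": instances, "labels": {"service": "BeaconLogServer"}}]
    # write atomically so Prometheus never reads a half-written file
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(TARGETS_FILE))
    with os.fdopen(fd, "w") as f:
        json.dump(doc, f, indent=2)
    os.replace(tmp, TARGETS_FILE)

if __name__ == "__main__":
    write_file_sd(fetch_polaris_instances())
```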

Summary

Migrating to the cloud effectively solved the two problems of automatic capacity expansion and single-machine failure handling. It reduced manual operations, lowered the risk of human error, and improved the stability of the system. From this migration we also draw a few lessons:

  • Migration plan: before moving to the cloud, research the migration plan thoroughly, especially the functions that depend on platform support, to reduce the risk of the migration being blocked by missing platform capabilities.
  • Migration process: set up monitoring indicators in advance; after migrating traffic, watch the indicators closely and roll back promptly if problems occur.

