Lead
As the department's business-security work continues to expand, centered on scenarios such as CAPTCHA verification and financial advertising, Tencent Waterdrop, the real-time risk control system supporting business-security confrontation, must run an ever-growing number of real-time online tasks and serve an ever-growing volume of requests. To support rapid business launches and fast scaling of resources, and because the company's self-developed cloud project is moving toward full containerization, the Waterdrop risk control platform began its migration to the self-developed cloud. This article summarizes the Waterdrop platform's practical experience during this migration, in the hope that it offers some reference value for other businesses moving to the cloud.
Waterdrop backend architecture
The Waterdrop platform is a high-availability, high-performance, low-latency real-time risk control strategy platform for business-security confrontation. It provides a set of basic components with which strategy developers can build strategy models, helping them quickly complete model construction, testing, and verification.
The Waterdrop system architecture is shown in the figure below:
The Waterdrop real-time risk control platform consists of two main parts: a configuration processing module and a data processing module.
The configuration processing module is composed of the front-end web pages, cgi, mc_srv, and Zookeeper. Strategy developers edit strategy models and create, launch, and update strategy tasks on Waterdrop's front-end pages. The completed strategy-model information is stored in the Zookeeper data center in JSON format through the cgi and mc_srv interfaces, and the data processing module pulls the policy information for each service from Zookeeper through an agent.
The data processing module is composed of the access layer, the engine, and external systems; the core processing logic lives in the engine module. Independent engine instances are started for different businesses to ensure isolation between them. Business data requests are sent to a designated Polaris service address or an ip:port address; the access layer receives the request data and forwards it to the engine instance of the corresponding task according to the task number. If an external-system access component is configured, the engine also queries the external system for the requested data.
Self-developed cloud practice
When moving the Waterdrop platform to the cloud, we first familiarized ourselves with and tested the features of the TKE (Tencent Kubernetes Engine) platform, and identified the key issues affecting the migration:
- The Monitor monitoring system is not connected to TKE; continuing to use Monitor for metric monitoring would require a lot of manual intervention;
- Responding to traffic bursts required manual scaling, which was cumbersome and slow;
- TKE supports Polaris routing, while the original CL5 (Cloud Load Balance 99.999%) has several problems.
To address these key migration issues, we carried out transformation and optimization in three areas, metric monitoring, containerization, and traffic distribution, to ensure a smooth migration of the services to the cloud.
Indicator monitoring transformation
The Waterdrop platform used the Monitor system for viewing system metrics and managing alarms. During the migration, however, we found that Monitor has many problems that hinder running on the cloud. To solve them, we migrated the metric monitoring system from Monitor to the Zhiyan monitoring system.
Problems with the Monitor monitoring system
- TKE is not connected to the Monitor system. When the IP address of a cloud instance changes, the new container instance IP must be added to Monitor manually, and frequent instance IP changes in a cloud environment make this hard to maintain.
- Monitor metrics for container instances in NAT network mode cannot be distinguished at the instance level; identical attribute metrics under NAT mode make instance-level inspection impossible.
- The Monitor system is inflexible: adding a new attribute requires applying for an attribute ID and then adjusting and redeploying the code.
Zhiyan metric transformation process
Metric reporting in Monitor is built around attribute IDs and attribute values. For each metric you must apply for an attribute ID in advance, integrate the Monitor SDK, and instrument the platform code with reporting calls for each attribute ID. For example, per-task request-volume metrics each require an attribute ID applied for in advance.
During the migration to Zhiyan, the platform integrated the Zhiyan Golang SDK and converted the original Monitor reporting to Zhiyan reporting. The attribute-metric mindset must be converted into multi-dimensional metrics, which requires a clear understanding of Zhiyan's concepts of dimensions and metrics and a deliberate metric design.
For example, by defining a task dimension, per-task values can be reported with a single call.
Zhiyan metric and dimension design:
The most important part of the Zhiyan transformation is understanding what metrics and dimensions mean.
- Metric: a measured field used for aggregation or related calculations.
- Dimension: an attribute of the metric data, typically used to filter metric data by attribute.
Dimensions capture the common, abstractable characteristics of the metric data, such as instance IP, task number, component ID, and metric status; attributes that cannot be abstracted into dimensions remain metric values. In the early stage of the Zhiyan transformation we did not design dimensions carefully, which led to confusion between metrics and dimensions and made later additions and maintenance harder.
Zhiyan alarm notification optimization
After the Zhiyan metric transformation was complete, we separated platform-side and business-side metric alarms, and forwarded business-side alarms to the business side through alarm callbacks so that it can handle anomalies promptly. This improved the timeliness with which the business side learns of anomalies and reduced the platform side's burden of handling business-side alarms.
We also optimized the Zhiyan dashboard, consolidating the commonly used metric views into a Zhiyan DashBoard page so that operators can quickly grasp the key metrics.
Route distribution optimization
Routing problems
1. When CL5 queries a given SID node for the first time, it is prone to -9998 errors;
2. The CL5 SDK cannot do nearby access in NAT network mode, so data requests easily time out when the service is in a remote region.
Migration to Polaris (Tencent's service discovery and governance platform)
When request routing used CL5 and the container instance ran in NAT mode, the CL5 interface could not obtain the IP address of the underlying physical machine, so requests could not be served by nearby instances. After switching the load-balancing API from CL5 to the Polaris service interface, Polaris can correctly obtain the IP of the physical machine hosting the container instance under the NAT network model, enabling nearby access.
During the migration from CL5 to Polaris, we replaced the original CL5 SDK with polaris-go (the Golang SDK for Polaris).
An example of using polaris-go:
```go
// ---------- Obtain a service instance ----------
// Build a request for a single instance of the service.
// (consumer is a Polaris ConsumerAPI created once at startup.)
getInstancesReq := &api.GetOneInstanceRequest{}
getInstancesReq.FlowID = atomic.AddUint64(&flowId, 1)
getInstancesReq.Namespace = papi.Namespace
getInstancesReq.Service = name
// Perform service discovery and get a single service instance.
getInstResp, err := consumer.GetOneInstance(getInstancesReq)
if nil != err {
	return nil, err
}
targetInstance := getInstResp.Instances[0]

// ---------- Report the service-call result ----------
// Build the report for the result of this service call.
svcCallResult := &api.ServiceCallResult{}
// Set the called instance.
svcCallResult.SetCalledInstance(targetInstance)
// Set the call status (an enum: success or failure).
if result >= 0 {
	svcCallResult.SetRetStatus(api.RetSuccess)
} else {
	svcCallResult.SetRetStatus(api.RetFail)
}
// Set the return code of the call.
svcCallResult.SetRetCode(result)
// Set the call latency.
svcCallResult.SetDelay(time.Duration(usetime))
// Report the call result back to Polaris.
consumer.UpdateServiceCallResult(svcCallResult)
```
Containerized transformation
As the Waterdrop platform architecture diagram shows, after the business side creates tasks on the platform, the platform launches different engine instances to run the computations for each task. Waterdrop tasks and engine instances have a 1:N relationship: the more tasks there are, the more engine instances need to be deployed online. To launch engine instances for different tasks quickly, we must be able to deploy and bring the corresponding instances online rapidly. Containerizing the engine modules and moving them onto the self-developed cloud therefore improves operational efficiency.
As request volume changes, the data processing module needs to scale its access and engine instances, so these instances are expanded and shrunk frequently.
The architecture of the Waterdrop data processing module:
Physical machine deployment
- Task creation: when a new task is added, you must apply for the Polaris name-service address for the task, deploy and start the task's engine processes on different physical machines, and manually bind the engine instances to the Polaris name service. Processes have to be started and managed by hand and the load-balancing service updated accordingly, making management complicated and operations costly.
- Task upgrade: during a program upgrade, every engine process belonging to the task must be updated, and all related engine processes must be restarted.
- Task scaling: scaling out requires deploying and starting new engine processes on physical machines and then adding the new instances to the corresponding Polaris name service; scaling in requires first removing the processes from the Polaris name service and then stopping the corresponding engine processes. The scale-out procedure is similar to the upgrade procedure.
TKE platform deployment
- Task creation: when adding a new task, apply for the task's Polaris name-service address, then create the corresponding engine application instance on the TKE platform.
- Task upgrade: to upgrade a task's program, simply update the image version of the task's engine instances.
- Task scaling: instance counts are adjusted automatically by configuring HPA (Horizontal Pod Autoscaler) on the TKE platform page.
Experience in improving cloud-native maturity
1. Divide services by business type. CPU-intensive services get Pods with a larger number of cores, while IO-intensive services (currently mostly handling bursty traffic, where network buffers easily become the bottleneck) get Pods with a small number of cores. For example, a single Pod for a CPU-intensive service takes 4 cores, while a single Pod for bursty traffic takes 0.25 or 0.5 cores.
2. Container services use the HPA mechanism. When a service is onboarded, its required CPU and memory resources are estimated from its request volume, the Pod Request values are set from that estimate, and the Request value is usually kept at about 50% of the Limit value.
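A minimal sketch of such an HPA configuration (the names and numbers are illustrative, not taken from the Waterdrop deployment):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: waterdrop-engine-hpa     # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: waterdrop-engine       # hypothetical engine Deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # scale when CPU use passes ~50% of the Request
```

Paired with container resources such as `requests.cpu: "2"` against `limits.cpu: "4"`, this keeps the Request at about 50% of the Limit as described above.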
Results of the self-developed cloud migration
During the migration, the Waterdrop platform has been moved onto TKE, which has brought significant efficiency gains.
The efficiency improvements brought about by the cloud are mainly as follows:
- Applying for cloud resources is simpler and faster. Before the migration, the cycle for machine application and relocation, virtual IP application, and machine transfer took about a week; after the migration, the resource-application cycle is down to the hour level.
- Machine resource utilization increased by 67%: CPU utilization was about 36% before the migration and is 59.9% after.
- Sudden traffic no longer requires manual scaling; the HPA mechanism handles it, cutting scaling time from about 15 minutes to one or two minutes.
- Business strategy deployment and rollout cycles were shortened from 2 hours to 10 minutes.