Authors

Zhuo Xiaoguang, senior development engineer at Tencent Advertising, is responsible for the overall back-end architecture design of news and video advertising. He has more than ten years of hands-on experience building high-performance, highly available back-end services at massive scale, and is currently leading the team through a full migration to a cloud-native technology stack.

Wu Wenqi, development engineer at Tencent Advertising, is responsible for back-end development for monetizing news and video advertising traffic. He is familiar with applying cloud-native architecture in production and has many years of experience building high-performance, highly available back-end services. He is currently driving the team's adoption of cloud native.

Chen Hongzhao, senior development engineer at Tencent Advertising, is responsible for back-end development for monetizing news and video advertising traffic. He specializes in architecture optimization and upgrades and has extensive hands-on experience with massive back-end services. He is currently focused on how the advertising system evolves for different traffic scenarios.

1. Introduction

The news video advertising team is mainly responsible for monetizing media advertising traffic from Tencent News, Tencent Video, and Tencent Weishi. As media traffic becomes increasingly complex and advertising monetization efficiency keeps improving, the number of back-end services the team must develop and maintain has grown rapidly, and the cost of operating them keeps rising. The traditional physical-machine-based deployment and operations model has become a bottleneck that limits what the team can accomplish with its manpower.

Since 2019, the company has encouraged all R&D teams to migrate their services to the cloud and to carry out cloud-native transformation. In response, after fully investigating the benefits of cloud services, we decided to move the services maintained by the team to the cloud in batches to improve deployment and operations efficiency.

2. "Three Steps" for Business to Cloud


Figure 2-1 Overall architecture of the news video advertising back end

News video advertising mainly handles advertising traffic from Tencent News, Tencent Video, Pianduoduo, and Tencent Weishi, and is responsible for improving the monetization efficiency of that traffic, covering traffic access, ad-format optimization, eCPM optimization, and so on. As shown in the figure above, because the business requirements we serve are complex and diverse and differ significantly from one another (content advertising, rewarded advertising, and so on), the news video advertising back-end services have the following characteristics:

  1. Request volume is large: online services must handle massive request traffic, with high requirements on performance and stability;
  2. The number of services is large: load differs greatly between services, and the same service's load varies widely across time periods, so manual operations are costly;
  3. There are many external dependencies, and the release and operations process is deeply tied to the traditional physical-machine model. The historical burden is heavy, which makes a smooth migration to the cloud a major challenge.

In response to these problems, we formulated a "three-step" plan for migrating our services to the cloud:

  1. Build the basic components that cloud migration depends on and standardize the migration process;
  2. Quickly migrate offline services to the cloud;
  3. Smoothly migrate massive online services to the cloud.

3. Step One: Building Basic Components and Standardizing the Cloud Migration Process

To improve migration efficiency and enable each service owner to migrate their own services, we first built the basic components required for running on the cloud and wrote up migration guidelines, covering two main parts: containerized CI/CD and smooth service migration.

3.1 Containerized CI/CD Configuration

The first question when moving to the cloud is how services are deployed on the cloud platform. The table below compares deployment on physical machines with deployment on the cloud platform.

| Comparison | Physical machine | Cloud platform |
| --- | --- | --- |
| Scheduling unit | Server | Pod |
| Number of units | Few | Many |
| Quantity change | Relatively fixed; capacity can only be changed after an approval process | Highly flexible; scales dynamically with load at any time |

Table 3-1 Comparison of physical machine and cloud platform deployment

As the table shows, physical machines and cloud platforms present very different situations. In the physical-machine era we deployed with Zhiyun: server targets were configured manually, and binaries and releases were managed on the Zhiyun platform. On the cloud platform, however, the cluster consists of a large number of dynamically changing Pods, while Zhiyun assumes the objects it manages are servers with fixed IPs, so it no longer fits. We therefore turned to Blue Shield for automated integration and deployment.

Blue Shield manages the integration and release processes in a unified way: a code merge automatically triggers compilation, regression testing, review, and release, with no extra manual work. To stay compatible with the existing physical-machine release process and keep hybrid deployments on the same version, binary packages are still managed by Zhiyun; the Blue Shield pipeline compiles the artifacts and pushes them to Zhiyun. The pipeline then pulls the binary from Zhiyun and packages it with the base image provided by operations (which contains the required agents), so that the environment and the binary are standardized, templated, and released as a single image. Scaling out no longer requires manually standardizing servers. In the end we achieved the following goals:

  1. Automated deployment, saving manpower;
  2. Compatibility with the physical-machine release process, making hybrid deployment easy;
  3. A templated environment that enables rapid scale-out.


Figure 3-1 CI/CD pipeline implemented with Blue Shield

3.2 Smooth Migration of Back-end Services to the Cloud

3.2.1 Smooth migration of basic components to the cloud

The news video advertising back-end services rely on basic components such as Polaris and Zhiyan for load balancing, metrics reporting, and other basic capabilities. After moving these services to the cloud, however, we found that the components no longer worked as expected, affecting normal operation. After in-depth investigation, we found two main reasons:

  1. The container IP does not belong to the IDC network segment, so the agents of these basic components running inside the container cannot communicate with their servers;
  2. The container IP changes whenever the container is upgraded or migrated, so the IPs registered with these components frequently become invalid.

Both problems are caused by the default overlay network policy that the TKE platform configures for containers. To solve them, the TKE platform recommended the advanced network policies "floating IP" and "recycle when the app is deleted or scaled in". Unlike the overlay policy, the floating IP policy allocates an IP from an IDC network segment specified by operations, so services on other IDC machines can reach the container directly; the "recycle when the app is deleted or scaled in" policy binds the IP to the corresponding Pod, so the IP does not change even when the Pod is updated or migrated.

When creating a new workload, we select the floating IP and "recycle when the app is deleted or scaled in" policies in the advanced settings, so the components of new workloads work correctly. At the same time, we modify the YAML of existing workloads and add the configuration items shown in the figure below, aligning existing workloads with new ones. This two-pronged approach eliminates the impact of the network policy on the basic components and lets them work together with the cloud platform.
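
For reference, a minimal sketch of what such a workload spec can look like is shown below. The annotation keys here are illustrative placeholders only: the exact keys and values for the floating IP and "recycle when the app is deleted or scaled in" policies depend on the TKE platform version and are normally written by the console, as in Figure 3-2.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ad-access-service                  # illustrative workload name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: ad-access-service
  template:
    metadata:
      labels:
        app: ad-access-service
      annotations:
        # Illustrative placeholders: the real annotation keys are set by the
        # TKE console when "floating IP" and "recycle when the app is deleted
        # or scaled in" are selected in the advanced network settings.
        example.tke/network-mode: "floating-ip"
        example.tke/ip-recycle-policy: "on-app-delete-or-scale-in"
    spec:
      containers:
      - name: ad-access-service
        image: example.registry/ad/ad-access-service:latest
```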


Figure 3-2 Policy selection when creating a new workload on the TKE platform


Figure 3-3 Modifying the configuration of an existing workload on the TKE platform

3.2.2 Smooth Traffic Migration

After a service is successfully deployed on the cloud, the next step is to let the cloud cluster take external traffic and serve it. News video traffic is currently routed to physical machines; switching all of it to the cloud services at once would run into the following problems:

  1. The cluster on the cloud has not been tested by real traffic, so unknown problems could be exposed directly in production;
  2. The new cluster has not been warmed up and can only tolerate a low QPS.

Clearly, such a radical cutover would be extremely dangerous and could easily cause incidents or even bring the cluster down. We therefore needed to switch traffic to the cloud platform gradually, leaving plenty of time to deal with problems along the way.

To achieve a smooth cutover, we followed two migration guidelines:

  1. Move in small steps. Traffic is migrated in multiple stages, switching only a small portion to the cloud platform each time. Only after system monitoring and business metrics show no abnormality do we proceed to the next stage.
  2. Verify in grayscale. For release operations on physical machines, we set aside some resources to build a grayscale cluster. Before each traffic switch, we first try the change on the grayscale cluster, and only after it is verified there do we apply it to the production cluster.

In the end, we found and fixed problems before the full cutover, kept online services stable, and achieved a smooth traffic migration.

4. Step Two: Quickly Migrating 150+ Offline Services to the Cloud

News video advertising has a large number of offline back-end services, more than 150 in total, with complex and diverse functions ranging from compute-heavy services such as video feature extraction to low-load services that only send notifications. How do we design resource sizing, optimization, and scheduling schemes that suit so many offline services? For this problem, we distilled a set of migration practices tailored to the characteristics of our offline services.

4.1 Offline Service Resource Allocation Scheme

4.1.1 Design of Resource Allocation Template

To improve CPU and memory utilization, we want to allocate appropriate resources to different types of services so that resources are fully used. During the migration we settled on a set of sizing rules: for a single Pod of an offline service, CPU is kept between 0.25 and 16 cores, memory must be at least 512 MB, and beyond those lower bounds it is best to keep a 1:2 ratio between the number of cores and memory (in GB). Why follow these rules? Here is our summary:

  1. If the number of CPU cores requested is greater than 32, the TKE platform may be unable to find a node with enough idle cores during scheduling, so Pod scale-out fails;
  2. If the number of CPU cores requested is greater than 32, the TKE platform's resources become fragmented and the overall resource utilization of the cluster drops;
  3. If a container is allocated less than 0.25 CPU cores or less than 512 MB of memory, the agents of shared components such as Polaris cannot run normally and the container cannot start properly.

In practice, combining these rules with hands-on migration experience, we summarized the recommended configurations for different types of services in the table below.

| Service type | CPU cores | Memory |
| --- | --- | --- |
| Offline timed polling service | 0.5 cores | 1 GB |
| Offline operation service | 2 cores | 4 GB |
| Offline computing service | 8 cores | 16 GB |

Table 4-1 Recommended configurations for different types of services
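
Taking the offline computing service row in Table 4-1 as an example, a minimal sketch of the corresponding container resource configuration looks like this (the workload name and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: offline-feature-extract            # illustrative offline computing service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: offline-feature-extract
  template:
    metadata:
      labels:
        app: offline-feature-extract
    spec:
      containers:
      - name: worker
        image: example.registry/ad/offline-feature-extract:latest
        resources:
          requests:                        # what the scheduler reserves for the Pod
            cpu: "8"                       # 8 cores, per the recommended configuration
            memory: 16Gi                   # keeps the 1:2 cores-to-memory ratio
          limits:                          # hard ceiling enforced on the container
            cpu: "8"
            memory: 16Gi
```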

4.1.2 Slimming Down the Base Image

Following the recommended configuration table, we deployed the offline services on the TKE platform. At first the services ran smoothly. Over time, however, more and more agents were added to the shared base image, and the resources they consumed kept growing. For services with small resource allocations, such as notification services, the agents eventually consumed even more resources than the service itself, leading to OOM kills and containers that could not start.

We wanted to rein in the resource consumption of the growing set of agents while still benefiting from updates to the shared image. Is there a way to have both? Yes. When the CI/CD pipeline builds the business image, we add a step that deletes redundant agents via a RUN command: we configure the agents the service actually depends on, deselect the unused ones, and delete them. Because the RUN command only adds a new, immutable file layer, it does not modify the underlying shared-image layers, and when the shared image updates an agent the business image still picks up the change. In this way we save the resources the unused agents would consume, keep low-resource services running normally, and still enjoy automatic agent updates from the shared image.


Figure 4-1 Schematic diagram of image slimming

4.2 HPA Configuration Scheme for Offline Services

The load of many offline services is not constant; daytime and nighttime load often differ greatly. If these services are given a fixed amount of resources, the resources may be insufficient at peak load and sit idle at low load. We therefore want the resources allocated to containers to change as the service load changes, and we use the HPA component to achieve this elastic scaling.

4.2.1 HPA configuration template design

To use HPA, we first need to choose the metric that triggers scaling. The core idea is to pick the metric that varies the most, and to avoid choosing a metric with little variation that would constrain the number of Pods while some other load metric blows past its resource limit. For offline tasks, CPU usage generally changes with the start and end of tasks, while memory mostly holds a relatively fixed context and stays stable, so CPU utilization is the reasonable choice.


Figure 4-2 Selecting CPU utilization as the HPA metric

Second, we specify upper and lower bounds on the number of Pods. The upper bound prevents a bad configuration from creating a huge number of Pods, exhausting cluster resources and affecting cluster stability. The lower bound serves two purposes. First, it keeps the HPA's minimum replica count above zero: if the replica count ever reached zero, the HPA could no longer collect the metrics that scaling decisions depend on, would lose the ability to adjust the replica count, and the workload would be stuck at zero Pods with nothing available to serve. Second, if a few Pods become unserviceable due to failures, keeping a minimum number of Pods reduces the impact of those failures on the service.
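
Putting the metric choice and the replica bounds together, a minimal HPA sketch looks like the following; the target utilization and bounds are illustrative values, not the ones we use in production.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: offline-task-hpa                   # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: offline-task                     # the offline workload being scaled
  minReplicas: 2                           # never zero: metrics stay collectable, failures are absorbed
  maxReplicas: 50                          # cap protects the cluster from a bad configuration
  metrics:
  - type: Resource
    resource:
      name: cpu                            # CPU utilization varies most for offline tasks
      target:
        type: Utilization
        averageUtilization: 60             # scale out above 60%, scale in below it
```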


Figure 4-3 Setting upper and lower limits on the number of HPA replicas

Finally, we choose the scaling strategy according to the shape of the load curve.

  1. If the load curve is essentially the same every day and in every time period, we can use the CronHPA component to specify the number of Pods for each time period and schedule the changes on a timetable.
  2. If the load curve differs from day to day in both trend and magnitude, we can configure an HPA with a target CPU utilization, so that the cluster adjusts the number of Pods whenever utilization rises above or falls below the target, keeping each Pod's CPU utilization within tolerance.
  3. If the load curve differs from day to day but has predictable peaks, then to avoid the cluster scaling more slowly than the load grows and being overwhelmed, we can combine CronHPA with HPA: leave control to the HPA most of the time and use CronHPA to scale out in advance of the peak. This protects the cluster at peak time while still enjoying automatic adjustment of the Pod count; a minimal sketch follows this list.
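
The sketch below illustrates the third approach. The HPA part is the standard one shown in section 4.2.1; the CronHPA resource here follows the schema of the open-source kubernetes-cronhpa-controller and is only illustrative, since the exact CRD and fields offered by the TKE platform's CronHPA component may differ. The schedules and replica counts are made-up examples.

```yaml
# Day-to-day scaling is left to the HPA from section 4.2.1; the CronHPA jobs below
# only raise the replica count ahead of the known peak and lower it afterwards.
apiVersion: autoscaling.alibabacloud.com/v1beta1   # illustrative: schema of the open-source
kind: CronHorizontalPodAutoscaler                  # kubernetes-cronhpa-controller CRD
metadata:
  name: offline-task-cronhpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: offline-task
  jobs:
  - name: scale-out-before-evening-peak
    schedule: "0 0 19 * * *"                       # 19:00 every day, ahead of the peak
    targetSize: 30
  - name: scale-in-after-peak
    schedule: "0 0 23 * * *"                       # 23:00, hand control back to the HPA
    targetSize: 5
```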


Figure 4-5 Using CronHPA with HPA

With that, the HPA setup is complete, and through these strategies the resources allocated by the cluster can track the service load.

4.2.2 Automatic IP Whitelist Updates

Because the number of Pods changes dynamically, the IPs a service runs on also change frequently. When advertising services call downstream interfaces, most of those interfaces require a static IP whitelist check, which no longer fits services deployed in a cloud-native environment. We therefore pushed the downstream whitelists to support dynamically adding container IPs and embrace cloud native, using different approaches depending on how sensitive the downstream permission is.

  1. For interfaces with a lower sensitivity level, we encourage the interface owners to provide an automatic IP reporting interface and issue credentials to each caller; before the service starts, the caller invokes this interface to report its current IP, which is then added to the whitelist. Polaris, for example, provides us with a reporting script, and we only need to call it in the container's startup script (a minimal sketch follows this list). Since the IP does not change while the service is running, reporting the IP once when the Pod starts is enough to keep the service stable.
  2. For interfaces with a high sensitivity level, the owners do not trust operations performed by callers and worry that a caller's mistake could defeat the whitelist protection; the CDB team, for instance, worries about leaking private advertising data. We pushed these teams to work with the TKE platform and grant authorization to it: after a caller applies for a credential, it is hosted on the TKE platform, and the platform reports the new IP whenever a container starts.
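
The first case can be sketched as below: the container's startup command first calls the reporting script and then starts the service, so the Pod's current IP is whitelisted before any traffic arrives. The workload name and the script and binary paths are hypothetical; the actual reporting script is provided by the downstream interface owner (Polaris in our example).

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ad-whitelist-demo                  # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ad-whitelist-demo
  template:
    metadata:
      labels:
        app: ad-whitelist-demo
    spec:
      containers:
      - name: server
        image: example.registry/ad/ad-whitelist-demo:latest
        command: ["/bin/sh", "-c"]
        args:
          # report_ip.sh is a hypothetical name for the reporting script supplied
          # by the downstream side; it adds the Pod's current IP to the whitelist
          # before the service process starts handling traffic.
          - >
            /usr/local/whitelist/report_ip.sh &&
            exec /usr/local/services/bin/server
```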


Figure 4-6 Configuring authorization on the TKE platform

In the end, downstream interfaces can perceive container scaling, and whitelists are updated automatically, so services keep working correctly while elastic scaling takes effect.

5. Step Three: Smoothly Migrating Massive Online Services to the Cloud

The news video advertising system carries traffic on the order of tens of billions of requests, and the peak QPS of back-end online services reaches hundreds of thousands. After migrating to the cloud, the services still have to remain high-performance, highly available, and scalable. To meet this requirement, we worked with the operations team and the TKE platform team to solve the problems encountered along the way and to migrate these massive online services smoothly.

5.1 Latency Spike Optimization for Compute-Intensive Services

In the advertising system, online services are mostly compute-intensive and need high performance to finish their computations on time. If the computing threads are frequently scheduled across different CPU cores, cache misses increase, program performance drops, and request latency frequently spikes. Services deployed on physical machines therefore make heavy use of CPU pinning, manually specifying which CPUs the service runs on to improve locality and performance.

After moving to the cloud, however, the platform virtualizes CPU resources. Inside the container, "/proc/cpuinfo" shows a virtual CPU list produced after allocation, isolation, and renumbering, which differs greatly from the node's actual CPU list. Pinning to the virtual list may bind the process to a CPU that was never allocated to it, failing to deliver the expected performance, or even to a CPU that does not exist, causing program errors.


Figure 5-1 Virtualized CPU list seen inside a container on a 96-core node

We therefore needed a way to obtain the actual CPU list on the cloud platform. Since the platform manages and isolates resources through cgroups, we looked for a cgroup interface that exposes the actual CPU list, and found that the cpuset subsystem's "/sys/fs/cgroup/cpuset/cpuset.cpus" file provides the mapping of the virtual CPU list onto the physical CPUs, which is exactly what we needed. We modified the CPU-list lookup in the pinning code to read this cgroup file instead of the proc interface, and CPU pinning then worked correctly on the cloud.


Figure 5-2 CPU list actually allocated, as reported by cgroup

5.2 High Availability Guarantee for Online Services

The life cycle of an online service has three stages: starting, ready, and being destroyed. A service can serve external requests only while it is ready, so cluster availability depends on the proportion of instances behind the load balancer that are in the ready state. To improve availability we can work in two directions:

  1. Shorten service startup, increasing the proportion of the life cycle spent in the ready state.
  2. Reduce the failure rate of requests to the service by ensuring that instances still starting up or being destroyed are never behind the load balancer.

5.2.1 Reducing Service Startup Time

To shorten service startup, we need to analyze which operations dominate the startup time. We found that during startup the advertising services use the file synchronization service byteflood to subscribe to data files (advertisements, creatives, advertisers, and so on) and sync them into the local container. This step is very time-consuming and accounts for the bulk of the context-loading phase. Is there a way to reduce it?

Digging deeper, we found room for optimization: byteflood was pulling the full data set every time. Why couldn't it pull incrementally? Because the files byteflood syncs are stored in the container's writable layer, so whenever the container is rebuilt for an upgrade or migration, all data files are lost and the full data set must be pulled again. Only by persisting the data files could we avoid losing them and update them with incremental pulls, thereby shortening the data-subscription step and the context-loading phase, and improving service availability.

We persist the data files by mounting external data volumes. An external volume is independent of the container's file system, so rebuilding the container does not affect the files in it, guaranteeing persistence.
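
A minimal sketch of this setup is shown below: the subscription data directory is mounted from an external volume so that it survives container rebuilds. The volume is declared through a PersistentVolumeClaim as an example; the concrete volume type, claim name, and paths depend on what the platform provides.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ad-online-service                  # illustrative workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ad-online-service
  template:
    metadata:
      labels:
        app: ad-online-service
    spec:
      containers:
      - name: server
        image: example.registry/ad/ad-online-service:latest
        volumeMounts:
        - name: byteflood-data
          mountPath: /data/byteflood       # illustrative path for the subscribed data files
      volumes:
      - name: byteflood-data
        persistentVolumeClaim:
          claimName: byteflood-data-pvc    # illustrative claim; data survives container rebuilds
```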


Figure 5-3 Persisting subscription data files with a mounted volume

After switching to volume mounts, however, we hit a new problem: inconsistent paths. Because of TKE platform conventions, the mount path differs from the path used on physical machines. To keep the cloud and physical-machine service configurations identical, we wanted a soft link that points the configured path at the mount path. But the mount happens when the container starts, and the configured path must already contain the data files before the service process starts. If the soft link were maintained manually after the Pod came up, it might take effect only after the service process had already tried and failed to read the data, and it would also be lost whenever the container was rebuilt. We therefore replaced the container's entrypoint, the command invoked when the container starts, with our own startup script: it first creates the soft link and then starts the service. The link is created automatically on every container start, including after a rebuild, with no manual work, which solves the path problem.
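
Sketched in the Pod template (continuing the illustrative example above, with made-up paths and binary name), the replacement entrypoint looks roughly like this; the link is recreated on every container start, and exec hands control over to the service process:

```yaml
    spec:
      containers:
      - name: server
        image: example.registry/ad/ad-online-service:latest
        # Replace the entrypoint with a small startup script: first point the path
        # expected by the service configuration at the actual mount path, then
        # start the service itself.
        command: ["/bin/sh", "-c"]
        args:
          - >
            ln -sfn /data/byteflood /usr/local/services/server/data &&
            exec /usr/local/services/server/bin/server
        volumeMounts:
        - name: byteflood-data
          mountPath: /data/byteflood
```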


Figure 5-4 A soft link points the configured path at the actual mount path

In the end, we persisted the data files and reduced service startup time from 5 minutes to 10 seconds, a remarkable improvement.

5.2.2 Reducing the Failure Rate of Access to Services

A service that is still loading or is being destroyed cannot handle requests; if the IPs of such containers remain in the load balancer's list, cluster availability drops. To reduce the failure rate of requests, we must ensure that only ready instances are registered with the load balancer, and the key to that is tying load-balancer registration to the service's life cycle: a container must not join the load balancer before the service finishes loading, and its address must be removed from the load balancer before the service is destroyed.

On the platform side, the company's load balancing service, Polaris, is already tied to the container life cycle: containers that are not ready are not added to Polaris, and a container's address is removed from Polaris when it is destroyed. The container life cycle, however, differs from the service life cycle. When the container reports Ready, the service may still be loading its context, so adding it to Polaris means it receives traffic it cannot serve; when the container is destroyed, the platform issues a removal request, but Polaris does not update its internal state immediately and may keep forwarding traffic to the already-destroyed container. We needed a way to make Polaris aware of the service life cycle and keep traffic away from instances that are still loading or being destroyed.

To keep a still-loading service out of the load balancer, we use the readiness check provided by the platform. The readiness check periodically probes the business TCP port to determine whether the service has finished loading; if not, the Pod is marked Unhealthy and the upstream Polaris is prevented from forwarding traffic to this container until loading completes. With readiness checks, no requests reach the Pod before the service is fully loaded.
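
A minimal sketch of the readiness check in the Pod template is shown below; the port and timing values are illustrative. The kubelet probes the service's TCP port, and the Pod only becomes Ready, and therefore only receives traffic, once the port accepts connections.

```yaml
    spec:
      containers:
      - name: server
        image: example.registry/ad/ad-online-service:latest
        ports:
        - containerPort: 8080              # illustrative service port
        readinessProbe:
          tcpSocket:
            port: 8080                     # probe the business TCP port
          initialDelaySeconds: 10          # give context loading a head start
          periodSeconds: 5                 # re-check every 5 seconds
          failureThreshold: 3              # mark Unhealthy after 3 consecutive failures
```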


Figure 5-5 Configuring the readiness check

To make sure the service's address is removed from the load balancer before the service enters the destroyed state, we use another TKE platform feature, the post script: the platform lets the business specify a script that runs before the container is destroyed. In this script we simply wait for a period of time, giving the upstream load balancer a chance to update its state before the container goes away. With the post script, the load balancer no longer forwards any requests to a container that is about to be destroyed.
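
In standard Kubernetes terms this post script corresponds to a preStop lifecycle hook, and that is how it is sketched below; the wait time is an illustrative value that should exceed the load balancer's update interval.

```yaml
    spec:
      terminationGracePeriodSeconds: 60    # must exceed the preStop wait
      containers:
      - name: server
        image: example.registry/ad/ad-online-service:latest
        lifecycle:
          preStop:
            exec:
              # Wait before the container is killed so that Polaris has removed this
              # instance from its routing table and stopped forwarding traffic to it.
              command: ["/bin/sh", "-c", "sleep 30"]
```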


Figure 5-6 Example of the post script

5.3 Connection Failure Optimization for High-Concurrency Services

Most online services for news video traffic must carry massive request volumes and handle high concurrency, such as the back-end service that balances quality and efficiency. While migrating these services to the TKE platform, failures rose noticeably as traffic increased; in particular, some services using short-lived connections saw large numbers of connection failures. We worked with the operations team and the TKE platform team to investigate and fix the problem, which let us move all of the news video team's online services to the cloud and strengthened everyone's confidence in migrating massive services.

The investigation showed that the increase in the error rate had two main causes.

  1. The kernel's traffic-statistics function occupies the CPU for long stretches, delaying network processing. Kubernetes uses the ipvs module to manage the NAT rules that forward traffic to containers, and ipvs enables traffic statistics by default: it registers a timer with the kernel to trigger the statistics work. When the timer fires, it holds a spin lock on the CPU while it walks the rules to collect statistics. Once a node hosts many Pods and therefore many rules, the timer occupies the CPU for a long time and delays the service process's handling of response packets.
  2. The number of connections exceeds the limit of the NAT connection-tracking table, so new connections are dropped. Kubernetes records connection information through the nf_conntrack module, which is implemented as a fixed-size hash table; once the table is full, new connections are dropped.


Figure 5-7 The connection table full error displayed in the service log

For the first cause, we wanted to turn off the ipvs traffic-statistics function. However, the cluster's tlinux kernel version differed from the public version and lacked the switch for this feature. The platform team set up a patch synchronization mechanism for new clusters, applied the public-version patch to the cluster nodes, and added a kernel parameter to toggle traffic statistics. After we turned the statistics off, the error count dropped significantly.


Figure 5-8 The effect after installing the patch and disabling the statistics function

For the second cause, we wanted to enlarge the connection-tracking table to avoid dropping connections. The platform responded quickly and adjusted the kernel parameter so the table is larger than the current number of connections. After the change, the service log no longer printed "table full" and the error count fell sharply.


Figure 5-9 The number of errors soars after increasing the TKE traffic weight


Figure 5-10 The number of errors dropped to almost zero after resizing the hash table

With the TKE platform's help, we solved the connection failures under high concurrency during the migration and successfully moved the online services for news video traffic onto the TKE platform.

6. Results


Figure 6-1 Cloud migration results

| Comparison | Before migrating to the cloud | After migrating to the cloud |
| --- | --- | --- |
| Resource allocation | Manual application, low flexibility | Adjusted flexibly |
| Resource management | Manual | Allocated automatically by the platform |
| Resource utilization | Often wasteful | Scale in at troughs, scale out at peaks, fully utilized |
| Focus | Both machines and services | Services only |

Table 6-1 Comparison before and after cloud migration

The news video advertising team has actively embraced cloud native. With the cooperation and support of the TKE platform and the operations team, all 150+ back-end services have been moved to the cloud, with a cumulative total of more than 6,000 cores running on the cloud, greatly improving operations efficiency and resource utilization: resource utilization has improved by up to 10x and operations efficiency by more than 50%. During the migration we adapted our services to cloud native according to their characteristics and accumulated several sets of migration practices for massive online services, offline services, and more, which effectively improve the efficiency of running services on the cloud.

