Author

Zhang Yu joined Tencent in 2015 and works on Tencent Advertising operations and maintenance. In 2020 he began leading the Tencent Advertising technology teams onto the company's TKEx-teg platform, building out Tencent Advertising's own containerization solution based on the business's daily pain points combined with the native capabilities of Tencent Cloud.

Project Background

Tencent Advertising carries all of Tencent's advertising traffic and also accepts requests from external alliance partners. With traffic growing across every scenario, how to allocate resources quickly, and even schedule them automatically, after a sudden traffic spike has become a problem the advertising team must solve. In particular, this year's striped disaster-recovery optimization of the overall advertising architecture (delivery and serving) depends even more heavily on allocating resources on demand and by region.
Within advertising, the serving system carries the entire ad serving function, and its stability directly determines the revenue of Tencent Advertising as a whole. The architecture diagram is as follows:

Business characteristics:

  • Large request volume: nearly 100 billion requests per day on average, on machines that account for more than 60% of AMS's own fleet. Even small fluctuations in overall performance translate into large numbers of machines being added or removed.
  • Complex call-chain topology and extreme performance pressure: the entire serving chain involves more than 40 sub-modules, and within a window of only 100 to 200 milliseconds (requirements vary by traffic source) all of them must be traversed to compute the best advertisement.
  • Compute-intensive: CPU core binding and hyper-threading disabling are used extensively to cope with the pressure of searching through millions of advertising orders.

Cloud solution selection

In 2020 Tencent Advertising moved to the cloud at scale, mainly on AMD SA2 CVM cloud hosts, and completed compatibility work and debugging for the network, the company's shared components, and advertising's own components. On that foundation, cloud-native adoption on the CVM-based nodes also began: tuning and business rollout, elastic scaling, Dockerization, extensive use of various PaaS services, and full use of the cloud's advanced capabilities.
The following is the TKE structure used in advertising:

  • Preliminary resource preparation (upper left): apply for resources such as CVM and CLB from Tencent's internal cloud portal, and at the same time apply for the subnet segments required by the master, nodes, and pods (subnets are zone-specific, e.g. Shenzhen Guangming, so note that nodes and pods must be allocated from segments in the same zone). Both CVM and CLB are imported into TKEx-teg, and when FIP mode is selected, each created pod obtains its own EIP from the allocated subnet.
  • Repository and image usage (upper right): the base image (mirrors.XXXXX.com/XXXXX/XXXXXXX-base:latest) is provided by the advertising operations side. Starting from the base image, each business pulls its git repository and builds its own business image through Blue Shield.
  • Container usage (lower part): through the TKEx-teg platform, business images are pulled to start containers, which are then exposed externally via CLB, Polaris, and other services.
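
To make this usage mode concrete, the following is a minimal, hypothetical workload sketch; the workload name, namespace, image path, port, and replica count are placeholders and not the actual advertising configuration.

# Minimal sketch of a TKE workload built on the advertising base image.
# All names and values are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ad-demo-service        # hypothetical workload name
  namespace: ad-demo           # hypothetical namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: ad-demo-service
  template:
    metadata:
      labels:
        k8s-app: ad-demo-service
    spec:
      containers:
      - name: ad-demo-service
        # business image built from the base image via Blue Shield
        image: mirrors.XXXXX.com/XXXXX/ad-demo-service:latest
        ports:
        - containerPort: 8080  # hypothetical service port, exposed via CLB/Polaris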

The containerization process (difficulties and how we addressed them)

Difficulty 1: Versatility
(1) Facing 84 technical teams within advertising, how to adapt to every business
(2) Image management: keeping the base environment transparent to business teams
(3) A Tencent Advertising container configuration specification
Difficulty 2: CPU-intensive retrieval
(1) Number of advertising orders: in the millions
(2) Core binding: CPU core isolation between applications
(3) Hyper-threading: disabling hyper-threading
Difficulty 3: High availability during stateful service upgrades
(1) Continuous availability of advertising resources during container upgrades
(2) Continuous high availability during iteration, destruction, and rebuild

Versatility

1. Introduction to the Advertising Base Image

The advertising operations side provides a set of base images covering most application scenarios. They are built on XXXXXXX-base:latest, which integrates the environment configurations, basic agents, and business agents that advertising originally ran on physical machines.
On top of this base image, several business-environment images are provided. The image list is as follows:

mirrors.XXXXX.com/XXXXX/XXXXXXX-base:latest
mirrors.XXXXX.com/XXXXX/XXXXXXX-nodejs:latest
mirrors.XXXXX.com/XXXXX/XXXXXXX-konajdk:latest
mirrors.XXXXX.com/XXXXX/XXXXXXX-python:3
mirrors.XXXXX.com/XXXXX/XXXXXXX-python:2
mirrors.XXXXX.com/XXXXX/XXXXXXX-tnginx:latest

The specific image usage is as follows:

In the advertising base image, systemd is not used (because of the permission set configuration); instead a startup script runs as PID 1. Startup scripts for the company-wide general agents and the advertising-specific agents are built into the base image, and each business can choose whether to invoke them from its own startup script when the container starts.
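
A rough sketch of what this looks like at the workload level is shown below, assuming a hypothetical script path; the real base image defines its own entrypoint, so this is illustrative only.

      containers:
      - name: ad-demo-service
        image: mirrors.XXXXX.com/XXXXX/XXXXXXX-base:latest
        # The startup script runs as PID 1 and decides whether to launch the
        # general Tencent agents and the advertising-specific agents.
        # The path below is a hypothetical placeholder.
        command: ["/usr/local/services/start.sh"]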

2. Containerized CI/CD

Previously we relied heavily on the CD capabilities of other platforms, which can no longer be used after moving to TKE. The continuous-integration capability built into TKEx-teg is weak for fully automated pipelines and still requires manual steps. Therefore advertising adopted Tencent's internal continuous integration and continuous deployment solution: Blue Shield.

Releases run through the pipeline end to end, with no manual involvement other than review, reducing the impact of human error.

stage1: triggering, which can be manual, automatic on git events, scheduled, or remote

  • Manual trigger: self-explanatory; someone clicks to start the pipeline.
  • Automatic trigger: the pipeline fires automatically when a merge lands in git, suitable for agile business iteration.
  • Scheduled trigger: the pipeline starts at a fixed time each day. This suits large modules co-developed by an oteam, which iterate once per predetermined period, with all participants confirming the changes included in that iteration.
  • Remote trigger: used together with other external platforms, for example the advertising release-review mechanism on its own platform (Leflow), which can remotely trigger the whole pipeline once the release review is complete.

stage2 & stage3: continuous integration, with custom compilation after pulling from git

Blue Shield provides a default CI image for compilation; if no binary compilation is needed (e.g. PHP, Java, Node.js), the default can be used. Tencent Advertising's back-end businesses use Blade heavily, so the build image is usually mirrors.XXXXXX.com/XXXXXX/tlinux2.2-XXXXXX-landun-ci:latest, provided by the Tencent Advertising performance team and integrating the environments and configurations Tencent Advertising needs during continuous integration.
After compilation, the image plugin builds the image from the Dockerfile in the git repository and pushes it to the registry, while also keeping a copy in Zhiyun.

stage4: gray (canary) release of one online set, used to observe metrics under gray traffic. The cluster name, namespace name, and workload name identify the workload whose image tag is updated, authenticated with an internal TKEx-teg token.

stage5: after confirming that stage4 looks good, start the full rollout, verifying and confirming each step.

stage6 & stage7: statistics.

Blue Shield also has a group-robot notification feature, which can push customized pipeline status information to a designated WeCom (Enterprise WeChat) group for everyone to confirm and review.

3. Tencent Advertising Container Configuration Specification

The host machines inside advertising are all Tencent Cloud Xinghai AMD (SA2): a 90-core hyper-threaded CPU + 192 GB memory, with a 3 TB high-performance cloud disk. For everyday use this is currently the largest model Tencent Cloud offers (SA3 has been tested, and its top configuration will be even larger).

  • Businesses are therefore advised not to use too many cores per pod (for example, more than 32). Because of TKE's default affinity settings, each container is scheduled onto the idlest node wherever possible, so in the middle and late stages of cluster usage (for example, once 2/3 of the cluster is in use) fragmentation makes pods with more than 32 cores almost impossible to schedule for scale-out. If a service can scale horizontally, it is recommended to split the original high-core service into more low-core pods (halve the cores per pod and double the number of pods).
  • When creating a workload, mount an emptyDir temporary volume at the log directory, so the directory does not lose data during upgrades and can be used for later troubleshooting. (Destroy-and-rebuild will still delete all files in this directory.)

For a workload that is already online, you can modify the yaml to add the mount:

        volumeMounts:
        - mountPath: /data/log/adid_service
          name: adid-log
      volumes:
      - emptyDir: {}
        name: adid-log
  • Tencent Advertising makes heavy use of striping, meaning services are not limited to regions such as Shanghai, Shenzhen, and Tianjin; striping allows a finer-grained split. For example, a Shanghai-Nanhui deployment is scoped to a single data center, which enables disaster tolerance (most network failures happen at the data-center level, so traffic can quickly be switched to another route). It also reduces latency: the two Shanghai data centers are about 3 ms apart because of the distance, and for large packets the cross-data-center latency is amplified, producing a gap of about 10 ms in advertising.

Therefore, when advertising uses striping on TKE, labels are used to pin the data-center selection. Tencent Cloud already labels the CVMs in each data center by default, so these labels can be used directly.

An existing workload can also be modified via yaml to enforce this scheduling:

      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - "370004"
  • For advertising's internal back ends, 4-16-core container configurations are recommended; most front-end platforms use 1 core. This ensures that emergency scale-out is still possible even when cluster utilization is high. If you also want to spread pods apart with anti-affinity, the following configuration can be used (values is the specific workload name):
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values:
                  - proxy-tj
              topologyKey: kubernetes.io/hostname
            weight: 100

4. HPA settings

When using containers, there are two ways to handle sudden increases in business traffic.

  • Set the container's request and limit. The request can be understood as the amount guaranteed 100% to the business, while the limit is the oversold portion, shared from a buffer pool. The advantage is that each business only needs to request what it normally uses; when a traffic spike occurs, the resources between request and limit absorb the load beyond the request.

Note: overselling is not a cure-all; it has two obvious problems:

  • 1) If the resources remaining free on the current node are less than the configured limit, pods will automatically be migrated to other nodes. 2) The core-binding feature requires a specific QoS class, which means request and limit must be set to identical values.

  • Configure automatic scaling: set the threshold according to each service's performance bottleneck, and scale out automatically when it is reached.

Most businesses are CPU-bound, so the usual approach is to trigger scale-out on CPU utilization measured against the request.
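
A minimal sketch of such an HPA follows, assuming a hypothetical workload name and illustrative thresholds (TKEx-teg exposes this through its console, so the raw object below is only for illustration):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: ad-demo-service-hpa      # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ad-demo-service        # hypothetical workload
  minReplicas: 4
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when CPU usage exceeds 70% of the request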

Retrieval of millions of advertising orders

1. Advertising core search module

Advertising groups each traffic source into site sets, with each site set split into different sets to isolate the impact of each traffic source and its latency requirements. In 2020 we took one set of each module and migrated it to CVM in the cloud. Building on that, in 2021 we containerized the core module sunfish and moved it to the cloud. This module is characterized by highly CPU-intensive retrieval, so it cannot use hyper-threading (hyper-threading scheduling increases latency), and its internal programs are pinned to cores (to reduce CPU scheduling across processes).

2. Core binding in containers

This is the most distinctive requirement in advertising, and also the biggest difference between TKE and CVM/physical machines.

On CVM or physical machines, virtualization exposes correct per-core CPU information in /proc/cpuinfo, so the original core-binding flow read the CPU count and details from /proc/cpuinfo and bound each program accordingly.

Inside a container, however, the CPU information deviates significantly: /proc/cpuinfo is numbered according to the container's own core count, and this ordering does not match the container's real CPU numbering on the host. The real CPU numbering must be read from /sys/fs/cgroup/cpuset/cpuset.cpus, as the following two examples show:

CPU numbering shown in /proc/cpuinfo (not real):

CPU numbering shown in /sys/fs/cgroup/cpuset/cpuset.cpus (real):

As the two figures above show, /proc/cpuinfo only numbers the allocated CPU cores sequentially and does not correspond to the host's real core numbering. So during core binding, binding core number 15 actually binds the host's core 15, which may not even be one of the CPUs allocated to the container.

Therefore, the host's actual CPU numbers must be read from /sys/fs/cgroup/cpuset/cpuset.cpus, as shown in the second figure above, and used for core binding. The following commands can be added to the startup script to convert the real CPU numbers into a convenient format for binding:

# Read the real host CPU numbers allocated to this container, e.g. "2-5,10-11".
cpuset_cpus=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus)
# Split the comma-separated list so each element can be inspected.
cpu_info=$(echo ${cpuset_cpus} | tr "," "\n")
for cpu_core in ${cpu_info};do
  # Expand range entries such as "2-5" into "2,3,4,5".
  echo ${cpu_core} | grep "-" > /dev/null 2>&1
  if [ $? -eq 0 ];then
    first_cpu=$(echo ${cpu_core} | awk -F"-" '{print $1}')
    last_cpu=$(echo ${cpu_core} | awk -F"-" '{print $2}')
    cpu_modify=$(seq -s "," ${first_cpu} ${last_cpu})
    cpuset_cpus=$(echo ${cpuset_cpus} | sed "s/${first_cpu}-${last_cpu}/${cpu_modify}/g")
  fi
done
# Export the fully expanded list so the binding logic can read it from the environment.
echo "export cpuset_cpus=${cpuset_cpus}" >> /etc/profile

Running source /etc/profile then loads the environment variable; for example, a range such as 2-5,10-11 is converted into 2,3,4,5,10,11. The converted format looks like this:

Note: core binding depends on the QoS configuration (that is, request and limit must be set to the same values).
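
As an illustration, the resource section then looks like the sketch below (the values are arbitrary examples); with requests equal to limits the pod falls into the Guaranteed QoS class, which is the precondition for stable CPU allocation and core binding.

        resources:
          requests:
            cpu: "16"            # example value; must equal the limit
            memory: 32Gi
          limits:
            cpu: "16"
            memory: 32Gi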

3. Turn off Hyper-Threading

Hyper-threading is enabled in most scenarios but needs to be turned off for compute-intensive workloads. Our solution is to request CVMs with hyper-threading disabled at application time.

The hyper-threading-disabled machines are then tainted and labeled so that ordinary scheduling never lands on them. When a workload needs these resources, it adds a toleration and the matching label in its yaml and is scheduled onto the hyper-threading-disabled machines.
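
A hedged sketch of what this looks like in the workload yaml, assuming a hypothetical taint key and node label (the actual keys used internally may differ):

    spec:
      tolerations:
      - key: "ht-off"            # hypothetical taint key on hyper-threading-disabled nodes
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      nodeSelector:
        ht-off: "true"           # hypothetical node label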

Hyper-threading-off setting when applying for Yunti resources:

High availability in stateful service upgrade

Upgrading a stateless container is the simplest case: as long as the service port is available, the container is considered available.

Starting a stateful service is more complicated, because the state must be prepared in the startup script before serving. For advertising this mainly means pushing and loading advertising-order resources.

1. The continuous availability of advertising resources during the container upgrade process

The biggest difference between a container upgrade and a physical-machine upgrade is that the old container is destroyed and a new container is pulled from the new image to serve traffic, so the old container's disk, processes, and resources are destroyed with it.

However, the advertising-order resources number in the millions; re-pulling these files after every upgrade would make startup far too slow, so we added temporary mount directories to the container.

<img src="https://main.qcloudimg.com/raw/a4579ab4826d06e19e688e283ed2fee3.png" style="zoom:67%;" />

<img src="https://main.qcloudimg.com/raw/6865d1b811ed0d3060083b65d22a5ee6.png" style="zoom:67%;" />

With this mount, the container keeps the files in those directories across upgrades and does not need to re-pull them. Note that emptyDir is preserved only in the upgrade scenario; destroy-and-rebuild still wipes it and the files must be pulled again. For an existing service, the yaml can be modified directly as follows:

        volumeMounts:
        - mountPath: /data/example/
          name: example-bf
      volumes:
      - emptyDir: {}
        name: example-bf

2. High availability of business during the upgrade process

During business iteration there are in fact two problems that cause the business to serve degraded traffic.

  • When a load balancer is associated with the workload, the container is marked Running as soon as it starts executing the first line of its startup script, and it is immediately added to the associated load balancer even though the business process is not yet ready. Stateful services in particular must finish preparing their state before they can serve. At that moment the business reports errors because an unavailable instance has been added.
  • During an upgrade (other than in Deployment mode), the old container is destroyed first and the new container is then pulled up. The problem is that on upgrade the pod is removed from the associated load balancer and then immediately destroyed. If the upstream calls it through L5, the upstream cannot synchronize quickly enough with the pod's removal and keeps sending requests to the already destroyed downstream container, so the whole business reports errors.

So our main ideas here are:

  • Bind the readiness of the business to the state of the container.
  • During upgrade or destroy-and-rebuild, can we run a post-stop hook? This lets us execute some logic before destruction; the simplest is to sleep for a while.

Here we introduce two upgrade concepts for the business:

  • Readiness probe
  • Post-stop script (preStop hook)

    1) Readiness probe
    When creating the workload, enable readiness detection on the service port, so that the pod is added to the associated load balancer only after the service port is up.

    <img src="https://main.qcloudimg.com/raw/15cdd81315a26cab80a6fb26eefe8700.png" style="zoom:67%;" />
    You can also modify the yaml of an existing workload:

      readinessProbe:
          failureThreshold: 1
          periodSeconds: 3
          successThreshold: 1
          tcpSocket:
            port: 8080
          timeoutSeconds: 2

When a status similar to "unhealthy" appears, the container has started and is waiting for the service port to become available.

2) Post-stop script (preStop hook)

The core function of the post-stop script is to perform a series of business-defined actions after the pod is removed from the associated load balancer and before the container is destroyed.

The execution order is: submit a destroy-and-rebuild / upgrade / scale-in operation → remove the pod from Polaris/L5/CLB/Service → run the post-stop script → destroy the container.

The simplest use is for upstreams that call via L5: sleep 60 s after the L5 entry is removed, so the upstream has time to learn that the pod is gone before destruction proceeds.

        lifecycle:
          preStop:
            exec:
              command:
              - sleep
              - "60"

        lifecycle:
          preStop:
            exec:
              command:
              - /data/scripts/stop.sh

Long-term experience shows that the main problem is with L5. For a high-traffic service, sleeping up to 60 s is enough; for a low-traffic service, if you want no errors at all, a 90 s sleep is needed.

Results

CPU usage and latency comparison between CVM and TKE

Here, with identical configurations, CPU usage and latency are compared between an ordinary CVM machine and a TKE container. There is essentially no significant difference, and latency is unchanged.

CVM:


TKE:

Overall benefits

Concluding remarks

Unlike other cloud-native write-ups that start from the infrastructure layer, this article introduces the advantages and usage of cloud native for large-scale online services from a business perspective, combining Tencent Advertising's own characteristics and strategies to describe its containerization practice in high-concurrency, automation-heavy scenarios.
For a business team, all work comes down to quality, efficiency, and cost, and cloud native improves all three. We hope to share more from Tencent Advertising in the future.


Container Service (Tencent Kubernetes Engine, TKE) is Tencent Cloud's one-stop, Kubernetes-based cloud-native PaaS platform. It provides users with enterprise-grade services covering container cluster scheduling, Helm application orchestration, Docker image management, Istio service governance, automated DevOps, and a full set of monitoring and operations capabilities.
