Author
Tang Yingkang is a senior engineer at Tencent and a PMC member of the Kubernetes open source collaboration, responsible for containerization and cloud-related work in the TEG Information Security Department.
Introduction
As of May 2021, the TEG Principal O&M team has completed the 0-to-1 platform capability building for TKE containers. The total cluster scale exceeds 600,000 cores, with clear benefits in resource cost, service quality, and operational efficiency.
This article introduces the ideas and process behind Principal's TKE containerization, shares the problems encountered at each stage and their solutions, and aims to provide a reference for other teams moving to the cloud.
Background
Principal is committed to providing professional content security solutions and handles a large number of business security needs both inside and outside the company. With rapid business growth, the scale of Principal's internal resources keeps increasing, exposing considerable room for optimization in machine cost, service quality, and delivery efficiency. Since the company began advocating open source collaboration and self-developed services on the cloud, TKE (Tencent Kubernetes Engine) has become the preferred internal path to the cloud, and container technology centered on Docker and Kubernetes has become the industry's mainstream choice for architecture upgrades.
Therefore, in the first half of 2020, Principal, together with the TKE, Zhiyan, and operation management teams, started building the Principal container platform.
Construction ideas and overall planning
Principal's thinking and process for container platform construction can be summarized as "one direction, three stages":
[One direction] Align with the cloud native concept, select solutions from the cloud native open source ecosystem, and keep up with the industry's direction for advanced infrastructure upgrades;
[Three stages] Cloud native rests on a troika: container cloud + DevOps system + microservices, corresponding to the three stages of the containerization journey.
- Phase 1: basic platform construction, introducing container technology into the service architecture and adapting to it, completing the 0-to-1 capability building, business transformation, and architecture upgrade;
- Phase 2: operational capability iteration, focused on improving R&D efficiency around the container platform, strengthening data operation capabilities, and establishing a stability guarantee system;
- Phase 3: cloud native maturity improvement, using the maturity score from the company's cloud native maturity model as a lever to drive continuous optimization of the existing architecture.
Basic platform construction
Platform capacity building
With the assistance of the TKEStack team, Principal completed the construction of independent CPU clusters in Shenzhen, Guangzhou, Shanghai, and Nanjing, and with the TKEStack console quickly gained basic container release and management capabilities.
In addition to these general capabilities, the main adaptations for Principal-specific needs were (an illustrative manifest follows the list):
- Fixed instance IP: implemented with the FloatingIP plug-in; when creating the container, select floating IP as the network mode and set the IP recycling policy to "recycle when scaling down or deleting the app"
- CMDB synchronization: implemented with the CMDB controller plug-in; after an instance starts, the plug-in automatically registers the pod IP to the business module specified in the annotation
- L5/Polaris service registration: implemented with the LBCF plug-in; once an instance is running, its IP and designated port are registered to L5/Polaris, and when the instance is deleted it is taken offline from L5/Polaris
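For intuition, here is a hedged sketch of how such plug-ins are typically attached to a workload via pod annotations. The annotation keys and values below are illustrative placeholders, not the plug-ins' confirmed APIs; the real keys belong to the FloatingIP, CMDB controller, and LBCF plug-ins and may differ:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-service              # hypothetical workload name
spec:
  serviceName: demo-service
  replicas: 2
  selector:
    matchLabels:
      app: demo-service
  template:
    metadata:
      labels:
        app: demo-service
      annotations:
        # Illustrative placeholder keys only:
        example.tke.io/network-mode: "floating-ip"                  # fixed instance IP
        example.tke.io/ip-release-policy: "on-scale-down-or-delete" # IP recycling policy
        example.tke.io/cmdb-module: "principal/demo"                # CMDB target module
    spec:
      containers:
      - name: demo
        image: demo-service:latest
        ports:
        - containerPort: 8080     # the port LBCF would register to L5/Polaris
```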
GPU clusters have also been built and put into use in Shenzhen and Shanghai. We also designed a Virtual Kubelet-based scheme that registers the CPU resources of the GPU clusters into the CPU clusters for scheduling, so that idle CPU resources on GPU machines can be better reused.
Business containerization transformation
With the platform capabilities in place, the next step was to containerize the services deployed on existing IDC/CVM machines. The main difficulties were:
- Service diversity: stateless, stateful, and GPU services; a single service package can be very large (up to dozens of GB), and each service carries its own dependent environment
- Many R&D and operation process nodes to cover: development transformation, test process specifications, Dockerfile image building, CI/CD processes, operation and maintenance change specifications, and so on
We mainly adopted the following measures (a sketch of the resulting workload shape follows the list):
- First, sort out the transformation process for services in different scenarios and formulate unified in-house development, testing, and operation and maintenance specifications
- For the base image, build it from Principal's standardized machine environment to reduce the chance of service anomalies caused by environment differences
- Abstract the Dockerfile into templates, reducing the cost of service transformation (almost zero involvement from development students)
- Unify the start entry (ENTRYPOINT and CMD) of the container and start the service from a management script
- For services that need the container IP written into a configuration file before starting, do this in the management script as well
- For oversized software packages and images, split out large files such as data model files and pull them from the file distribution system when the service starts
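A minimal sketch of a service containerized under these conventions. All names and paths are hypothetical, and the model-fetching image stands in for our internal file distribution client:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-service
  template:
    metadata:
      labels:
        app: demo-service
    spec:
      initContainers:
      # Pull large model files from the file distribution system at startup,
      # instead of baking them into the image (client image is hypothetical):
      - name: fetch-models
        image: model-fetcher:latest
        args: ["--dest=/data/models"]
        volumeMounts:
        - name: models
          mountPath: /data/models
      containers:
      - name: demo
        image: principal-base:latest   # base image built from the standardized environment
        # Unified entrypoint: the management script writes the container IP
        # into the config file, then starts the service.
        command: ["/usr/local/services/manage.sh", "start"]
        volumeMounts:
        - name: models
          mountPath: /data/models
      volumes:
      - name: models
        emptyDir: {}
```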
Resource utilization optimization
On the one hand, each service is placed in its own namespace, and the namespace's ResourceQuota is used to cap the resources the service can use.
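A minimal sketch of such a per-service quota (the namespace and values are hypothetical):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: demo-service-quota
  namespace: demo-service        # one namespace per service
spec:
  hard:
    requests.cpu: "64"           # total CPU the service may request
    limits.cpu: "160"            # total CPU limit (leaves room for overselling)
    requests.memory: 128Gi
    limits.memory: 256Gi
```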
On the other hand, to improve overall cluster resource utilization, CPU resources are reasonably oversold when each service is actually deployed, i.e. the CPU limit is set higher than the request. The oversold ratio is generally decided from the service's current utilization, its target utilization, and the importance of the service, using the formula:
oversold ratio = target utilization / (current utilization × service weight)
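As a worked example with hypothetical numbers: a service currently running at 20% utilization, with a 50% target and a weight of 1, gets an oversold ratio of 0.5 / (0.2 × 1) = 2.5, i.e. its CPU limit can be set to 2.5 times its request.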
In this way, for services with long-term low utilization, host CPU utilization can be greatly improved without adjusting the number of instances.
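In the pod spec, the oversell simply shows up as a limit above the request; a sketch following the hypothetical 2.5× example above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-oversold
spec:
  containers:
  - name: demo
    image: demo-service:latest
    resources:
      requests:
        cpu: "4"        # what the scheduler reserves on the node
      limits:
        cpu: "10"       # 2.5x oversold: limit = request * oversold ratio
```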
Scheduling management
By enabling Kubernetes' native HPA and CA capabilities, elastic scheduling is achieved at two levels of the overall architecture, from the application layer down to the cluster layer.
Since we deploy across regions and multiple clusters, we need to configure and manage HPA for many services in many clusters, and regularly judge whether each service's HPA policy is still reasonable based on its actual resource usage.
To this end, we designed and developed an HPA management system, which achieves the following (a sample policy follows the list):
- Classified management of HPA policies: services with similar request fluctuations, scaling metrics, thresholds, and priorities share a scheduling class, so HPA policies can be created in batches
- Cross-cluster delivery: HPA objects are issued and configured through a unified portal, without configuring and managing each cluster separately, improving management efficiency
- Rationality checks: by regularly pulling service monitoring metrics, the system reports each service's actual resource usage, so the HPA policy can be evaluated and adjusted
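The policies the system issues are ordinary Kubernetes HPA objects. A minimal sketch using the autoscaling/v2 API (older clusters would use autoscaling/v2beta2; names and thresholds are hypothetical):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-service-hpa
  namespace: demo-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-service
  minReplicas: 4
  maxReplicas: 40
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # target CPU utilization for this scheduling class
```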
Operational capability iteration
DevOps R&D efficiency solution
[CI/CD]: Based on the existing Zhiyan CI pipeline, add the [Zhiyan image build] and [Zhiyan artifact storage] plug-ins to convert compiled software packages into images and push them to the unified software repository, connecting CI with TKE
[Monitoring]: Inject the monitoring agent into the base image and start it from the management script
[Logging]: Business logs are reported directly to Zhiyan; Kubernetes cluster-level events are persisted to Elasticsearch through the event persistence component for later querying
[Configuration management]: Inject the Colorful Stone (Qicaishi) configuration center agent into the base image, start it from the management script, and have services pull their configuration from it
Data operation system construction
To better fit business operation scenarios, we designed and developed the kbrain system: its backend connects to the Kubernetes clusters, pulls basic data, processes it, and stores it in CDB; Grafana dashboards then present business-dimension views and reports, introducing valuable data such as health scores, cost benefits, oversold ratios, and resource-fragmentation instances as operational references.
For services already configured with an HPA elastic scaling policy, the effect can be checked on the HPA scheduling dashboard, which plots four curves: current utilization, target utilization, current replica count, and target replica count. The trend chart shows that when current utilization rises, the target replica count increases, and once scale-out completes, utilization returns to the normal level.
Stability guarantee system
For a large-scale basic platform like TKE, a complete stability guarantee system and supporting platform are indispensable. We consider the following aspects:
- SLA specification: as the business side, explicitly align the SLA with the TKE platform provider, including the definition of unavailability, failure response mechanisms and handling procedures, platform O&M change specifications, holiday arrangements, and escalation channels; this is the basis of stability
- Stability metrics: sort out all metrics that reflect platform and business stability, collect them in the backend, and display them on Grafana
- Capacity management: monitor the cluster's remaining resources in real time to avoid failures caused by insufficient capacity
- Contingency plans: enumerate all possible failures in advance and design emergency plans for rapid recovery
- Monitoring and alerting: check that clusters and services report effective monitoring metrics, so that alerts arrive as soon as a problem occurs
- Chaos engineering: work with the chaos engineering Oteam to build fault drill capabilities around TKE and test the reliability of the SLA, stability metrics, capacity system, and contingency plans
Cloud native maturity improvement
After the first two phases of basic platform construction and operational capability iteration, we have basically completed the 0-to-1 construction of container capabilities. The next step is to move the existing container architecture and business transformation model from "working" to "excellent", closer to a truly cloud-native direction.
Rich container slimming
In the early stage, with limited manpower, we adopted a rich-container approach to business transformation, that is, treating the container as a virtual machine and injecting too many basic agents into the container's base image.
Following the cloud-native concept and the correct way to use containers, we slimmed the containers down, splitting the original container into one business container plus several basic agent containers, and completely stripping large files out of the image, serving them instead through cloud disk, cloud storage, or the file distribution system.
After the split, a pod contains multiple containers. If containers need to share files (for example, the business process reads and writes the socket file of the Zhiyan monitoring agent), this can be implemented with an emptyDir volume mount, as sketched below.
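A minimal sketch of the split pod, assuming a hypothetical monitoring-agent sidecar that exposes a socket file over a shared emptyDir:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-slim
spec:
  containers:
  - name: business                 # business container only runs the service
    image: demo-service:latest
    volumeMounts:
    - name: agent-sock
      mountPath: /var/run/agent    # business process reads/writes the agent socket here
  - name: monitor-agent            # monitoring agent split into its own container
    image: monitor-agent:latest    # hypothetical agent image
    volumeMounts:
    - name: agent-sock
      mountPath: /var/run/agent
  volumes:
  - name: agent-sock
    emptyDir: {}                   # shared between containers in the same pod
```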
Small core transformation + Overlay network optimization
In a TKE container cluster, if most services use large-core configurations (for example 36c, 76c, or even larger), more resource fragments are produced: the remaining allocatable resources on some nodes (2c, 4c, 8c, etc.) can never actually be allocated, wasting resources. From a macro perspective, the container resource configuration of most services should therefore be cut as small as possible.
Before containerization, Principal's services were accustomed to, and optimized for, a multi-process architecture; a single machine often ran 48 or 76 processes (or even more). After containerization, because of overselling, the number of processes a service launches is derived from the CPU limit, which is much higher than the CPU request, so some processes behave abnormally (they cannot actually get a core to run on).
Therefore, the service's running configuration must be modified: switch to a small-core configuration and run in containers whose requests equal their limits.
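A sketch of the target shape: a small-core container with requests equal to limits, so the process count the service derives from its CPU allocation is always backed by real cores (values are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-small-core
spec:
  containers:
  - name: demo
    image: demo-service:latest
    resources:
      requests:
        cpu: "8"
        memory: 16Gi
      limits:
        cpu: "8"          # requests == limits: Guaranteed QoS, no oversell
        memory: 16Gi
```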
However, as containers shrink their core counts, the number of service instances grows. For example, switching from a 76-core configuration to 38 cores doubles the instance count; from 76 cores to 8 cores, the instance count grows almost tenfold. If services kept using the FloatingIP network architecture, this would put huge pressure on the cluster's underlying IP resources.
Generally speaking, switching to the Overlay network would suffice, but Overlay latency is roughly 20% higher, and because the pod IP is a virtual IP visible only inside the cluster, it cannot be registered to L5/Polaris and does not support monitoring.
After research and demonstration with the TKE team, the final design adopts an improved HostPort network solution based on Overlay: the instance pod still gets a virtual IP inside the cluster, but the pod's internal service port is mapped to a random port on the host, and the host IP plus that random port is registered to L5/Polaris. This achieves service registration while keeping the extra latency to a minimum.
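A rough sketch of the idea using the standard hostPort field. This is only an approximation: in the actual solution the host port is assigned at random by the platform's network plug-in rather than fixed in the spec, and LBCF then registers host IP + port to L5/Polaris:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-hostport
spec:
  containers:
  - name: demo
    image: demo-service:latest
    ports:
    - containerPort: 8080   # service port inside the pod (Overlay virtual IP)
      hostPort: 31080       # illustrative; really assigned randomly by the plug-in
```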
For the pilot service (vulgar-content recognition), we first completed the deployment and testing of the small-core transformation plus Overlay network, and verified that throughput and latency under the Overlay + HostPort network are close to FloatingIP.
Summary of achievements
At present, Principal has basically completed the two phases of basic platform construction and operational capability iteration, and is working to improve cloud native maturity. As of May 2021:
1) The Principal TKE clusters have reached a cumulative scale of 620,000+ cores, with more than 2,000 GPU cards containerized
2) 40+ Principal services have been migrated, about 33% of the services to be transformed
3) Cumulative cost savings are about 200,000 cores, with significant improvements in quality and efficiency as well
Concluding remarks
There is a famous saying in the technology world: "People tend to overestimate the impact of a new technology in the short term, but greatly underestimate its long-term impact."
A new technology must go through a process from creation to implementation. It requires a relatively large investment in a short period of time, and the architecture upgrade will also bring pain to the team's other projects and people; but as long as the direction is clear and we move toward it firmly, we are bound to see qualitative change and reap the rewards we hope for.
The road is long; may everyone go further and better on the road of cloud native construction!
Container Service (Tencent Kubernetes Engine, TKE) is Tencent Cloud's one-stop cloud native PaaS platform based on Kubernetes. It provides users with enterprise-grade services that integrate container cluster scheduling, Helm application orchestration, Docker image management, Istio service governance, automated DevOps, and a full set of monitoring and O&M systems.