Author
Zhang Lu, an expert operations development engineer, is currently responsible for designing and optimizing the backend architecture of the Game Zhiji AI assistant.
Game Zhiji
As the business continues to expand, the Game Zhiji AI question-answering robot now covers self-developed games, second-party games, and overseas games. The Game Zhiji R&D team has actively embraced cloud native and moved all backend services to the cloud, with a cumulative total of more than 10,000 (1w+) cores.
Through containerized deployment, automatic scaling, health checks, and observability on the cloud, the continuous delivery capability and stability of the Zhiji project have improved, and a set of cloud practices suited to Zhiji has taken shape. This article introduces the pain points encountered in the Game Zhiji project and explores a reliable set of cloud migration practices.
Zhiji project background
Game Zhiji is an intelligent AI product and operations solution for games. Built on technologies such as natural language processing, knowledge graphs, and deep learning, it provides one-stop services for gamers, including real-time intelligent Q&A inside and outside the game, in-game voice companionship, self-service transaction-record queries, data exchange inside and outside the game, proactive care to prevent player churn, and product compliance protection. It has been connected to 80+ games, including six-star titles such as King of Glory, Peace Elite, PUBG mobile, and Tiandao mobile game. It serves hundreds of millions of game users at home and abroad and has earned sustained praise from many game projects and users.
Game Zhiji also provides an easy-to-use, high-performance client SDK and a full-featured operations platform that supports modular access, which significantly reduces the labor cost of user operations and improves players' interactive experience.
As Zhiji's business has grown, its deployment architecture has evolved as well, gradually migrating from the original IDC deployment architecture to the current cloud-native deployment architecture, so that all business services now run on the cloud.
Zhiji before going to the cloud
Docker deployment scheme
Zhiji initially deployed its services with a docker-based scheme. CI/CD was implemented through the Quark platform, which pushed the compiled and packaged services to the docker machines for deployment. To scale machines horizontally, the operations engineers packaged the whole docker environment as a benchmark image, including everything the IDC machine environment depends on, such as the CL5 agent and the gse agent. When expansion was required, the benchmark image was released to the new machines.
The overall deployment architecture of Zhiji is shown in the following figure:
- External requests enter uniformly through stgw and are forwarded to the VIPs of the backend services, which are usually split by carrier: China Mobile, China Unicom, China Telecom, and small-traffic carriers;
- The IPs and ports of the machines mounted under each VIP are configured on the tgw platform, and requests are sent to the backend services on the IDC machines according to a load-balancing strategy;
- CI/CD runs on the Quark platform, which compiles, packages, and releases the services and also supports rollback, process monitoring, and so on;
- Monitoring, alarms, and logs are connected to the mo monitoring platform and Junying.
Problems encountered with the service
The Zhiji project connects with many games under IEG. As more games have been connected, traffic has grown steadily. Zhiji's traffic has the following characteristics:
- Traffic is usually stable, but on holidays it rises with game traffic, usually to more than 3 times the normal level;
- Proactive-care message pushes and operational activities reach players directly through Zhiji and bring considerable burst traffic, more than 10 times the normal level in extreme cases.
Zhiji therefore has high requirements for service stability, observability, and service governance. The project must keep operating normally under sudden traffic, and faults must be detected in time.
Under the docker deployment architecture, fast automatic scale-out and scale-in is difficult to achieve. The main problems are as follows:
- Applying for machines and preparing the environment before expansion is time-consuming. Under sudden traffic this preparation time is unacceptable, while preparing machines in advance usually wastes resources;
- The benchmark image produced by operations is usually not the latest version, so the latest version must be released again before the capacity can be used;
- Dependency permissions (mysql, etc.) need to be applied for;
- Operating the platform is cumbersome and error-prone;
- Scaling the machines back in after an operational activity has to be done manually.
These problems make expansion too slow, which creates hidden risks for service stability and adds to the operations burden on the business team. In addition, the annual machine decommissioning is painful, involving machine confirmation, service migration, environment cleanup, and more.
We therefore hoped to migrate to the cloud and use cloud-native HPA capabilities to solve problems such as service stability and machine decommissioning.
Migration to the cloud
Zhiji cloud native solution
In response to the above problems, Zhiji implemented a cloud-native deployment solution that spans multiple machine rooms. The plan is as follows:
- Introduce the tapisix unified gateway and manage north-south traffic with gateway plug-ins such as rate limiting; stgw forwards to the ingress of the tapisix gateway;
- Deploy the services in Nanjing District 1 and Nanjing District 2. The services in each district expose external traffic through an ingress, and the tapisix gateway connects to the service ingresses of both District 1 and District 2;
- Add a Shanghai disaster-recovery cluster that can be switched in quickly in extreme cases;
- Implement CI/CD through the Blue Shield pipeline, which pushes the packaged service image to the image registry and deploys it on STKE.
With this deployment scheme, the problems above can be solved with cloud-native automatic scaling capabilities:
- The timed HPA and dynamic scaling provided by STKE address the stability problems caused by traffic surges during holidays and operational activities, and automatic scale-in after traffic stabilizes saves resources;
- STKE provides an automatic authentication process, which removes the need to apply for dependency permissions manually. The authentication process usually takes minutes;
- The tapisix unified gateway receives the traffic for each zone and can switch it quickly. When the service in one zone has a problem, traffic can be switched to another zone through tapisix routing.
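The zone-switching idea in the last bullet can be sketched in a few lines. This is only an illustration of the routing decision, not the gateway's actual configuration or API: the `Zone` class, the `pick_zone` function, and the zone names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """A deployment zone and its current health (hypothetical model)."""
    name: str
    healthy: bool = True


def pick_zone(zones, preferred):
    """Return the preferred zone if it is healthy, else the first healthy zone in list order."""
    by_name = {zone.name: zone for zone in zones}
    if preferred in by_name and by_name[preferred].healthy:
        return preferred
    for zone in zones:
        if zone.healthy:
            return zone.name
    raise RuntimeError("no healthy zone available")


# Hypothetical zone names; the disaster-recovery cluster is listed last
# so that it is only chosen when both primary zones are unhealthy.
zones = [Zone("nanjing-1"), Zone("nanjing-2"), Zone("shanghai-dr")]
print(pick_zone(zones, "nanjing-1"))

zones[0].healthy = False  # simulate a fault in District 1
print(pick_zone(zones, "nanjing-1"))
```

Listing the disaster-recovery cluster last keeps it out of the normal path while still making it reachable when both primary zones fail.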
Migration plan
The cloud migration of Zhiji involves many services on both the external network and the internal network. When migrating external network services, traffic can be shifted gradually by carrier (grayscale release):
- First, test the newly deployed service in the STKE cluster, which provides four types of public network CLBs: China Mobile, China Unicom, China Telecom, and CAP;
- Grayscale the CAP small-carrier traffic first, and then gradually grayscale the other carriers through gslb;
- To roll back, quickly switch back to the VIP of the IDC service through gslb.
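The carrier-by-carrier grayscale and the rollback step can be modeled as a table of percentages. The sketch below is an assumption-laden illustration: gslb's real interface is not described in this article, and the `rollout` table, `target_for`, and `rollback` are hypothetical names. Hash-based bucketing stands in for whatever splitting mechanism is actually used.

```python
import hashlib

# Hypothetical rollout state: percentage of each carrier's traffic sent to
# the new STKE CLB. The remainder stays on the old IDC VIP.
rollout = {"cap": 100, "mobile": 10, "unicom": 0, "telecom": 0}


def target_for(carrier, client_id, rollout):
    """Choose 'stke' or 'idc' for one client; the choice is stable across calls."""
    percent = rollout.get(carrier, 0)
    digest = hashlib.sha256(f"{carrier}:{client_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # value in 0..99
    return "stke" if bucket < percent else "idc"


def rollback(rollout):
    """Return a rollout table that sends every carrier back to the IDC VIP."""
    return {carrier: 0 for carrier in rollout}


print(target_for("cap", "client-42", rollout))     # CAP is fully migrated
print(target_for("unicom", "client-42", rollout))  # Unicom has not started
print(rollback(rollout))
```

Starting with the smallest carrier limits the blast radius of a bad release, and keeping rollback as a single table replacement mirrors the quick switch back to the IDC VIP.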
For the migration of intranet services, the Polaris and CL5 service support in STKE automatically injects the pod IPs into the load balancing of the old service. A single pod is used for the grayscale at first, then pods are added gradually until they carry the full traffic, and finally the IDC machines can be removed. In this way, we completed the cloud migration of all external network services and backend services within three months and kept the migration smooth.
Cloud practice
Standardized Deployment Practices
The foundation of moving a business to the cloud is standardized container deployment and elastic services. Zhiji has three main types of services: business services are usually Go services; algorithm services are C++ services, for which model loading must be considered; and platform services are mainly PHP services. For container standardization we adopt the single-container mode. Its advantage is that containers do not affect each other, the process runs as process No. 1 (PID 1) of the container so that k8s automatically pulls the service up again once there is a problem, and resources are easier to reuse. The rich-container mode puts all processes into one container. It seems convenient and allows a seamless, smooth, and fast migration to the cloud, but it has problems in terms of future maintenance efficiency, security, health checks, and service elasticity. It is an intermediate state that violates the single-responsibility principle of containers and does not conform to the cloud-native concept.
- The PHP service packages nginx, php-fpm, and the business code as separate containers, and the code is shared with the php-fpm container through file sharing.
- Go services are relatively simple and use the conventional standard layout of application container + sidecar.
- The algorithm service is mainly about the model. The model file is mounted on cephfs and shared with the container of the C++ service.
During service deployment, Zhiji has accumulated some practical experience. Taking advantage of cloud-native resource management improves resource utilization and reduces operating costs. The minimum instance configuration for different scenarios is as follows:
- The test and pre-release environments have little traffic and uniformly use 0.1 core, 0.25G, and 1 instance.
- In the production environment, business backend services use 1 core, 2G, and 2 instances.
- In the production environment, algorithm backend services use 8 cores and 16G (online inference services, for example, use machines with more than 32 cores).
Reducing the CPU and memory requests of a single pod still meets daily operation needs, and during peak traffic the HPA capability of STKE meets business needs, so daily CPU utilization can reach 40%. Because HPA scales the service containers out and in, traffic is lost if requests arrive before the service has started, or if a pod is destroyed while requests are still being served. It is therefore necessary to enable readiness checks and a preStop configuration.
Note that the startup delay of the readiness check should not be set too short. Otherwise the system will consider that the pod has failed to start and keep restarting it, so the service never starts normally.
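As a rough illustration of the readiness check and preStop settings discussed above, the sketch below builds a Kubernetes container fragment as a Python dictionary. The field names (`readinessProbe`, `initialDelaySeconds`, `lifecycle.preStop`) are standard Kubernetes fields, but the service name, image, `/healthz` path, port, and all timing values are assumptions for illustration, not Zhiji's actual settings.

```python
import json


def container_spec(name, image, port, startup_seconds, drain_seconds):
    """Build a container fragment with a readiness probe and a preStop hook."""
    return {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Traffic is only routed to the pod once this probe succeeds.
        "readinessProbe": {
            "httpGet": {"path": "/healthz", "port": port},  # hypothetical path
            "initialDelaySeconds": startup_seconds,  # allow time for startup
            "periodSeconds": 5,
            "failureThreshold": 3,
        },
        # Sleep before shutdown so in-flight requests can finish while the
        # pod is being removed from load balancing.
        "lifecycle": {
            "preStop": {
                "exec": {"command": ["sh", "-c", f"sleep {drain_seconds}"]}
            }
        },
    }


# Hypothetical values: a service that needs about 20 s to start and 15 s to drain.
spec = container_spec("zhiji-demo", "example.invalid/zhiji-demo:latest", 8080, 20, 15)
print(json.dumps(spec, indent=2))
```

The pod's `terminationGracePeriodSeconds` would need to be longer than the preStop sleep, because the hook's run time counts against the grace period.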
In addition, other features provided by STKE meet Zhiji's business needs well:
- It provides an authentication process that dynamically applies for mysql and other dependency permissions when a pod is pulled up, avoiding the cumbersome permission application process.
- configmap solves the problem of updating service configuration.
- The kernel can be tuned, so the business can optimize kernel parameters according to the characteristics of its services and traffic, such as net.core.somaxconn and net.ipv4.tcp_tw_reuse.
At present, Zhiji has deployed more than 10,000 cores online, supporting the Zhiji SDK, fifth person, and other application services, with an overall utilization rate of about 40%.
HPA
The HPA capability provided by STKE meets Zhiji's needs for scaling out and in. Zhiji uses both timed HPA and dynamic HPA to cover different scenarios:
- For burst traffic, Zhiji uses CPU and memory usage relative to the pod's requests as the conditions that trigger expansion;
- On holidays and on Friday and Saturday nights, when games open to minors, Zhiji uses weekly timed HPA to expand in advance.
This greatly reduces the mental burden on development and operations engineers when facing operational activities and sudden traffic, and it improves service stability. In particular, timed HPA easily meets Zhiji's scaling requirements for the protection of minors. The system completes the expansion and contraction of capacity within a specific time window, so traffic is handled smoothly without wasting resources. After migrating to the cloud, Zhiji used this method to ensure that multiple weekend and online operational activities went smoothly.
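To make the two HPA modes concrete, here is a minimal sketch. `desired_replicas` follows the replica formula documented for the standard Kubernetes Horizontal Pod Autoscaler, ceil(currentReplicas × currentMetric / targetMetric), clamped to a configured range. `scheduled_min_replicas` is a hypothetical stand-in for timed HPA. The 40% target, the replica bounds, and the 19:00 to 22:00 window are assumptions, and STKE's own implementation may differ.

```python
import math
from datetime import datetime


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas):
    """Kubernetes-style HPA estimate, clamped to [min_replicas, max_replicas].

    Utilization values are percentages of the pod's resource request.
    """
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


def scheduled_min_replicas(now, base_min, peak_min):
    """Hypothetical weekly schedule: raise the replica floor on Friday and Saturday evenings."""
    is_peak_day = now.weekday() in (4, 5)  # Monday is 0, so 4 = Friday, 5 = Saturday
    is_peak_hour = 19 <= now.hour < 22     # assumed peak window
    return peak_min if (is_peak_day and is_peak_hour) else base_min


# Dynamic HPA: 2 pods running at 120% of their CPU request, with a 40% target.
print(desired_replicas(2, 120, 40, min_replicas=2, max_replicas=50))  # 6

# Timed HPA: the replica floor on a Friday at 20:00 versus a Wednesday at 20:00.
print(scheduled_min_replicas(datetime(2024, 1, 5, 20, 0), base_min=2, peak_min=10))
print(scheduled_min_replicas(datetime(2024, 1, 3, 20, 0), base_min=2, peak_min=10))
```

Raising the floor ahead of a known peak means capacity is already in place when traffic arrives, while the dynamic rule still covers bursts outside the schedule.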
Observability
The observability of a system lets developers monitor it and locate problems quickly based on the system's output. Observability can be viewed from three aspects: Metrics, Log, and Trace.
- Metrics: most Zhiji services are connected to the Monitor system, and indicators such as inter-module call information, service status, and business metrics are monitored through custom metrics reporting. Zhiji wraps the standard Monitor library to standardize indicator templates and reporting. Monitor reporting has to obtain the reporting IP through an HTTP request and then send the data to Monitor over TCP. This form of reporting is not business-friendly, and Monitor no longer accepts new services. Metrics are therefore being migrated to the Zhiyan monitoring system, and trpc provides plug-in access to Zhiyan's monitoring capabilities.
- Log: filebeat was used to collect logs in the early days. STKE now provides CLS, a unified log data solution that makes it easy to collect, store, and retrieve logs, with lower operations cost and a better experience.
- Trace: Zhiji connects to Tianji Pavilion to trace requests and record context information such as the request link through the system. Marking and coloring requests by traceId greatly improves the efficiency of locating problems. On this basis, Zhiji is also trying new distributed application development components such as dapr. The non-intrusive observability access provided by dapr costs less than intrusive access methods such as Tianji Pavilion.
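The idea of marking requests with a traceId can be shown with the Python standard library alone. This sketch does not use Tianji Pavilion, dapr, or trpc, whose APIs are not described in this article. The logger name, the log format, and `handle_request` are hypothetical.

```python
import contextvars
import io
import logging
import uuid

# Holds the trace id of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace id to every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def build_logger(stream):
    """Create a logger whose lines all carry the trace id."""
    logger = logging.getLogger("zhiji.demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_trace_id=None):
    """Reuse the caller's trace id if present, otherwise start a new trace."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("request received")
        logger.info("answer returned")
    finally:
        trace_id_var.reset(token)
    return trace_id


buffer = io.StringIO()
handle_request(build_logger(buffer), incoming_trace_id="abc123")
print(buffer.getvalue(), end="")
```

Because every line carries the same trace id, all log entries for one request can be retrieved with a single search in the log system.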
Summary
Zhiji's entire cloud migration has been continuously improved and optimized alongside the improvement of the company's cloud-native infrastructure. The company's co-construction in related fields gave the business more choices during implementation. We hope Zhiji's practice can bring value to more business teams. In the future, as part of deeper cloud-native practice, the team will make more attempts in the direction of cloud-native standardization (the mecha concept).
About us
For more cases and knowledge about cloud native, follow the official account of the same name, [Tencent Cloud Native].