author

Chen Zhiwei, a Level 12 backend expert engineer at Tencent, is responsible for the shared backend technology R&D and team management of Happy Game Studio. He has extensive experience in distributed microservice architecture development and game backend operations and maintenance.

Preface

The backend of Happy Game Studio is a distributed microservice architecture. It currently hosts a variety of games with tens of millions of DAU and millions of concurrent online users. The original off-cloud architecture grew out of the QQGame backend, and its core has a history of more than 10 years. It uses multiple self-developed frameworks built for different purposes, plus self-developed basic components, and has derived different service models to adapt to complex business scenarios, eventually accumulating hundreds of microservices. The simplified overall architecture is as follows:

Continuing to create greater business value on top of such a large-scale, platform-based backend system with a complex and diverse business architecture brings considerable challenges and pressure to the team. To briefly enumerate a few problems:

  • Machine resource utilization is extremely low: the average peak CPU utilization across the cluster is below 20%;
  • Service governance capability is insufficient: with multiple R&D frameworks and differing service management methods in play, both the maintenance cost of the overall business and the R&D cost of basic service governance capabilities are high;
  • Service deployment is cumbersome: automation is insufficient, deployment is time-consuming and labor-intensive, and it is easy to cause incidents on the live network;
  • Many old business services lack maintenance: their visualization capabilities are outdated and insufficient, making quality assurance difficult;
  • The overall architecture is complicated, the cost for newcomers to get started is high, and maintainability is insufficient;
  • The annual decommissioning of machine rooms costs considerable manpower;

In the cloud-native era, riding the momentum of the company-wide push to "embrace cloud native", we deeply integrated the capabilities of K8s and Istio, sorted through the system module by module, and carried out many systematic engineering transformations: moving various stateless and stateful services to the cloud, protocol transformation, framework transformation and adaptation, cloud-native service remodeling, data migration, improvement of peripheral service components on the cloud, and establishment of a DevOps process for cloud services. Finally, on the premise of uninterrupted service and a smooth, compatible transition, the overall service architecture was moved onto the cloud and onto the service mesh.

In choosing the cloud technology solution for the overall architecture, we weighed the completeness, scalability, and transformation and maintenance costs of the various options, and finally chose the Istio service mesh as the overall technical solution on the cloud.

Next, I will briefly introduce the cloud migration plans of several modules, following the evolution of the original architecture.

Upgrading the R&D framework and architecture for a low-cost, transparent, and smooth evolution to the service mesh

In order to connect to Istio and let services transition smoothly, we made many adaptive adjustments at the basic framework and architecture level, and finally achieved the following:

  1. Existing business code needs no changes; it only needs to be recompiled to support the gRPC protocol;
  2. Calls between mesh services use gRPC;
  3. Off-cloud services call mesh services using either the private protocol or gRPC;
  4. Mesh services call off-cloud services using gRPC;
  5. Old businesses can be smoothly migrated into the mesh;
  6. Private-protocol requests from the Client side remain compatible;

Next, let me briefly explain some of these points.

Introducing gRPC into the original architecture

Considering that we want to apply Istio's service governance capabilities more comprehensively, we introduced the gRPC protocol stack into the existing development framework. At the same time, to stay compatible with the communication capabilities of the original private protocol, we use gRPC to wrap the private protocol, with compatibility handling done at both the development framework layer and the architecture layer. The structure of the development framework is sketched below:
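
The article does not show how the private protocol is packaged inside gRPC. As a rough, hypothetical illustration (the wrapper names below are ours, not the framework's), a private-protocol packet can travel as opaque bytes in a gRPC payload, so that recompiled services speak gRPC on the wire while the business code keeps producing the old packets:

```cpp
// Hypothetical wrapper: the legacy private-protocol packet travels as
// opaque bytes inside a gRPC payload, e.g. via a proto such as
//   message PrivatePacket { uint32 cmd_id = 1; bytes body = 2; }
// Here we only sketch the framing step using grpc::ByteBuffer.
#include <grpcpp/support/byte_buffer.h>
#include <grpcpp/support/slice.h>

#include <string>
#include <vector>

// Serialized private-protocol packet as produced by the legacy framework.
grpc::ByteBuffer WrapPrivatePacket(const std::string& raw_packet) {
  // Copy the raw packet into a gRPC slice; the generic gRPC machinery
  // (or a generated stub) can then send it as the request payload.
  grpc::Slice slice(raw_packet.data(), raw_packet.size());
  return grpc::ByteBuffer(&slice, 1);
}

std::string UnwrapPrivatePacket(const grpc::ByteBuffer& buffer) {
  // Reassemble the original bytes on the receiving side.
  std::vector<grpc::Slice> slices;
  buffer.Dump(&slices);  // returns a Status; error handling elided
  std::string raw;
  for (const auto& s : slices) {
    raw.append(reinterpret_cast<const char*>(s.begin()), s.size());
  }
  return raw;
}
```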

Using MeshGate to bridge the mesh and off-cloud services

In order to enable services in Istio on the cloud to interoperate with services off the cloud, we developed the MeshGate service to bridge the mesh on the cloud and the services off the cloud.

MeshGate's main function is to register services on both sides, inside and outside the mesh, as proxies for each other, and to convert and adapt between gRPC and the private protocol. The architecture is shown in the following figure:
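
MeshGate itself is an in-house service, so the following is only an illustrative sketch, under assumed names (`PrivateProtoClient`, `HandleMeshToOffCloud`), of one of its two conversion directions: a gRPC call from a mesh service relayed as a private-protocol request to an off-cloud service:

```cpp
// Purely illustrative sketch of one MeshGate conversion path; every name
// here is hypothetical. Direction shown: gRPC in from the mesh, private
// protocol out to the registered off-cloud service, reply wrapped back.
#include <grpcpp/support/byte_buffer.h>
#include <grpcpp/support/slice.h>

#include <string>
#include <vector>

static std::string ByteBufferToString(const grpc::ByteBuffer& buf) {
  std::vector<grpc::Slice> slices;
  buf.Dump(&slices);  // error handling elided
  std::string out;
  for (const auto& s : slices)
    out.append(reinterpret_cast<const char*>(s.begin()), s.size());
  return out;
}

// Hypothetical private-protocol client that MeshGate keeps per registered
// off-cloud service.
class PrivateProtoClient {
 public:
  // Sends one framed private-protocol packet and waits for the reply.
  std::string Call(const std::string& packet) { return packet; }  // stub
};

grpc::ByteBuffer HandleMeshToOffCloud(const grpc::ByteBuffer& grpc_req,
                                      PrivateProtoClient& backend) {
  // 1. Unwrap: the gRPC payload carries the private packet as opaque bytes.
  std::string packet = ByteBufferToString(grpc_req);
  // 2. Relay over the legacy private protocol.
  std::string reply = backend.Call(packet);
  // 3. Wrap the reply back into a gRPC payload for the mesh caller.
  grpc::Slice slice(reply.data(), reply.size());
  return grpc::ByteBuffer(&slice, 1);
}
```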

Architecture evolution

Building on businesses' ability to support gRPC after recompilation, and on the interoperability between services inside and outside the mesh, we can smoothly migrate both new and old services to the cloud.

Of course, during the migration we did not blindly containerize services and push them to the cloud. For each kind of service we performed targeted cloud-native transformation and service-quality hardening, and improved the observability of services, ultimately improving their maintainability and resource utilization.

After a service moves to the cloud, its resource configuration granularity becomes Pod-level and it supports automatic scaling. There is therefore no need to reserve large amounts of resources for specific services, and most services can share Node resources. Machine resource utilization rises greatly as a result, and overall machine resources can be reduced by roughly 60-70%.

Beyond the savings in machine resources, services use Helm's declarative one-click deployment model, K8s can better maintain service availability, and the architecture gains Istio's powerful service governance capabilities. In the end, the DevOps efficiency of the business improved greatly.

The evolution of the overall architecture is shown in the following figure:

However, careful readers may notice that after services move onto the mesh, communication between the business and the client side must be forwarded from the self-developed access cluster Lotus to MeshGate, going through multiple protocol conversions and forwarding hops, which adds performance overhead and latency to the communication link. For latency-sensitive business scenarios in games, this loss is unacceptable, so we urgently needed a gateway that accesses the service mesh directly. Next, we introduce the transformation plan for gateway access.

Private-protocol access to services in the mesh

The original off-cloud self-developed access cluster, Lotus, is a gateway based on a private protocol over long-lived TCP connections. It provides service registration, large-scale user connection management, communication authentication, encryption and decryption, forwarding, and other capabilities.

Besides the communication overhead after services migrate to the mesh, as mentioned above, there were some other problems:

  1. Operation and maintenance of the Lotus cluster is cumbersome. To keep users from having a bad experience due to a dropped connection mid-game, a Lotus process must wait for users to disconnect on their own, and no new connections are routed to a draining instance; in short, Lotus must drain its existing long-lived connections before it can be updated, so each update waits a long time. By our count, releasing a new Lotus version across the entire network takes several days. And when problems occur or nodes are decommissioned or added, the changes require manual adjustment of network-wide configuration policies in more than ten steps; overall efficiency is low.
  2. Resource utilization of the Lotus cluster is low. Because Lotus is the most fundamental service and is not convenient to redeploy, sufficient machine resources must be reserved to cope with swings in business traffic. This leaves Lotus with low resource utilization: daily peak CPU utilization is only about 25%.

To this end, based on Envoy, CNCF's open-source project, we added support for forwarding our private protocol, connected it to the Istio control plane, adapted it to our original business model, and implemented private-protocol communication authentication, encryption and decryption, client connection management, and other capabilities, finally completing the migration of the access layer to the cloud. The overall technical framework is shown in the figure below:
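
The article does not include the Envoy extension code. In Envoy, a custom TCP protocol is typically supported with a network read filter; the sketch below shows the minimal shape of such a filter, with an invented 8-byte header standing in for the real private protocol and with authentication, crypto, and connection management elided:

```cpp
// Minimal sketch of an Envoy TCP read filter for a private protocol.
// The header layout and routing logic are invented for illustration;
// the real filter also handles auth, crypto, and connection management.
#include <cstdint>

#include "envoy/buffer/buffer.h"
#include "envoy/network/filter.h"

namespace Envoy {
namespace Filter {

// Hypothetical fixed-size packet header of the private protocol.
struct PrivateHeader {
  uint32_t length;  // total packet length, network byte order
  uint32_t cmd_id;  // command id used to pick the upstream service
};

class PrivateProtoFilter : public Network::ReadFilter {
public:
  Network::FilterStatus onNewConnection() override {
    return Network::FilterStatus::Continue;
  }

  Network::FilterStatus onData(Buffer::Instance& data, bool) override {
    // Wait until at least one full header has arrived.
    if (data.length() < sizeof(PrivateHeader)) {
      return Network::FilterStatus::StopIteration;
    }
    PrivateHeader header;
    data.copyOut(0, sizeof(header), &header);
    // A real filter would authenticate/decrypt here and choose the
    // upstream cluster from header.cmd_id before continuing the chain.
    return Network::FilterStatus::Continue;
  }

  void initializeReadFilterCallbacks(
      Network::ReadFilterCallbacks& callbacks) override {
    read_callbacks_ = &callbacks;  // kept for routing in a real filter
  }

private:
  Network::ReadFilterCallbacks* read_callbacks_{nullptr};
};

}  // namespace Filter
}  // namespace Envoy
```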

After the transformation, the access cluster on the cloud improved in all respects:

  1. The performance and latency of the private protocol in core business scenarios are close to those of the off-cloud environment;
    For core business scenarios we ran corresponding stress tests. With private-protocol support in Envoy, the performance overhead and latency of access and forwarding are of the same order of magnitude as a direct off-cloud connection. The measured latencies are shown in the following table:

    Scenario                                    Average latency   P95 latency
    Direct connection off the cloud             0.38ms            0.67ms
    Forwarding between K8s Pods                 0.52ms            0.90ms
    Istio + TCP forwarding (private protocol)   0.62ms            1.26ms
    Istio + gRPC forwarding                     6.23ms            14.62ms
  2. Istio's service governance capabilities are supported natively, bringing us closer to idiomatic cloud-native Istio usage;
  3. Through Helm and management by a custom Controller, one-click service launch and rolling updates are achieved; the entire upgrade is automatic, connections are drained during the process with load taken into account, and draining is therefore more efficient (a sketch of the drain idea follows this list).
  4. Because auto-scaling is supported, the access service no longer needs to over-reserve resources, so resource overhead drops sharply; with the access cluster fully on the cloud, CPU usage falls by 50-60% and memory usage by about 70%.
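
As a sketch of the draining idea mentioned in point 3 (all names here are hypothetical, not the actual Controller code): an instance slated for update first stops accepting new connections, then exits only once its long-lived connections have ended on their own or a deadline passes:

```cpp
// Illustrative drain loop for a graceful access-node update; every name
// here is hypothetical. A real implementation would also weigh cluster
// load when deciding how many instances may drain at once.
#include <chrono>
#include <cstddef>
#include <thread>

class AccessNode {  // hypothetical handle onto one gateway instance
 public:
  void StopAcceptingNewConnections() { accepting_ = false; }
  std::size_t ActiveConnections() const { return active_; }

 private:
  bool accepting_{true};
  std::size_t active_{0};
};

void DrainBeforeUpdate(AccessNode& node, std::chrono::hours deadline) {
  using clock = std::chrono::steady_clock;
  const auto stop_at = clock::now() + deadline;
  // New players are routed to other instances from this point on.
  node.StopAcceptingNewConnections();
  // Players already in a game keep their connection until they leave, so
  // the instance only exits once it is empty (or the deadline is hit).
  while (node.ActiveConnections() > 0 && clock::now() < stop_at) {
    std::this_thread::sleep_for(std::chrono::seconds(30));
  }
}
```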

Architecture evolution

With the access cluster on the cloud, the overall architecture evolves as shown in the figure above. Next, we take GameSvr, representative of the strongly stateful services in our games, and briefly introduce its migration to the cloud.

GameSvr on the cloud

In the past, Happy Game Studio mostly ran room-based, match-oriented games (note: today it runs far more than that, including MMO, Big World, SLG, and other game categories). The architecture of GameSvr off the cloud is shown in the figure below:

However, this off-cloud architecture had several problems:

  1. Operation and maintenance are cumbersome: bringing a single GameSvr online or offline takes more than ten manual steps, decommissioning machines consumes several person-weeks every year, and accidents are easy to trigger;
  2. Resource utilization is low: because scaling is hard, enough resources must be reserved for redundant deployment, leaving peak CPU utilization at only about 20%;
  3. Overall disaster tolerance is weak: manual intervention is required after a machine goes down;
  4. Match scheduling is inflexible, relying on manually configured static policies;

Therefore, with the help of cloud-native capabilities, we built a single-match GameSvr architecture that is easy to scale, easy to maintain, and highly available, as shown below:

During the entire migration to the cloud, we changed nothing on the front end, and users transitioned smoothly to the GameSvr cluster on the cloud mesh. We finally achieved:

  1. Significantly improved resource utilization: overall CPU and memory usage both dropped by nearly two-thirds.
  2. Greatly improved operation and maintenance efficiency: through custom CRDs and Controller management, Helm can deploy an entire cluster with one click, and bringing services online or offline is very convenient. One business project team alone saves nearly 10 person-days per month on GameSvr releases;
  3. GameSvr can scale automatically and reliably according to the current load pressure of the cluster and the time series of historical load pressure (a sketch of the idea follows this list);
  4. Flexible and reliable per-match scheduling: through simple configuration, a match can be scheduled into different Sets according to its attributes; load and service quality are also considered during scheduling, yielding a better overall scheduling choice.
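
The actual scaling logic is not published; the following sketch only illustrates, with invented names, how a desired replica count might combine current load with the historical load of the same time window, as point 3 describes:

```cpp
// Illustrative replica computation combining current load with the
// historical load for the same time window; every name is hypothetical.
#include <algorithm>
#include <cstddef>

struct LoadSample {
  double cpu_utilization;  // 0.0 - 1.0 across the GameSvr fleet
  std::size_t replicas;    // replicas that were serving that load
};

// current: the fleet's load right now.
// same_window_last_week: load observed at this time of day a week ago,
// used to scale out ahead of predictable evening peaks.
std::size_t DesiredReplicas(const LoadSample& current,
                            const LoadSample& same_window_last_week,
                            double target_utilization /* e.g. 0.5 */) {
  auto needed = [&](const LoadSample& s) {
    // Replicas required to bring utilization down to the target (ceil).
    return static_cast<std::size_t>(
        s.replicas * s.cpu_utilization / target_utilization + 0.999);
  };
  // Take the larger of "what the current load needs" and "what this time
  // window historically needed", combining reactive and predictive scaling.
  return std::max<std::size_t>(
      1, std::max(needed(current), needed(same_window_last_week)));
}
```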

Architecture evolution

After GameSvr goes to the cloud, the overall architecture changes as shown in the figure above. Next, let's look at how CGI goes to the cloud.

Moving a large number of CGIs to the cloud

We made large-scale use of CGI under Apache as the development framework for operational activities. But the original CGI business had the following status quo:

  1. There are many kinds of services: about 350 CGI services are deployed on the live network, carrying huge traffic;
  2. The synchronous, blocking process model of CGI leads to extremely low single-process throughput: most CGI services have single-digit QPS, on top of the performance overhead of Apache's service scheduling and message distribution;
  3. Resource isolation between CGIs is poor: because CGIs of different businesses run as multiple processes on the same machine, a sudden spike in one business's CGI resource usage easily affects the CGIs of other businesses;

Faced with a large number of low-performance, low-efficiency CGIs, we needed a path to the cloud with low R&D cost and low resource overhead. At first we tried to package Apache and the CGIs as a whole into a single container and move it to the cloud, but found the resource cost and deployment model very unsatisfactory, so a more elegant plan was needed.

Next, we analyzed the traffic distribution of the CGIs and found that 90% of business traffic was concentrated in just 5% of them, as shown in the figure below.

Therefore, we migrated CGIs with different traffic levels to the cloud in differentiated ways:

  1. For the head-traffic CGIs, we performed an asynchronous, coroutine-based transformation of the framework and stripped out Apache, improving framework performance by dozens of times (see the sketch after this list).

    • At the framework layer, HTTP requests are listened for and processed asynchronously:

      • http-parser is integrated so that the framework itself supports HTTP listening and processing;
      • based on libco, the bottom layer of the framework supports coroutines, thereby achieving asynchrony;
    • At the business layer, various adaptations are also required:

      • global variables are privatized or managed in association with coroutine objects;
      • resources such as backend network connections, IO, configuration loading, and memory are reused to improve efficiency;

      In the end, only minor adjustments on the business side are needed to complete the coroutine transformation. But even with a low per-service cost, there are simply too many CGIs, so transforming all of them asynchronously is not cost-effective.

  2. For the remaining long-tail-traffic CGIs, they are packaged together with Apache and relocated to the cloud in one go using scripts. To improve observability, special handling was added for collecting and exporting metrics from the very many processes inside this single container.
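
Neither the http-parser integration nor the libco wiring appears in the article; the sketch below only illustrates the coroutine idea from point 1. The libco calls (`co_create`, `co_resume`, `co_enable_hook_sys`) are real, while `RequestContext` and `HandleBusinessLogic` are hypothetical framework names:

```cpp
// Minimal sketch of handling one request per libco coroutine.
#include "co_routine.h"  // Tencent libco

struct RequestContext {
  int client_fd;  // the parsed HTTP request would also live here
};

static void HandleBusinessLogic(RequestContext* ctx) {
  // Placeholder for the business handler; with syscall hooking enabled,
  // its "synchronous" backend calls actually yield the coroutine.
  (void)ctx;
}

static void* RequestCoroutine(void* arg) {
  // Enable libco's syscall hooks: blocking read/write on sockets inside
  // this coroutine yield to the scheduler instead of stalling the process.
  co_enable_hook_sys();
  auto* ctx = static_cast<RequestContext*>(arg);
  HandleBusinessLogic(ctx);  // ctx cleanup on completion elided
  return nullptr;
}

void OnHttpRequestParsed(RequestContext* ctx) {
  // Spawn one coroutine per request; http-parser has already split the
  // byte stream into requests before this point.
  stCoRoutine_t* co = nullptr;
  co_create(&co, /*attr=*/nullptr, RequestCoroutine, ctx);
  co_resume(co);
}
```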

Finally, during the migration we made full use of Apache's forwarding mechanism to achieve gray-scale (canary) migration to the cloud with the ability to roll back.
After the migration, the overall resource utilization and maintainability of the CGIs improved greatly. With all CGIs on the cloud, CPU core usage drops by nearly 85% and memory usage by about 70%.

Architecture evolution

After the CGI migration, the overall architecture is as shown in the figure above. Next, I will introduce the transformation plan for our self-developed storage, CubeDB.

Self-developed storage business migration

Off the cloud we had dozens of terabytes of data in self-developed storage, built on hundreds of self-maintained MySQL tables; the overall maintenance cost was high, and migrating it to the cloud as-is would have been difficult. Our solution was therefore to "leave professional work to professional people" and migrate the storage to TcaplusDB (the public storage service self-developed by Tencent IEG). The migration steps are briefly as follows:

  1. We developed an adaptation proxy service, the Cube2TcaplusProxy shown in the figure above, which adapts and converts the CubeDB private protocol to TcaplusDB's, so new services can use TcaplusDB for storage directly;
  2. Hot business data is synchronized from CubeDB's standby instances; once synchronization is enabled, TcaplusDB holds the latest business data;
  3. Cold data is imported into TcaplusDB; if a record already exists in TcaplusDB, it is already the latest version and is not overwritten (a sketch of this rule follows this list);
  4. The full data sets of MySQL and TcaplusDB are compared, and the Proxy's routing is switched after multiple rounds of verification pass;
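
The import rule in step 3 amounts to "insert only if absent". The sketch below illustrates that rule with a hypothetical client interface; it is not the TcaplusDB SDK:

```cpp
// Illustrative cold-data import honoring "do not overwrite": insert a
// record only if it is absent, so records already written by the hot-data
// sync (which are newer) win. The client interface is hypothetical.
#include <string>

enum class InsertResult { kInserted, kAlreadyExists, kError };

class KvStoreClient {  // hypothetical stand-in for the storage client
 public:
  // Fails with kAlreadyExists instead of overwriting an existing key.
  InsertResult InsertIfAbsent(const std::string& key,
                              const std::string& value) {
    (void)key;
    (void)value;
    return InsertResult::kInserted;  // placeholder; the real client does
                                     // a conditional insert server-side
  }
};

// Returns false only on a real error; an existing record is expected and
// simply means the hot-data sync already wrote a newer version.
bool ImportColdRecord(KvStoreClient& store, const std::string& key,
                      const std::string& value) {
  switch (store.InsertIfAbsent(key, value)) {
    case InsertResult::kInserted:      return true;
    case InsertResult::kAlreadyExists: return true;   // newer data kept
    case InsertResult::kError:         return false;  // retry elsewhere
  }
  return false;
}
```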

Finally, through this solution, we realized the lossless and smooth migration of DB storage.

Architecture evolution

After the transformation of our self-developed storage service, most services could move to the cloud. Meanwhile, we also built out and applied many peripheral capabilities on the cloud, such as a unified configuration center, Grafana as code, Prometheus, a log center, request-dyeing call-chain tracing, and so on.

The final architecture evolved into:

Multi-cluster deployment mode

Off the cloud we ran an all-region, all-server architecture: all game businesses lived in one cluster. However, given our organizational structure and business form, the expectation for the cloud was that different business teams would work in different business K8s clusters, with shared services managed in a common cluster. This required additional adaptation and migration work during the move to the cloud.

At the Istio level, our Istio service is hosted by the TCM team (Tencent Cloud Service Mesh, TCM). With strong support from the TCM team, and matching our current organizational structure and business form, we synchronized Istio control-plane information across multiple clusters, so services in different clusters can call each other at very low cost. The following is the TCM-related backend architecture:

Summary

Finally, under a complex game business architecture, through careful analysis and continuous refactoring and evolution based on cloud-native technology, deeply integrating the capabilities of K8s and Istio, we achieved a smooth, high-quality move of the game business onto the cloud and onto the mesh. The platform now offers a multi-framework, multi-language microservice setup with automation, service discovery, elastic scaling, service management and control, traffic scheduling and management, multi-dimensional metrics and monitoring, and other capabilities, and we have accumulated cloud-migration experience across the various business scenarios of our games. The reliability, observability, and maintainability of business modules have improved greatly, and overall R&D and operations efficiency has risen markedly.

Happy Game Studio operates several nationally popular chess and card games such as Happy Doudizhu (Fight the Landlord), Happy Mahjong, and Happy Upgrade, and is developing a variety of titles in the Big World, MMO, SLG, and other categories. It is now hiring for many R&D, game design, and art positions; you are welcome to follow the recruitment link to recommend candidates or submit your resume.

