Author
Wu Lianhuo is an expert development engineer at Tencent Games, responsible for the large-scale distributed server architecture of Happy Games. With more than ten years of experience in microservice architecture, he specializes in distributed systems and has extensive hands-on experience with high performance and high availability. He is currently leading the team through a comprehensive transformation to a cloud-native technology stack.
Introduction
Happy Games introduced the Istio service mesh in 2019, and nearly three years have passed from initial research to large-scale rollout. This article offers some reflections on that practice, in the hope that it can serve as a reference for readers interested in service meshes.
Before the main text begins, let's clarify the term "service mesh" as used in this article: a backend, architecture-level solution built on sidecar communication proxies and a mesh topology. The most popular open-source implementation in the industry today is Istio.
The architectural idea of a service mesh is decoupling plus adding a layer of indirection. By decoupling basic governance capabilities from the business process and providing them as sidecars, reuse happens at a much larger scale; once standardized, the reuse becomes industry-wide. In fact, as a hot solution in the microservices field, the decoupling and splitting that the mesh itself performs is very "microservice" in spirit, because microservices are essentially about splitting processes, decoupling them, and deploying them independently so that services are easier to reuse.
Status and benefits
State of the technology stack
- Programming languages: Go / C++ / Python
- Meta system: Protobuf (used to describe configuration, storage, and protocols)
- RPC framework: gRPC
- Unit testing: gtest
- Containerization: Docker + Kubernetes
- Mesh: Istio, Envoy gateway
- Configuration: a PB-based Excel conversion tool plus a configuration distribution and management center
- Monitoring: Prometheus
- Others: code generation tools, Blue Shield pipeline, CodeCC code scanning, Helm-based deployment, etc.
Core benefits
- Technical values: the team's technical values have become more open, embracing the open-source ecosystem and staying close to the cloud-native technology stack.
- Team growth: evolving a large technology stack is a challenging task, and teammates naturally improve their skills through all the "monster fighting and leveling up" along the way.
- RPC framework: introducing gRPC unified cross-language RPC. The original self-developed RPC framework is still in use, but its lower layer has been adapted to gRPC.
- Introducing Go: Go is now used for routine feature development, which improves R&D efficiency.
- Mesh capabilities: with no development required, traffic management is handled through Istio VirtualServices; traffic is scheduled by label aggregation and version, and consistent hashing is used.
- Machine cost: this is the only benefit that is relatively quantifiable. Accurate numbers will have to wait until the remaining migration to 100% cloud is complete; a preliminary estimate puts it at 60% to 70% of the original cost.
Overall, what we have done is evolve a large technology-stack system, which reflects better technical values and improves R&D efficiency. So, looking back now, was the mesh worth practicing for us? The answer is still yes.
But if we set aside the technology-stack evolution and look at the mesh itself, frankly speaking, our use of mesh capabilities is still fairly preliminary:
- Whether to forward: circuit breaking, rate limiting, and retries (on the premise of idempotence); not yet practiced.
- Whom to forward to: naming service; practiced, using VirtualServices with Maglev consistent hashing (see the configuration sketch below).
- Debugging features: fault injection and traffic mirroring; not yet practiced.
- Observability: tracing is turned off; not yet practiced.
Considering the actual cost, we do not intercept inbound traffic, so any feature that depends on inbound interception cannot be practiced at the moment.
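To make the routing approach above concrete, here is a minimal sketch of the kind of VirtualService and DestinationRule configuration involved. The service name, namespace, labels, and hash header are hypothetical placeholders, and the Maglev option is only exposed in newer Istio releases (older ones offer ring hash), so the exact fields may differ from what runs in our environment.

```yaml
# Hypothetical example: route a gRPC service to the v1 subset and balance
# requests with consistent hashing keyed on an account-id header.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: game-logic
spec:
  hosts:
    - game-logic.demo.svc.cluster.local
  http:
    - route:
        - destination:
            host: game-logic.demo.svc.cluster.local
            subset: v1
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: game-logic
spec:
  host: game-logic.demo.svc.cluster.local
  subsets:
    - name: v1
      labels:
        version: v1        # pods are grouped into subsets by this label
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: x-account-id   # hash key carried in gRPC metadata
        maglev:
          tableSize: 65537
```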
The real selling points of a mesh
From the author's personal observation, the two most attractive things about the Istio mesh are:
- It opens up the imagination space of the technology stack. As the whole Istio / Envoy / gRPC ecosystem grows richer, more capabilities may be provided out of the box in the future, with no development investment needed from the business team.
- Multi-language support: there is no need to develop and maintain an SDK for every language; for example, Envoy, written in C++, can serve every service that uses gRPC.
As for circuit breaking, rate limiting, load balancing, retries, mirroring, fault injection, and tracing/monitoring: strictly speaking these cannot be credited to the mesh alone, since the same can be achieved with an SDK. When the team is unified on one language, only one language version of the SDK needs to be maintained, and adopting a governance SDK (the so-called microservice framework approach) is also feasible. The version-maintenance problem of the SDK model and the question of how the mesh evolves further later on are not hard to solve, and we will not digress into them here.
For us, since Go and gRPC happened to be introduced at the same time, choosing Istio as the mesh solution was a natural fit.
Reflections on mesh practice
Some preconditions
Adopting a mesh requires the right timing and conditions. That is, some basic prerequisites need to be met:
- The project stage must allow it. If the team is busy iterating on fast-release content and business demands leave no slack, it will be hard to guarantee the manpower.
- There must be infrastructure support (we use Tencent Cloud's TKE Mesh service), so that you are not starting everything from scratch.
In addition, for a large technical optimization like this, it is necessary to align everyone's thinking first:
- Top-down: get the approval of management stakeholders at all levels, so that a larger human investment can be made.
- Bottom-up: get engineers deeply involved in the discussion, so that the overall direction and plan are recognized by everyone and everyone stays motivated.
Think before you act
In the early conception stage, a few big questions need to be clarified:
- 1) What do you want to achieve? Saving machine cost, improving R&D efficiency, growing the team, evolving the technology stack? And for each goal, is there a better path to it?
- 2) Is there a risk of losing control? Is the performance acceptable? Are Kubernetes and Istio stable enough? Is there an availability risk in extreme cases?
- 3) How do we transition smoothly? While services are being migrated, can the R&D model transition smoothly as well?
For the first point, each team should judge by its own actual situation, so we will not expand on it here.
For the second point, Kubernetes has many large-scale deployments in the industry, so it is reliable, and its level-triggered design makes it robust. The relative unknown was Istio. The team ran some stress tests on Istio at the beginning and also planned for a rollback to a no-mesh setup; the conclusion was that it was worth trying. Istio is essentially a complex, large piece of software, so its main daunting points are its complex configuration, compatibility concerns between versions, and the poor controllability of a black box. Looking back now, our team has indeed come a long way. Fortunately, the subsequent rollout showed that Istio's own stability is not bad; it does not run into problems every couple of days.
For the third point, we deliberately designed a layer of indirection: a gateway that converts between our private protocol and gRPC ensures services can migrate to the cloud smoothly, and a gRPC adaptation layer inside the services keeps the developers' basic development model unchanged.
Overall system architecture
The overall architecture of the system is shown in the figure below, where the indirection layer mentioned above can be clearly seen:
Figure: gRPC adaptation and communication proxies inside and outside the mesh
Cloud-native R&D experience
We will not expand on the parts that differ little from before, such as evaluating requirements, formulating plans, and writing code. The following mainly lists R&D experiences under the cloud-native technology stack that differ significantly from our previous practice.
- Helm: the internal and external network YAML of all services is managed through Helm, and each service's deployment dependencies are fully described in its own YAML.
- Dev copies of the test environment: the system has a great many services. Although the internal network runs debug builds, whose resource consumption is much lower than release builds, the complex inter-service dependencies make it impractical to deploy a test environment per person, so at present a few selected environments are shared by everyone. To resolve conflicts when multiple people self-test in the same environment, we use the mesh's routing capability to deploy per-uin dev copies, so that when developer A works on a specific service, A's own requests land on A's dedicated deployment (see the sketch after the figure below).
Figure: routing to different dedicated deployments based on different account numbers
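As a sketch of how such per-developer routing can be expressed, the following VirtualService sends requests carrying a particular account id to that developer's dedicated deployment and lets everything else fall through to the shared environment. The header name, account number, hosts, and subset names are hypothetical, and a matching DestinationRule is assumed to define the subsets by pod label.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: battle-svc-dev
spec:
  hosts:
    - battle-svc.test.svc.cluster.local
  http:
    - match:
        - headers:
            x-uin:
              exact: "10001"          # developer A's test account
      route:
        - destination:
            host: battle-svc.test.svc.cluster.local
            subset: dev-a             # developer A's dedicated deployment
    - route:
        - destination:
            host: battle-svc.test.svc.cluster.local
            subset: shared            # default shared test deployment
```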
- The test environment is fully rebuilt and redeployed automatically every day: this, however, brings a problem. After a pod is rebuilt and drifts to another node, logs, coredumps, and other information no longer match up; for example, a tester reports a problem from the previous day, but the developer has no idea in which (already destroyed) pod it occurred. We set the Kubernetes node affinity policy preferredDuringSchedulingIgnoredDuringExecution and combine it with a fixed log path (named after the deployment rather than the pod), so that after a rebuild the test-environment pod stays on its original node and the log path stays the same; a new pod of the same service can then still see the previous day's logs (see the affinity sketch after this list).
- External-network canary version: used during grayscale rollout. It is switched on directly through the deploymentCanary configuration item in the YAML, and the grayscale traffic ratio is configured with an Istio VirtualService. It is sometimes also enabled to troubleshoot external-network problems: traffic from dyed account numbers is routed into the canary version. Concretely, the gateway process reads a list of account numbers; if a request's number is in the list, the gateway puts the corresponding label into the gRPC headers, and the VirtualService's routing capability then directs it to the canary version (see the canary sketch after this list).
- HPA practice: the author was rather hesitant about HPA early on, because it essentially makes the timing of service deployment and release uncontrollable, unlike a conventional release with manual intervention. There have indeed been some problems online. For example, HPA scaling at night (which depends on the associated metrics pipeline being healthy) has led to service overload. Before log collection was in place, HPA caused pods to drift, so alarm information from a pod the previous night was hard to investigate the next day; you had to go to the node it had been scheduled on. There was also a case where a process scaled up by HPA could not start: its configuration was wrong and could not be loaded and initialized, so already-running processes merely failed the reload, but a stopped-and-restarted process failed to start at all. Still, HPA is very valuable for improving resource utilization, so our current practice is to treat services differently: for ordinary services the minimum replica count can be small, while for important services it is configured somewhat larger.
- Graceful start and stop: implemented directly with Kubernetes readiness and liveness probes.
- External-network log collection: a fairly handy platform service we had not used before. The business itself was already writing remote rsyslog logs; later, CFS may be used to mount a network disk, which can be regarded as a stopgap.
- Configuration system: configuration is defined with Protobuf, parsing is based on code generation, distribution is based on Rainbow, and pulling is done by configAgent. The archival form of the configuration lives in SVN as Excel files, and a tool converts the Excel into the format the programs read. configAgent is a container that a webhook dynamically injects into pods.
- Monitoring system: Prometheus plus cloud monitoring.
- DEBUG_START environment variable: in the early days of containerized deployment we ran into processes that failed to start, were repeatedly restarted, and whose pods then drifted all over the place, which made troubleshooting inconvenient. So we added a DEBUG_START environment variable: if it is set to true and the process fails to start, the container does not exit.
- Due to security and permission restrictions, perf cannot be run inside containers on the cloud. For now we temporarily apply for root permission on the node to run perf there, and the binary has to be placed on the node as well, otherwise perf cannot resolve symbol information. Go services are profiled directly with Go's own toolchain.
- Preserving the scene of a problem pod: since deployments select pods by label, keeping a faulty pod around on the external network is very simple; just change its labels.
- Viewing coredumps: after the segfault signal is caught, the binary copies itself into the coredump folder, and you can inspect it by attaching from any pod currently alive on the coredump node.
- Code generation: this actually has little to do with moving to the cloud, but we have done a lot of work based on Protobuf (for example, using .proto to define configuration files, providing functionality similar to xresloader), which has been very beneficial, so it is listed here. After tidying up the relevant code and improving the documentation, we will also consider open-sourcing it.
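For the daily-rebuild item above, the following is a rough sketch of the kind of pod spec involved. The node hostname, image, and log path are hypothetical placeholders (in practice they would be templated through Helm); the idea is simply to prefer the previous node and to key the log directory on the deployment name rather than the pod name.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: battle-svc
spec:
  replicas: 1
  selector:
    matchLabels:
      app: battle-svc
  template:
    metadata:
      labels:
        app: battle-svc
    spec:
      affinity:
        nodeAffinity:
          # Prefer, but do not require, the node the service usually runs on,
          # so a rebuilt pod lands next to yesterday's logs and coredumps.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values: ["test-node-1"]
      containers:
        - name: battle-svc
          image: battle-svc:debug
          volumeMounts:
            - name: logs
              mountPath: /data/log/battle-svc   # deployment name, not pod name
      volumes:
        - name: logs
          hostPath:
            path: /data/log/battle-svc
            type: DirectoryOrCreate
```

For the canary item, here is a minimal sketch of how dyed accounts plus a small traffic percentage can be sent to a canary subset. The header name, ratio, and service names are again placeholders, and the stable/canary subsets are assumed to be defined in a DestinationRule.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: lobby-svc-canary
spec:
  hosts:
    - lobby-svc.prod.svc.cluster.local
  http:
    # Dyed accounts: the gateway sets this header after checking its number list.
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: lobby-svc.prod.svc.cluster.local
            subset: canary
    # Everyone else: weighted split between the stable and canary versions.
    - route:
        - destination:
            host: lobby-svc.prod.svc.cluster.local
            subset: stable
          weight: 95
        - destination:
            host: lobby-svc.prod.svc.cluster.local
            subset: canary
          weight: 5
```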
Performance
Before discussing performance, let's restate how we use the mesh: tracing is turned off, and inbound interception is turned off (traffic arriving from remote peers does not go through the sidecar).
Figure: the business container on pod1 calls the service on pod2, with only outbound traffic intercepted
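For reference, skipping inbound interception while keeping outbound interception can be expressed with Istio's standard traffic annotations on the pod template. This is only a sketch of the mechanism; TKE Mesh also exposes its own switches, and our actual setup may differ.

```yaml
# Pod template fragment: do not redirect any inbound port to the sidecar,
# while outbound traffic is still captured and proxied by Envoy.
template:
  metadata:
    annotations:
      traffic.sidecar.istio.io/includeInboundPorts: ""
```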
With the above as context, and combined with real cases from our production environment, here is some performance data that readers may find interesting:
- Memory overhead: with hundreds of services in the system and consistent hashing in use, the Envoy sidecar's memory footprint is roughly two to three hundred megabytes.
- CPU overhead: typical CPU overhead is related to fan-out. For a service that calls many other gRPC services frequently, Envoy's CPU overhead can even exceed that of the main business process; when the business process fans out little, Envoy's overhead is much lower.
For the memory overhead, the community has a fairly clear solution: using the Sidecar CRD to restrict the xDS information loaded to just the target services a workload actually needs can greatly reduce memory usage (a sketch follows below). Which target services a business process needs to reach can be specified through manual maintenance, static registration, or code generation. We will do related optimization later.
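A minimal sketch of the Sidecar resource mentioned here; the namespace and host list are hypothetical. The point is that each workload's sidecar only receives xDS configuration for the services it actually calls, instead of for every service in the mesh.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: battle              # namespace of the calling workloads
spec:
  egress:
    - hosts:
        - "./*"                  # services in the same namespace
        - "istio-system/*"       # mesh infrastructure services
        - "hall/lobby-svc.hall.svc.cluster.local"   # explicit dependencies only
```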
Next, let's spend a bit more space on the CPU overhead. First, look at a top snapshot of a business process with large fan-out:
Figure: CPU comparison between a large fan-out business process and Envoy
Seeing the data in the figure, readers may well ask: why does Envoy, which only does gRPC forwarding, need such high CPU (71.3% in the figure, far exceeding the business process's 43.7%)? We analyzed the flame graph and did not find anything clearly abnormal: the main work it does is protocol encoding/decoding and routing/forwarding, with no obviously anomalous hot spots.
Figure: Envoy flame graph
There is now plenty of material about Envoy online, and it basically all says its performance is good. Could it be... that the truth is Envoy is actually not efficient enough? The author cannot give a definite answer here. Envoy does use a great deal of C++ abstraction internally, with many levels of calls, but that may not be the crux of the problem, because:
- it may be the libnghttp2 protocol-parsing library used by Envoy that drags performance down;
- it may be that Envoy is not using libnghttp2 in the right "posture", so its performance is not fully exploited;
- or does HTTP/2 parsing, encoding/decoding, and packet send/receive really consume that much CPU?
Regarding the last point, we observed the gRPC threads in the main business process, which also do HTTP/2 parsing and encoding/decoding, yet their CPU overhead is clearly much lower.
Figure: even after multiplying %CPU by 2, the gRPC threads in the business process still use far less than Envoy
We multiply the %CPU of the gRPC threads in the business process (red box) by 2 before comparing with Envoy (blue box), because the workload Envoy intercepts for outbound is roughly double that of the business process. In codec terms, for one req and rsp pair, the business process does the encoding of the req and the decoding of the rsp, whereas Envoy does decoding plus encoding of the req and decoding plus encoding of the rsp.
Figure: encoding/decoding overhead of Envoy vs. the business process
From the example above, gRPC's own HTTP/2 parsing, encoding/decoding, and packet send/receive perform much better than Envoy with libnghttp2, but obviously Envoy cannot simply reuse the relevant code in gRPC. To answer the question of Envoy's performance as a gRPC communication proxy more convincingly, more detailed analysis, argument, and testing would be needed (interested or experienced readers are welcome to get in touch).
In short, we have not come up with a good optimization for the excessive sidecar CPU consumption in gRPC businesses with large fan-out. Of course, the case above is fairly extreme: large fan-out plus a C++ main business process. With small fan-out the sidecar does not consume much CPU, and with a Go business process the sidecar's share of the pod's overall CPU overhead is nowhere near as exaggerated (which in turn shows that the performance gap between Go and C++ is still quite large...).
For us, the overall performance of the mesh is not unacceptable:
- First, many businesses are not high fan-out, and the sidecar overhead for such businesses is small.
- Second, for Go business processes, the relative increase brought by the sidecar is smaller.
- Finally, compared with the extensive deployment style of a traditional IDC, moving deployment wholesale to the cloud still saves machines overall.
Private protocol or private mesh
As mentioned above, Envoy's performance problem is indeed hard to ignore in large fan-out business scenarios.
If the corresponding performance overhead is truly unacceptable, then a private protocol or a private mesh may be a viable alternative.
- Adopt a private protocol: write filters based on Envoy that parse the private protocol header, and provide service in combination with Envoy's xDS-related capabilities (see Tencent's open-source project https://github.com/aeraki-mesh/aeraki for reference). It is not hard to imagine that under this scheme there is no need to parse HTTP/2 at all, so performance would certainly improve significantly. But going back down the private-protocol road effectively means giving up the gRPC ecosystem again (refer to mesh selling point 1 above).
- Adopt a private mesh: implement the set of capabilities corresponding to xDS yourself, starting with the most essential ones. However, this brings back the multi-language support problem: the corresponding capabilities would need to be implemented for both C++ and Go (refer to mesh selling point 2 above). If you really want to implement a private mesh, a reasonable design keeps the language-specific SDK code simple, keeps control-plane functions such as routing policy in a self-developed sidecar/agent, and, for performance, lets the business process handle the data-plane logic itself.
Outlook on future trends
Happy Games' own practice
For the Happy Games team, we will continue with deeper practice later on, such as Envoy filter development, Kubernetes CRDs, and more Istio capabilities (as mentioned above, we currently use only a small part of the mesh's capabilities; in the future we expect to use capabilities such as circuit breaking and rate limiting to improve business availability).
Integration with eBPF
eBPF may integrate better with container networks and meshes in the future, which could improve network-related performance and may also open up other possibilities.
Proxyless mesh
A proxyless mesh can be seen as an extension of the performance discussion above and is somewhat similar to the private mesh mentioned earlier. This kind of solution will also have its place, because some teams cannot accept the performance overhead brought by a data-plane sidecar:
- Latency is also mentioned by many teams, but for an ordinary Internet business the author personally thinks a delay of up to tens of milliseconds has little impact.
- CPU and memory overhead were discussed at length earlier.
A proxyless mesh is essentially an SDK plus a mesh topology, and gRPC keeps improving its xDS support, so it is also possible to build one on gRPC's capabilities. Developing an xDS-capable SDK yourself still requires real team investment, unless the team is itself the middleware team of a large company (NetEase Qingzhou, Baidu's service mesh, and Alibaba's Dubbo have all been practicing proxyless mesh over the past two years).
Figure: proxyless gRPC mesh
Private solutions benchmarking against xDS
For a team unified on one programming language, say all Go, only one set of service-governance SDKs needs to be maintained (the control-plane logic can also be carried by an agent), so such a team may lean toward building its own private solution. As far as the author knows, Bilibili used this approach earlier, referring to Eureka to implement its own naming service and doing traffic scheduling itself.
Now that meshes are popular (at least the concepts are widely known), private solutions can also benchmark themselves against the various features of xDS. Because private solutions are usually self-developed, they can in theory offer relatively efficient and controllable implementations, but they require continuous investment and maintenance from the team.
Dapr runtime
The concept is good and the story is big, but it's too early to tell.
References
Aeraki (letting Istio support private protocols): https://github.com/aeraki-mesh/aeraki
gRPC support for xDS: https://grpc.github.io/grpc/cpp/md_doc_grpc_xds_features.html
Proxyless gRPC: https://istio.io/latest/blog/2021/proxyless-grpc/
nghttp2 parsing library: https://nghttp2.org/
InfoQ Basic Software Innovation Conference, microservice session: https://www.infoq.cn/video/7RLecjvETz3Nt7HedviF
xresloader configuration conversion tool: https://github.com/xresloader/xresloader
Dapr: https://dapr.io/
About us
For more cloud-native cases and knowledge, follow the [Tencent Cloud Native] public account.