Author
Zhong Hua, Tencent Cloud expert engineer, Istio project member and contributor. He focuses on containers and service meshes, has extensive production experience with containerization and service mesh, and is currently responsible for Tencent Cloud Mesh R&D.
Istio xDS performance bottleneck in large-scale scenarios
xDS is the communication protocol between the Istio control plane and the Envoy proxies on the data plane. The "x" stands for a family of discovery services: LDS (Listener Discovery Service) for listeners, CDS (Cluster Discovery Service) for services and their versions, EDS (Endpoint Discovery Service) for the instances of each service version and their attributes, and RDS (Route Discovery Service) for routes. You can simply think of xDS as the collection of service discovery data and traffic governance rules in the mesh; the volume of xDS data grows with the size of the mesh.
Currently, Istio distributes xDS with a full-push strategy: every sidecar in the mesh holds the service discovery data of the entire mesh in memory. For example, in the figure below, although Workload 1 only depends on Service 2 in its business logic, istiod still pushes the full set of service discovery data (Services 2, 3, and 4) to Workload 1.
As a result, the memory of each sidecar grows with the size of the mesh. The figure below shows a performance test of mesh size versus memory consumption: the x-axis is the mesh size, i.e. how many service instances it contains, and the y-axis is the memory consumption of a single Envoy. Once the mesh exceeds 10,000 instances, a single Envoy uses more than 250 MB, and the overhead of the entire mesh is that figure multiplied by the number of sidecars.
Istio's current optimization approach
To address this problem, the community provides a solution: the Sidecar resource. With it, you can explicitly declare the dependencies, or visibility relationships, between services. For example, the configuration in the figure below declares that Workload 1 only depends on Service 2; once it is applied, istiod only pushes Service 2's information to Workload 1.
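As a sketch, a Sidecar resource expressing "Workload 1 only depends on Service 2" might look like the following. All names, namespaces, and labels here are illustrative, not taken from the figure:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: workload-1          # hypothetical name
  namespace: ns-1           # hypothetical namespace
spec:
  workloadSelector:
    labels:
      app: workload-1       # selects the sidecar of Workload 1
  egress:
  - hosts:
    # Only the control-plane namespace and Service 2 are visible;
    # istiod will not push discovery data for any other service
    # to this sidecar.
    - "istio-system/*"
    - "ns-2/service-2.ns-2.svc.cluster.local"
```

With this applied, the xDS pushed to Workload 1 is trimmed down to Service 2 plus the control-plane namespace.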
The approach itself works. However, it is hard to apply in large-scale scenarios: it requires users to configure the complete dependency graph between services in advance, sorting out service dependencies at that scale is difficult, and the dependencies usually keep changing as the business changes.
Aeraki Lazy xDS
To address these problems, the TCM (Tencent Cloud Mesh) team designed a non-intrusive on-demand xDS loading solution and open sourced it in the Aeraki project on GitHub. The implementation details of Lazy xDS are as follows:
We add two components to the mesh. One is the Lazy xDS Egress, which plays a role similar to a default gateway in the mesh; the other is the Lazy xDS Controller, which analyzes and completes the dependency relationships between services.
1. First, configure the Egress's service forwarding capability: the Egress obtains the information of every service in the mesh and configures routes for all HTTP services, so that, acting as the default gateway, it can forward the traffic of any HTTP service in the mesh.
2. Second, for services with on-demand loading enabled (Workload 1 in the figure), use an EnvoyFilter to route the traffic destined for any HTTP service in the mesh to the Egress.
3. Third, use Istio's Sidecar CRD to limit the service visibility of Workload 1.
4. After step 3, Workload 1 initially loads only a minimal xDS.
5. When Workload 1 initiates access to Service 2, the traffic is forwarded to the Egress (because of step 2).
6. The Egress analyzes the characteristics of the received traffic and forwards it to Service 2 (this works because of step 1).
7. The Egress asynchronously reports the access log to the Lazy xDS Controller, using Envoy's Access Log Service (ALS).
8. The Lazy xDS Controller analyzes the access relationships in the received logs and writes the new dependency (Workload 1 -> Service 2) into the Sidecar CRD.
9. At the same time, the Controller removes the rule from step 2 that forwards Workload 1's traffic for Service 2 through the Egress, so that from now on Workload 1 accesses Service 2 directly.
10. Istiod updates the visibility relationship (because of step 8) and then pushes the service information of Service 2 to Workload 1.
11. Workload 1 receives the service information of Service 2 through xDS.
12. When Workload 1 accesses Service 2 again, the traffic goes directly to Service 2 (because of step 9).
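To make step 2 more concrete, here is a rough, hypothetical sketch of an EnvoyFilter that redirects Workload 1's outbound HTTP traffic to the Egress. All names are illustrative, and the resources Aeraki actually generates may differ in structure and naming:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: lazyxds-redirect    # hypothetical name
  namespace: ns-1           # hypothetical namespace
spec:
  workloadSelector:
    labels:
      app: workload-1       # applies only to Workload 1's sidecar
  configPatches:
  # Patch the sidecar's outbound route configuration so that HTTP
  # traffic is sent to the Lazy xDS Egress instead of the (not yet
  # loaded) destination service.
  - applyTo: ROUTE_CONFIGURATION
    match:
      context: SIDECAR_OUTBOUND
    patch:
      operation: MERGE
      value:
        virtual_hosts:
        - name: lazyxds-catch-all
          domains: ["*"]
          routes:
          - match:
              prefix: "/"
            route:
              # hypothetical Egress service name
              cluster: outbound|80||lazyxds-egress.istio-system.svc.cluster.local
```

In step 9, the Controller removes (or narrows) this patch for destinations that have already been learned, so learned services are reached directly.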
The benefits of this solution:
- Users no longer need to configure service dependencies in advance, and dependencies are allowed to grow dynamically.
- Eventually each Envoy receives only the xDS it actually needs, giving the best possible performance.
- The impact on user traffic is small and requests are never blocked; the performance cost is also low, since only the first few requests are relayed through the Egress and later ones go direct.
- The solution is non-intrusive to Istio and Envoy: we did not modify any istio/envoy source code, so it adapts well to future Istio iterations.
At present, we only support on-demand loading for Layer 7 (HTTP) services, because when traffic is relayed, the Egress has to determine the original destination from a header in the Layer 7 protocol; with a plain TCP protocol there is no way to attach extra headers. However, since Istio's main purpose is managing Layer 7 traffic, and most mesh requests are Layer 7, this limitation is currently acceptable.
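Conceptually, the Egress route table built in step 1 recovers the original destination from the HTTP Host header, along these lines (a simplified, illustrative Envoy route fragment; the routes actually generated may differ):

```yaml
# Simplified sketch of the Egress route table: the original destination
# is recovered by matching the request's Host header against per-service
# virtual hosts. Service names and ports are illustrative.
virtual_hosts:
- name: service-2
  domains:                  # matched against the HTTP Host header
  - "service-2.ns-2.svc.cluster.local"
  - "service-2.ns-2"
  routes:
  - match: { prefix: "/" }
    route: { cluster: "outbound|80||service-2.ns-2.svc.cluster.local" }
- name: service-3
  domains:
  - "service-3.ns-3.svc.cluster.local"
  - "service-3.ns-3"
  routes:
  - match: { prefix: "/" }
    route: { cluster: "outbound|80||service-3.ns-3.svc.cluster.local" }
```

A plain TCP stream carries no such header, which is why on-demand loading is limited to Layer 7 services.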
Lazy xDS performance test
Test setup
In two namespaces of the same mesh, we deployed two copies of Bookinfo. The productpage in the lazy-on namespace on the left has on-demand loading enabled; the one in the lazy-off namespace on the right is left at the default.
Then we gradually grow the number of services in the mesh, using Istio's official load-test tool set (referred to below as the "load services"). Each namespace contains 19 services, 4 TCP and 15 HTTP, each initially with 5 pods, for a total of 95 pods (75 HTTP, 20 TCP). We gradually increase the number of load-service namespaces to simulate the growth of the mesh.
Performance comparison
First is the comparison of CDS and EDS. In the figure below, each group of data corresponds to an increase in the number of load-service namespaces. Each group contains 4 values: the first 2 are the CDS and EDS counts with on-demand loading enabled, and the last 2 are the CDS and EDS counts without it.
Next is the memory comparison. The green data is Envoy's memory consumption with on-demand loading enabled, the red without. With a mesh of 900 pods, Envoy's memory drops by 14 MB, a reduction of about 40%; with a mesh of 10,000 pods, it drops by about 150 MB, a reduction of about 60%.
With service visibility limited, Envoy no longer receives full xDS updates. The figure below compares the number of CDS updates Envoy received during the test cycle: with on-demand loading enabled, the update count dropped from about 6,000 to about 1,000.
Summary
Lazy xDS has been open sourced on GitHub; please see the lazyxds README to learn how to use it.
Lazy xDS is still evolving. In the future, we will support features such as multi-cluster mode and on-demand loading of ServiceEntry.
If you want to know more about Aeraki, please visit the Github homepage: https://github.com/aeraki-framework/aeraki
About us
For more cases and knowledge about cloud native, follow the WeChat public account of the same name, [Tencent Cloud Native].
Benefits:
① Reply [Manual] in the official account to get the "Tencent Cloud Native Roadmap Manual" & "Tencent Cloud Native Best Practices".
② Reply [Series] to get the "15-series collection of 100+ super-practical cloud native originals", including the Kubernetes cost reduction and efficiency improvement, K8s performance optimization practices, and best practices series.
③ Reply [White Paper] to get the "Tencent Cloud Container Security White Paper" & "The Source of Cost Reduction: Cloud Native Cost Management White Paper v1.0".
[Tencent Cloud Native] covers new cloud products, new technologies, new activities, and cloud information. Scan the QR code to follow the public account of the same name and get more first-hand content!