1. Introduction
The NetEase Shufan Qingzhou microservice team adopted Istio as a service mesh early on. Along the way, we developed many peripheral modules around Istio to make it easier for ourselves and our customers within the NetEase Group to use. To give back to the community, we systematically organized these modules, selected some of them, and open-sourced the Slime project in early 2021.
The Slime project aims to solve pain points in using Istio and to make Istio's advanced features accessible, while always adhering to the principle of integrating with Istio seamlessly, without any customization, which greatly lowers the barrier to entry.
Over the past year, Slime has evolved considerably in architecture, functionality, and engineering. In December 2021, Slime was invited to join the Istio ecosystem and officially became a member of Istio Ecosystem - integrations.
This article introduces Slime's main capabilities at this stage, focusing on the lazy loading and intelligent rate limiting modules, and looks ahead to future development. We hope it helps more ServiceMeshers understand Slime, participate in Slime, and use service meshes more easily.
2. Lazy loading
2.1 Background
Istio's full-push performance problem is one that every Istio user has to face.
As we all know, early Istio configuration distribution was very crude: everything was pushed in full. This means that as the business scale inside the service mesh grows, the control plane has to deliver more and more content and the data plane has to receive more and more, which inevitably causes performance problems. A cluster often hosts multiple business systems, and an application in one business system is made aware of the configuration of all the others, so a large amount of redundant configuration is pushed for no good reason. As shown on the left side of the figure below, A is only related to B, yet it is pushed the configuration of C and D as well. Another problem is push frequency: whenever any service changes, the control plane notifies every SidecarProxy on the data plane.
Therefore, Istio 1.1 provided a solution: the Sidecar CRD (this article calls it SidecarScope, to distinguish it from the SidecarProxy implemented by Envoy). Users describe the services a SidecarProxy needs to care about in a SidecarScope, thereby shielding it from the distribution of irrelevant service configuration. The result is shown on the right side of the figure above: once a service is configured with a SidecarScope, the configuration it receives is trimmed and no longer contains irrelevant entries. At the same time, configuration changes of unrelated services no longer trigger notifications, reducing push frequency.
A typical SidecarScope looks like the following. This example allows the matching SidecarProxy to see all services in the Namespaces prod1 and istio-system, plus the ratings service in the Namespace prod2.
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: default
  namespace: prod1
spec:
  egress:
  - hosts:
    - "prod1/*"
    - "prod2/ratings.prod2.svc.cluster.local"
    - "istio-system/*"
```
The SidecarScope provided by Istio can solve the full configuration distribution problem, so the issue may seem settled. In practice, however, manually managing SidecarScope is difficult. On the one hand, the services a workload depends on are not easy to enumerate; on the other hand, a wrong configuration breaks calls. This is very unfavorable for rolling out service meshes at scale. We badly wanted a way to manage SidecarScope more intelligently.
2.2 Value
The lazy loading module was built to solve the above problems. Lazy loading plugs into the service mesh automatically, supports all of Istio's traffic governance capabilities during forwarding, and has no performance issues. It lets business teams benefit from SidecarScope without having to manage it directly.
We believe service dependencies can be divided into dynamic dependencies, which keep changing at runtime, and static dependencies, which business staff know in advance. For dynamic dependencies, we designed a mechanism that obtains them in real time and updates the SidecarScope; for static dependencies, we focused on simplifying the configuration rules to make them more user-friendly.
2.3 Dynamic configuration update
Lazy loading consists of two components, the Global-sidecar and the Lazyload Controller.
- Global-sidecar: a fallback sidecar component. When a source service cannot find the target service, the request is forwarded to the Global-sidecar, which also generates the corresponding service dependency metric
- Lazyload Controller: the control component. It processes the metrics reported by the Global-sidecar and modifies the source service's SidecarScope, adding the corresponding configuration to it
The simplified dynamic configuration update process is as follows:
- The SidecarScope of service A is initially blank and contains no configuration for service B
- Service A initiates its first access to service B. Since service A's SidecarProxy has no configuration for service B, the request is sent to the fallback component, the Global-sidecar
- The Global-sidecar holds the full service configuration, which naturally includes service B, so it forwards the request to service B. The first request succeeds, and a metric (A->B) is generated
- The Lazyload Controller observes the metric (A->B), modifies SidecarScope A, and adds service B's configuration to it
- When service A accesses service B a second time, service A's SidecarProxy already has service B's configuration, and the request goes directly to service B
The detailed flow chart is as follows
Here, ServiceFence is a CRD introduced by lazy loading. Its role is to store the metrics related to a service and to drive SidecarScope updates. For a detailed introduction, please refer to Lazy Loading Tutorial - Architecture.
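To make step 4 above concrete, here is a minimal sketch of the SidecarScope the Lazyload Controller would write for service A after its first access to service B (the service names and namespace are hypothetical, not taken from a real deployment):

```yaml
# Hypothetical result after A's first call to B: the Lazyload Controller
# appends B's host to A's SidecarScope, so the second call goes direct.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: svc-a
  namespace: demo
spec:
  egress:
  - hosts:
    - '*/svc-b.demo.svc.cluster.local' # added after the metric (A->B) is observed
  workloadSelector:
    labels:
      app: svc-a
```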
2.4 Static configuration enhancements
In the early days of lazy loading, we focused on obtaining dynamic service dependencies, which seemed smart and hands-off. In practice, however, we found that many users, for safety reasons, want to write certain rules into SidecarScope directly, that is, to configure static service dependencies. So we began to think about how to configure static dependencies flexibly.
We designed a set of practical static rules and write them into ServiceFence (yes, the same CRD that stores metrics in dynamic configuration updates; it plays a new role here). The Lazyload Controller then updates the corresponding SidecarScope according to these rules.
Three types of static configuration rules are currently provided:
- depend on all services in certain Namespaces
- depend on all services with certain labels
- depend on a specific service
Here is an example of label matching. Suppose the application is deployed as shown in the following figure.
Now enable lazy loading for the productpage service, whose known dependency rules are:
- all services with the label `app: details`
- all services with the labels `app: reviews` and `version: v2`
Then the corresponding ServiceFence is written as follows
```yaml
---
apiVersion: microservice.slime.io/v1alpha1
kind: ServiceFence
metadata:
  name: productpage
  namespace: default
spec:
  enable: true
  labelSelector: # matches service labels; multiple selectors are OR'ed
    - selector:
        app: details
    - selector: # labels within one selector are AND'ed
        app: reviews
        version: v2
```
The Lazyload Controller populates the SidecarScope according to the actual matching result. The resulting SidecarScope is shown below; all the services marked green in the figure above are selected.
```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: productpage
  namespace: default
spec:
  egress:
  - hosts:
    - '*/details.ns1.svc.cluster.local'
    - '*/details.ns2.svc.cluster.local'
    - '*/details.ns3.svc.cluster.local'
    - '*/reviews.ns2.svc.cluster.local'
    - istio-system/* # the namespace where Istio is deployed
    - mesh-operator/* # the namespace where lazyload is deployed
  workloadSelector:
    labels:
      app: productpage
```
With this, we no longer have to double-check that every service dependency is filled in before going live, let alone modify SidecarScope by hand when dependencies change. Configuring two or three ServiceFence rules gets the job done.
For a detailed introduction, please refer to Lazy Loading Tutorial - Static Service Dependency Addition
2.5 Metric types
In Section 2.3, we explained that metrics are fundamental to dynamic dependency generation. Lazy loading currently supports two metric types: Prometheus and AccessLog.
In Prometheus mode, metrics are generated by the SidecarProxy of each business application, and the Lazyload Controller queries Prometheus for them. This mode requires the service mesh to be integrated with Prometheus.
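As a rough illustration of the kind of query involved (a sketch only, using the standard Istio metric istio_requests_total; Slime's actual query and configuration schema may differ), a Prometheus handler that reveals which destinations a source workload has called might be configured like this:

```yaml
# Illustrative sketch, not the exact SlimeBoot schema: a Prometheus handler
# whose query aggregates Istio's request metric by destination service,
# exposing the dependency edges (A->B) the Lazyload Controller needs.
metric:
  prometheus:
    address: http://prometheus.istio-system:9090
    handlers:
      destination:
        query: |
          sum(istio_requests_total{source_app="$source_app",reporter="destination"}) by (destination_service)
```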
In AccessLog mode, the metric source is the Global-sidecar's AccessLog. While forwarding, the Global-sidecar generates AccessLog entries in a fixed format and sends them to the Lazyload Controller for processing. This mode has no external dependencies and is more portable.
2.6 Usage modes
The lazy loading module has two usage modes, Namespace mode and Cluster mode. In both modes the Lazyload Controller is globally unique; the difference is that in the former the Global-sidecar is deployed per Namespace, while in the latter it is deployed per Cluster. As shown below.
For N namespaces, the number of lazy loading components is O(N) in Namespace mode and O(1) in Cluster mode. We now prefer Cluster mode. As shown in the figure above, each cluster only needs two Deployments, which is concise and clear.
For a detailed introduction, please refer to Lazy Loading Tutorial - Installation and Use
3. Intelligent rate limiting
3.1 Background
Since Istio removed Mixer, implementing rate limiting in a service mesh has become difficult.
- Limited scenarios: Envoy's local rate limiting component offers only simple functionality and cannot support advanced usages such as globally averaged or globally shared rate limiting
- Complex configuration: local rate limiting relies on Envoy's built-in plugin `envoy.local.ratelimit`, so users have to face complex EnvoyFilter configuration
- Fixed conditions: there is no way to adjust the rate limiting configuration automatically according to actual conditions such as resource usage
3.2 Value
To solve these problems, we introduced the intelligent rate limiting module. It has many advantages, specifically:
- Multiple scenarios: supports local rate limiting, globally averaged rate limiting, and globally shared rate limiting
- Easy configuration: simple, readable configuration with no need to write EnvoyFilter
- Adaptive conditions: the condition that triggers rate limiting can be computed dynamically from Prometheus metrics, achieving adaptive rate limiting
3.3 Implementation
We designed a new CRD, SmartLimiter, whose configuration rules read close to natural language. The module's logic is divided into two parts:
- the SmartLimiter Controller obtains monitoring data and updates SmartLimiter CRs
- SmartLimiter CRs are converted into EnvoyFilters
The rate limiting module architecture is as follows.
Red is local rate limiting, green is globally averaged rate limiting, and blue is globally shared rate limiting. For a detailed introduction, please refer to the Intelligent Rate Limiting Tutorial - Architecture.
3.4 Local rate limiting
Local rate limiting is the most basic usage scenario: SmartLimiter sets a fixed rate limit for each Pod of the service. The underlying implementation relies on Envoy's built-in plugin `envoy.local.ratelimit`. The identifying field is `action.strategy: single`.
An example follows, which limits port 9080 of each Pod of the reviews service to 100 requests per minute.
```yaml
apiVersion: microservice.slime.io/v1alpha2
kind: SmartLimiter
metadata:
  name: reviews
  namespace: default
spec:
  sets:
    _base: # matches all Pods of the service; besides the keyword _base, this can be a subset you defined, such as v1
      descriptor:
      - action: # rate limiting rule
          fill_interval:
            seconds: 60
          quota: '100'
          strategy: 'single'
        condition: 'true' # always apply this rate limit
        target:
          port: 9080
```
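For a sense of what users are spared from writing, below is a simplified sketch of an EnvoyFilter roughly equivalent to the rule above, based on Envoy's local rate limit filter. It is illustrative only, not the exact output Slime generates:

```yaml
# Illustrative only: approximately what the SmartLimiter above expands to.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: reviews-local-ratelimit
  namespace: default
spec:
  workloadSelector:
    labels:
      app: reviews
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        portNumber: 9080
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          stat_prefix: http_local_rate_limiter
          token_bucket: # 100 requests per 60s, matching the rule above
            max_tokens: 100
            tokens_per_fill: 100
            fill_interval: 60s
          filter_enabled:
            default_value:
              numerator: 100
              denominator: HUNDRED
          filter_enforced:
            default_value:
              numerator: 100
              denominator: HUNDRED
```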
3.5 Globally averaged rate limiting
Globally averaged rate limiting takes a total limit set by the user and divides it evenly among the service's Pods. The underlying implementation relies on Envoy's built-in plugin `envoy.local.ratelimit`. The identifying field is `action.strategy: average`.
An example follows, which limits port 9080 across all Pods of the reviews service to a total of 100 requests per minute; each Pod's share is computed from `action.quota`.
```yaml
apiVersion: microservice.slime.io/v1alpha2
kind: SmartLimiter
metadata:
  name: reviews
  namespace: default
spec:
  sets:
    _base:
      descriptor:
      - action:
          fill_interval:
            seconds: 60
          quota: '100/{{._base.pod}}' # if reviews has 2 replicas, each Pod is limited to 50 requests per minute
          strategy: 'average'
        condition: 'true'
        target:
          port: 9080
```
3.6 Globally shared rate limiting
Globally shared rate limiting caps the total number of requests to all Pods of the target service. Unlike globally averaged rate limiting, it does not fix each Pod's share, which makes it better suited to unevenly distributed traffic. This scenario maintains a global counter, relying on Envoy's plugin `envoy.filters.http.ratelimit` and the global counting capability provided by an RLS (Rate Limit Service). The identifying field is `action.strategy: global`.
An example follows, which limits port 9080 across all Pods of the reviews service to 100 requests per minute in total, without dividing the quota evenly among Pods.
```yaml
apiVersion: microservice.slime.io/v1alpha2
kind: SmartLimiter
metadata:
  name: reviews
  namespace: default
spec:
  sets:
    _base:
      # rls: 'outbound|18081||rate-limit.istio-system.svc.cluster.local' # defaults to this address if not specified
      descriptor:
      - action:
          fill_interval:
            seconds: 60
          quota: '100'
          strategy: 'global'
        condition: 'true'
        target:
          port: 9080
```
3.7 Adaptive rate limiting
In the three scenarios above, the `condition` field that triggers rate limiting need not be a fixed value (true); it can also be the result of a Prometheus query, which gives adaptive rate limiting. This scenario is orthogonal to the three above. Users can customize the monitoring metrics to fetch: for example, define a handler `cpu.sum` whose value equals `sum(container_cpu_usage_seconds_total{namespace="$namespace",pod=~"$pod_name",image=""})`, then set the trigger condition to `{{._base.cpu.sum}}>100` to achieve adaptive rate limiting.
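A sketch of what such a handler definition can look like is shown below (the surrounding structure is illustrative rather than the exact SlimeBoot schema; consult the Slime docs for the authoritative fields):

```yaml
# Illustrative sketch: defines a Prometheus handler named cpu.sum whose
# value is the total CPU usage of the service's Pods; exact field names
# may differ from the actual SlimeBoot schema.
metric:
  prometheus:
    address: http://prometheus.istio-system:9090
    handlers:
      cpu.sum:
        query: |
          sum(container_cpu_usage_seconds_total{namespace="$namespace",pod=~"$pod_name",image=""})
```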
An example follows, which limits port 9080 of each Pod of the reviews service to 100 requests per minute, but only while the CPU usage value is greater than 100. Compared with the example in 3.4, `condition` is no longer always true; whether the limit kicks in is judged by the SmartLimiter Controller according to the application's actual state, which is more intelligent.
```yaml
apiVersion: microservice.slime.io/v1alpha2
kind: SmartLimiter
metadata:
  name: reviews
  namespace: default
spec:
  sets:
    _base: # matches all Pods of the service; besides the keyword _base, this can be a subset you defined, such as v1
      descriptor:
      - action: # rate limiting rule
          fill_interval:
            seconds: 60
          quota: '100'
          strategy: 'single'
        condition: '{{._base.cpu.sum}}>100' # apply this rate limit only when the service's total CPU usage exceeds 100
        target:
          port: 9080
```
4. Project architecture
This section briefly introduces Slime's project architecture to help you understand the code repositories and deployment form in Slime's multi-module scenario. The architecture is shown in the figure below.
Slime's project architecture follows the "high cohesion, low coupling" design philosophy and consists of three parts:
- Modules: independent modules that each provide a specific capability; lazy loading and intelligent rate limiting are Modules
- Framework: the base of the modules, providing the basic capabilities Modules need, such as log output and metric collection
- Slime-boot: the startup component responsible for bringing up the Framework and the specified Modules
The code repositories follow a 1+N layout: Slime-boot and Framework live in the main Slime repository slime-io/slime, while modules such as lazy loading each live in an independent repository.
The deployment form is also 1+N: a single Slime Deployment contains the common Framework plus the N modules the user wants. The advantage is that no matter how many Slime modules are used, deployment is a single Deployment, which solves the maintenance pain of having too many microservice components.
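As a rough illustration of this 1+N form, a SlimeBoot resource of roughly the following shape selects which modules a single Deployment carries (a sketch only; the image repository, tag, and module fields are placeholders, so consult the Slime docs for the exact schema):

```yaml
# Illustrative sketch: one SlimeBoot brings up the Framework plus N modules.
apiVersion: config.netease.com/v1alpha1
kind: SlimeBoot
metadata:
  name: slime
  namespace: mesh-operator
spec:
  image:
    pullPolicy: Always
    repository: docker.io/slimeio/slime # placeholder image
    tag: v0.3.0                         # placeholder tag
  module:
    - name: lazyload # enable the lazy loading module
      enable: true
    - name: limiter  # enable the intelligent rate limiting module
      enable: true
```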
5. Outlook
Slime has been open source for more than a year. Beyond new module-level features and improvements to existing ones, it has undergone a major architectural adjustment and a rework of the metric system. It is fair to say Slime's development has entered a new stage. Future plans fall into improvements to existing modules and the introduction of new modules, detailed below.
5.1 Lazy loading plans
Feature | Description | Nature | Planned release |
---|---|---|---|
Disaster recovery capability | Modify the Global-sidecar component to improve its fallback capabilities, so that lazy loading can be used in disaster recovery scenarios | Certain | 2022.Q2 |
Multi-service-registry support | Lazy loading is currently mainly adapted to Kubernetes scenarios; ServiceEntry support is planned to adapt to multi-service-registry scenarios | Certain | 2022.Q2 |
More flexible static configuration | Automate ServiceFence configuration through a higher-level abstraction and support more advanced static rules | Certain | 2022.Q3 |
Multi-protocol lazy loading | Lazy loading currently supports HTTP services; lazy loading for other protocols, such as Dubbo, is planned | Exploratory | 2022.H2 |
Cross-cluster lazy loading | Lazy loading currently supports services within a single cluster; lazy loading of cross-cluster services in multi-cluster service mesh scenarios is planned | Exploratory | 2022.H2 |
5.2 Intelligent rate limiting plans
Feature | Description | Nature | Planned release |
---|---|---|---|
Multi-service-registry support | Intelligent rate limiting is currently mainly adapted to Kubernetes scenarios; ServiceEntry support is planned to adapt to multi-service-registry scenarios | Certain | 2022.Q2 |
Outbound rate limiting | Intelligent rate limiting currently supports inbound traffic, which covers most scenarios; for completeness, outbound rate limiting is planned | Certain | 2022.Q3 |
Multi-protocol intelligent rate limiting | Intelligent rate limiting currently supports HTTP services; intelligent rate limiting for other protocols, such as Dubbo, is planned | Exploratory | 2022.H2 |
Cross-cluster intelligent rate limiting | Intelligent rate limiting currently supports same-cluster services; cross-cluster intelligent rate limiting in multi-cluster service mesh scenarios is planned | Exploratory | 2022.H2 |
5.3 New module plans
Module name (planned) | Description | Nature | Planned release |
---|---|---|---|
IPerf | A performance testing suite built for Istio, integrating the Istio testing framework, adding custom test cases, and visually comparing performance across versions | Certain | 2022.H2 |
Tracetio | Full-link automated operations for the service mesh, improving troubleshooting efficiency and providing intelligent diagnosis | Certain | 2022.H2 |
I9s | Similar to K9s: a black-screen, half-CLI, half-graphical operations tool for service mesh scenarios | Certain | 2022.H2 |
We hope to bring the above plans to you as soon as possible. Slime-related information can be found at Slime - Home, and you are welcome to communicate with us.
Related Reading
- Slime: Making Istio Service Meshes More Efficient and Smart
- Slime project address: https://github.com/slime-io/slime
- Overview of NetEase Shufan open source project: https://sf.163.com/opensource?tab=opensource
About the author: Wang Chenyu, senior server-side development engineer at NetEase Shufan, Istio community member, and Slime maintainer; familiar with Istio and Kubernetes, mainly responsible for the design and development of Slime and the NetEase Shufan Qingzhou service mesh, with years of practical experience in cloud native fields.
IstioCon 2022, the second global summit of the Istio community, will be held online. More than 80 service mesh experts from Google, NetEase, IBM, Tencent, and other companies will deliver 60+ technical sessions. At 10:40 on April 28, Yonka Fang, senior architect at NetEase Shufan, will share "Istio Push Performance Optimization Experience" with developers and users around the world. Not to be missed; register for the live stream here: https://www.crowdcast.io/e/istiocon-2022/register