At IstioCon 2022, Fang Zhiheng, senior architect at NetEase Shufan, shared years of hands-on Istio experience from the perspective of enterprise production deployment, covering the Istio data model, the relationship between xDS and Istio push, the performance problems NetEase Shufan encountered and how they were optimized, and some related tips.
Data model
From the perspective of push, what Istio does can be compared to cooking a meal and roughly divided into the following steps:
The first step is "preparing the ingredients". Istio connects to, converts, and aggregates various service registries, unifying data from different service models into Istio's internal service model. In early Istio this was an interface defined in code: users could implement it to integrate a registry, but that approach intrudes into the Istio codebase and is convenient for developers rather than users, so later versions of Istio abandoned it. Instead, Istio defines a data model, API data structures such as ServiceEntry, and a corresponding MCP protocol. If external integration is needed, the protocol can be implemented in a standalone external component that converts the service model and transmits it to Istio, and Istio can be connected to multiple registries and service centers through configuration. Because service data can be considered the most basic element of the whole service mesh, we liken this step to preparing the ingredients.
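As a rough illustration of this "preparation" step, here is a minimal sketch in Go. The types are hypothetical stand-ins (the real API lives in Istio's networking APIs and the MCP over-xDS protocol); it only shows the shape of an external adapter that normalizes registry data before handing it to Istio.

```go
package main

import "fmt"

// Hypothetical types standing in for an external registry's model and for
// an Istio ServiceEntry-style internal service model.
type ExternalInstance struct {
	IP   string
	Port int
	Tags map[string]string
}

type ExternalService struct {
	Name      string
	Instances []ExternalInstance
}

type ServiceEntryLike struct {
	Host      string
	Ports     []int
	Endpoints []ExternalInstance
}

// convert does the "ingredient preparation": it normalizes one registry's
// service data into a unified mesh service model, which a standalone MCP
// server could then transmit to Istio.
func convert(svc ExternalService) ServiceEntryLike {
	se := ServiceEntryLike{Host: svc.Name + ".external"}
	seen := map[int]bool{}
	for _, inst := range svc.Instances {
		if !seen[inst.Port] {
			seen[inst.Port] = true
			se.Ports = append(se.Ports, inst.Port)
		}
		se.Endpoints = append(se.Endpoints, inst)
	}
	return se
}

func main() {
	svc := ExternalService{
		Name: "demo",
		Instances: []ExternalInstance{
			{IP: "10.0.0.1", Port: 8080, Tags: map[string]string{"zone": "a"}},
			{IP: "10.0.0.2", Port: 8080, Tags: map[string]string{"zone": "b"}},
		},
	}
	fmt.Printf("%+v\n", convert(svc))
}
```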
Next comes "seasoning and cooking". This may be the part users are most familiar with. Service discovery is a basic capability of any service mesh, but Istio's advantage lies in the rich and flexible governance capabilities it supports. This step applies the governance rules defined through Istio's APIs, which we compare to adding seasoning and cooking.
Then comes "plating". At this point the final xDS data to be pushed has been produced, but there is still a lot of code in Istio that adapts to various deployment and network scenarios and affects certain characteristics of the generated data; this resembles plating and arranging the dish.
The last step is "serving". We push the final data to the data plane as xDS configuration, the "language" the data plane understands; this is like serving the dish.
Generally speaking, the "dish" that ends up on the table looks completely different from the "ingredients" prepared at the start, and it is hard to recognize their original form. As anyone who has looked at xDS configuration knows, it really is very complicated.
To give another analogy, the process resembles the reconcile loop of the Kubernetes model, which is basically a "watch-react-consistency" loop. The complexity of the whole process grows almost exponentially with the number and variety of input and output resources.
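A minimal sketch of this "watch-react-consistency" loop, with hypothetical types: inputs arrive as events, and the reaction recomputes the derived output in full rather than tracking fine-grained diffs. The cost of each reaction grows with the number and variety of inputs and outputs, which is where the near-exponential complexity comes from.

```go
package main

import "fmt"

// Event is a hypothetical change notification from a watched source.
type Event struct{ Key string }

// reconcileLoop reacts to each event by recomputing derived config;
// correctness relies on re-running the computation, not on diffs.
func reconcileLoop(events <-chan Event, reconcile func(key string)) {
	for ev := range events {
		reconcile(ev.Key)
	}
}

func main() {
	events := make(chan Event, 2)
	events <- Event{Key: "service/foo"}
	events <- Event{Key: "virtualservice/bar"}
	close(events)

	reconcileLoop(events, func(key string) {
		fmt.Println("recomputed config affected by", key)
	})
}
```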
Below I have sorted out the relationship between Istio's upstream and downstream resources. Downstream refers to xDS; upstream refers to the resource types Istio defines to support its rich governance capabilities and scenario adaptation. As you can see, each downstream resource is affected by many resource types, even including things like the proxy itself, which we consider very runtime-specific data, yet it affects the final generated CDS. The end result is "a different view for every proxy". From the push perspective, we would prefer that a proxy's characteristics only affect which resources it receives, not make the content of the same resource differ per proxy; the latter is very unfavorable for push and for optimization.
xDS and Istio push
Let's take a brief look at the xDS protocol. From the push perspective it falls into three categories. The first is StoW (state-of-the-world), currently Istio's default and primary mode. It is built around eventual consistency, real-time computation (at least in the way Istio uses it), and full push. Its strengths are simplicity, robustness, and ease of maintenance: because it recalculates all data on every push, we don't need to worry much about data consistency or data loss when developing features. The price is poor performance.
The second category is delta xDS. Typical delta implementations rely on preconditions that make them easier to build, for example versioning all global resources (such as GVK + NamespacedName + version) and pushing only the newly added differences based on an offset. With Istio's current data model, however, such a thorough delta mode cannot be produced. As far as I know, the community's current delta implementation is actually a downgrade: it still computes in real time for each proxy, but caches results keyed by enumerable proxy feature values, so that two proxies with exactly the same features receive the same configuration. This approach requires the feature values to be enumerable; at present only a few resource types (CDS, EDS) and scenarios (only ServiceEntry changes) support it, otherwise it falls back to full push. When delta xDS was designed, this scenario was considered: it allows more content to be distributed than is subscribed to.
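A minimal sketch of this "downgraded delta" idea, assuming hypothetical types: generated config is cached under a key built from the proxy's enumerable features, so proxies with identical features share one computation, and anything outside the supported cases would fall back to full generation.

```go
package main

import "fmt"

// ProxyFeatures is a hypothetical stand-in for the enumerable feature
// values a cache could key on (Istio keys its xDS cache on similar values
// for a few types such as CDS and EDS).
type ProxyFeatures struct {
	Namespace string
	Revision  string
	Version   string
}

type ConfigCache struct {
	entries map[ProxyFeatures][]string
}

func NewConfigCache() *ConfigCache {
	return &ConfigCache{entries: map[ProxyFeatures][]string{}}
}

// Get returns cached config for proxies with identical features and only
// runs the expensive generation on a miss.
func (c *ConfigCache) Get(f ProxyFeatures, generate func() []string) []string {
	if cfg, ok := c.entries[f]; ok {
		return cfg
	}
	cfg := generate()
	c.entries[f] = cfg
	return cfg
}

func main() {
	cache := NewConfigCache()
	gen := func() []string {
		fmt.Println("expensive real-time generation")
		return []string{"cluster-a", "cluster-b"}
	}
	f := ProxyFeatures{Namespace: "default", Revision: "1-13", Version: "1.13"}
	cache.Get(f, gen) // generates once
	cache.Get(f, gen) // served from cache: same features, same config
}
```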
The last category is on-demand xDS, which Istio has not yet adopted. The general idea is that Envoy requests a resource only when it actually uses it. It rests on the premise that most of the distributed configuration is never used by a proxy; I am not particularly sure that premise holds, and at the moment it seems to support only VHDS.
Let's come back to Istio's perspective. There are really only two kinds of push: non-full push and full push. Non-full push happens when only endpoints have changed, so it can only do EDS push. EDS push can send only the changed clusters, and only to the proxies that watch those changed clusters. From this point of view, non-full push, or EDS push, is closer to the ideal push. But consider a scenario where a cluster is very large, say 10,000 endpoints: if only a few endpoints change, we still have to re-send the whole 10,000-endpoint cluster because of those few changes.
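The sketch below (hypothetical types, not Istio's actual code) captures the shape of this non-full push: only the changed clusters are sent, and only to the proxies watching them, but note that even a single endpoint change still re-sends its whole cluster.

```go
package main

import "fmt"

// Proxy is a hypothetical model: each proxy watches a set of clusters,
// and each cluster owns a full endpoint list.
type Proxy struct {
	ID      string
	Watched map[string]bool
}

func edsPush(changedClusters []string, endpoints map[string][]string, proxies []Proxy) {
	for _, p := range proxies {
		for _, c := range changedClusters {
			if p.Watched[c] {
				// Even if only one endpoint changed, the whole cluster's
				// endpoint list is re-sent to this proxy.
				fmt.Printf("push %d endpoints of %s to %s\n", len(endpoints[c]), c, p.ID)
			}
		}
	}
}

func main() {
	endpoints := map[string][]string{"big-svc": make([]string, 10000)}
	proxies := []Proxy{
		{ID: "sidecar-1", Watched: map[string]bool{"big-svc": true}},
		{ID: "sidecar-2", Watched: map[string]bool{"other-svc": true}}, // not pushed
	}
	edsPush([]string{"big-svc"}, endpoints, proxies)
}
```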
Whether that scenario matters depends on the scale of the enterprise, and EDS push is already very good compared with full push, where a change of any type triggers a full push: not only are all types of data re-pushed, every type is pushed in full, and to all proxies. Of course this was Istio's initial form; Istio keeps optimizing as it evolves and has introduced scoping mechanisms. The general idea is to push only the full content each proxy needs, and only to the proxies affected by the current change. That sentence sounds simple, but it is actually very hard, because as analyzed above, Istio's very complex and flexible capabilities make the relationship between upstream and downstream configuration hard to reason about.
Next, the 3Ws of Istio push. The first is when: push is event-driven and only happens when something changes externally, following the aforementioned eventual consistency, real-time computation and full push, plus a small number of actively triggered scenarios.
The second is who. The control plane tries to determine which proxies are affected by a change. Istio's implementation uses several mechanisms to decide which proxies are affected by a changed config or service: judging by type, or judging by specific, clearly identified resources and their upstream and downstream relationships. Judging by type is relatively coarse and carries a maintenance cost. For example, a change to an AuthorizationPolicy CR does not affect CDS, but that conclusion is based on the current implementation; if Istio introduces new features later, logic changes may alter the dependencies and such cases may be missed, so this approach has its limitations.
The other way, making resource dependencies explicit, is clearer. Currently it compares PushRequest.ConfigsUpdated against SidecarScope.configDependencies to decide whether a sidecar is affected by a given config change, but it also has limitations: it only supports sidecar-type proxies and only covers the resources that a Sidecar can constrain, namely Service, VirtualService, DestinationRule and Sidecar. One thing NetEase Shufan is currently working on is maintaining configDependencies at the proxy level, so that it can cover almost all resource types.
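A minimal sketch of this "who" check, with hypothetical types that mirror the idea of intersecting the changed configs of a push request with a per-proxy dependency set: the proxy is pushed to only if at least one changed config is among its dependencies.

```go
package main

import "fmt"

// ConfigKey identifies a changed resource (kind + namespace + name), in
// the spirit of the entries carried by a push request.
type ConfigKey struct {
	Kind, Namespace, Name string
}

// Proxy carries a precomputed dependency set; in Istio this role is played
// by the sidecar scope's config dependencies, and the enhancement described
// above maintains such a set per proxy for almost all resource types.
type Proxy struct {
	ID           string
	Dependencies map[ConfigKey]bool
}

func needsPush(configsUpdated map[ConfigKey]bool, p Proxy) bool {
	for key := range configsUpdated {
		if p.Dependencies[key] {
			return true
		}
	}
	return false
}

func main() {
	updated := map[ConfigKey]bool{
		{Kind: "VirtualService", Namespace: "shop", Name: "cart"}: true,
	}
	p1 := Proxy{ID: "cart-sidecar", Dependencies: map[ConfigKey]bool{
		{Kind: "VirtualService", Namespace: "shop", Name: "cart"}: true,
	}}
	p2 := Proxy{ID: "search-sidecar", Dependencies: map[ConfigKey]bool{}}
	fmt.Println(needsPush(updated, p1)) // true: affected, push
	fmt.Println(needsPush(updated, p2)) // false: skip
}
```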
The last one is what: the control plane tries to determine what content needs to be updated and what the proxy actually needs, since what the proxy needs is the downstream xDS configuration and what needs updating is that same configuration. This determines the amount of data pushed. First we determine which proxies are affected and need a push, then determine what needs to be pushed to them. This is described through dependencies between resources and workloads: some resources support a workload selector, and some are constrained by Sidecar.
Encountered performance issues and optimization experience
With the background covered, I will now share the performance problems NetEase Shufan has encountered and the corresponding optimization experience.
MCP (over-xDS) performance issues
The first problem concerns MCP (over-xDS) performance. When we say MCP we usually mean the old MCP protocol; the new version has been replaced by xDS while keeping MCP's data structures, but the name has stuck. The performance problem stems mainly from its push mode: since it is really the xDS protocol, and Istio itself implements MCP (over-xDS), one Istio Pilot can actually use another Istio Pilot as its configuration source. This may be an intentional design by the community, and the model is consistent, but it also inherits the StoW problem: any change to a resource type, even a single resource, results in a full push of all resources of that type. In our scenario both ServiceEntry and VirtualService number in the tens of thousands, so the transmission or write amplification is also on that order, and the overhead is too large.
Our optimization has two parts. The main one is supporting incremental push. Like the community, we did not implement a complete delta xDS, only the corresponding semantics, but we did achieve an effectively incremental push. There are two key points: the MCP server must support a ResourceVersion annotation so it can convey version information to the MCP client, and the MCP client must strengthen its handling of that annotation. The community already has partial support for this, but it is not complete; what we added is skipping the update when the ResourceVersion has not changed, which is a continuation of the community's idea.
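A minimal sketch of the client-side half of this optimization, assuming a hypothetical annotation key and resource type: the MCP client remembers the last ResourceVersion it applied and skips resources whose version has not changed.

```go
package main

import "fmt"

// Resource is a hypothetical resource as received over MCP (over-xDS);
// the version is carried in an annotation whose exact key is an
// assumption here.
type Resource struct {
	Name        string
	Annotations map[string]string
	Spec        string
}

const resourceVersionAnnotation = "example.io/resource-version" // hypothetical key

type MCPClient struct {
	applied map[string]string // name -> last applied ResourceVersion
}

func (c *MCPClient) Apply(res Resource) {
	ver := res.Annotations[resourceVersionAnnotation]
	if ver != "" && c.applied[res.Name] == ver {
		fmt.Println("skip", res.Name, "- ResourceVersion unchanged")
		return
	}
	fmt.Println("apply", res.Name, "version", ver)
	c.applied[res.Name] = ver
}

func main() {
	c := &MCPClient{applied: map[string]string{}}
	r := Resource{Name: "se/demo", Annotations: map[string]string{resourceVersionAnnotation: "42"}, Spec: "..."}
	c.Apply(r) // applied
	c.Apply(r) // skipped: same version, no downstream update triggered
}
```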
The other part is that our MCP server supports resource isolation by Istio revision. The current community approach to revisions is client-side filtering: the client receives all the data first and then filters it by its own revision, so the amount of data transmitted is still large. With this enhancement, the MCP server filters by the client's revision, reducing the amount of data transmitted.
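A sketch of such server-side revision isolation, with hypothetical types (the revision label name used here is Istio's istio.io/rev): instead of sending everything and letting the client filter, the server filters by the revision the client reports.

```go
package main

import "fmt"

const revisionLabel = "istio.io/rev" // Istio's revision label

type Resource struct {
	Name   string
	Labels map[string]string
}

// filterByRevision keeps only the resources belonging to the client's
// revision (plus unlabeled, revision-agnostic ones), so far less data
// crosses the wire than with client-side filtering.
func filterByRevision(all []Resource, clientRev string) []Resource {
	var out []Resource
	for _, r := range all {
		rev, ok := r.Labels[revisionLabel]
		if !ok || rev == clientRev {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	all := []Resource{
		{Name: "se/a", Labels: map[string]string{revisionLabel: "1-13"}},
		{Name: "se/b", Labels: map[string]string{revisionLabel: "1-14"}},
		{Name: "se/c", Labels: map[string]string{}},
	}
	for _, r := range filterByRevision(all, "1-13") {
		fmt.Println("send", r.Name)
	}
}
```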
Data processing performance issues with ServiceEntryStore
The second issue is the data processing performance of ServiceEntryStore. Simply put, one step in it rebuilds the instance index in full, meaning that if a single instance of one service changes, the indexes of all services are rebuilt, which is again a very large write amplification. The optimization idea is similar: we did not change Istio's main flow, but aggregated the index update operations on top of the original refreshIndexes, and the effect was very noticeable. The community has since refactored this code, and we have not verified the effect of that refactoring.
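A minimal sketch of the aggregation idea, independent of Istio's actual ServiceEntryStore code: instead of rebuilding the full index for every single instance change, changes only mark the index dirty, and one rebuild covers a whole burst.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// indexAggregator coalesces many "instance changed" signals into one full
// index rebuild, turning many rebuilds per burst into one.
type indexAggregator struct {
	mu      sync.Mutex
	dirty   bool
	rebuild func()
}

func (a *indexAggregator) MarkDirty() {
	a.mu.Lock()
	a.dirty = true
	a.mu.Unlock()
}

// Run flushes at a fixed interval; a real implementation would likely add
// debouncing and more careful shutdown handling.
func (a *indexAggregator) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			a.mu.Lock()
			if a.dirty {
				a.dirty = false
				a.mu.Unlock()
				a.rebuild() // one rebuild for the whole burst of changes
				continue
			}
			a.mu.Unlock()
		case <-stop:
			return
		}
	}
}

func main() {
	agg := &indexAggregator{rebuild: func() { fmt.Println("rebuild instance index once") }}
	stop := make(chan struct{})
	go agg.Run(50*time.Millisecond, stop)

	for i := 0; i < 1000; i++ { // a burst of instance changes
		agg.MarkDirty()
	}
	time.Sleep(120 * time.Millisecond)
	close(stop)
}
```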
Another issue is that servicesDiff did not skip the CreateTime and Mutex fields. These two fields often differ, which makes the comparison inaccurate: sometimes only the endpoints changed, but the comparison result upgrades it to a service change, so a non-full push is upgraded to a full push. This is mainly an evolution problem; in the latest community code these two fields have either been removed from Service or are no longer assigned, so the problem can be ignored.
"Avalanche effect" of initial loading of data at startup
The third problem is a bigger one. In short, Istio handles a huge number of configuration changes, especially at initial loading time, when it loads all the data and treats every new item as a change, which can lead to an avalanche effect. Avalanche effects are generally caused by positive feedback, and there are two positive-feedback scenarios here. The first is that every service change refreshes the cache of all services; since services are actually loaded one by one, the amount of refreshing is O(n^2) and the computation is huge. The optimization, as mentioned earlier, is to aggregate the refreshes.
The second scenario is more complicated. As background, service changes and configuration changes trigger a full push, whose full name is full-config-update and which includes two stages: full update and full push. The full update recomputes data; the full push takes all the affected proxies, generates data for each of them, and pushes it out. Both steps are very expensive. Istio does have jitter suppression (debouncing) for this scenario: if it detects high-frequency changes it suppresses them, and the effect is quite good. The problem is that debouncing is designed to suppress an instantaneous, concentrated burst of changes; if that burst is delayed for some reason and lasts longer, the debouncing design fails. When such data loading overlaps with other avalanche logic that increases the amount of computation, CPU contention becomes severe, because the updates and pushes triggered by content changes are both CPU-intensive and I/O-intensive; if there are proxies to push to, CPU contention makes the process even slower. These two effects feed back into each other, producing more full updates and making everything slower. It does not stop there: if a proxy connects at this point, a full-push overhead proportional to the number of proxies is added on top, the situation worsens, and configuration loading lasts even longer. Moreover, under a severe bottleneck, pushes to proxies time out, proxies disconnect and reconnect, and the process keeps deteriorating. Finally there is the business impact: the long loading time means the complete data arrives late, and a proxy that connects before then receives incomplete configuration, causing business damage, because that proxy may be reconnecting and previously had the full configuration.
At NetEase Shufan's data scale, without the optimizations above it would take ten to twenty minutes or even longer to reach a stable state, and during that period the system can be considered unusable. There are several optimization ideas. The first is to optimize the write amplification of endpoint index updates mentioned above. The second is to fix the logic by which the whole system judges readiness: Istio only marks itself ready when it believes its components are ready, but that logic is actually flawed. The third and most important point is that we introduced change management: the original simple debouncing is no longer enough to cover this scenario, so we built a manager similar to push status and added start/stop controls on top of the suppression. The final result is that, at the same scale, startup time after our optimization is about 14 seconds, with no state misjudgment in between and no business damage.
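A minimal sketch of the "change management" idea (not the actual NetEase implementation): a gate is held closed while initial loading is in progress, accumulated changes are merged, and a single merged push is issued once loading completes; plain debouncing alone cannot provide this start/stop control.

```go
package main

import (
	"fmt"
	"sync"
)

// pushManager adds explicit start/stop control on top of ordinary
// debouncing: while paused (e.g. during initial data loading), changes are
// only accumulated; Resume flushes them as one merged push.
type pushManager struct {
	mu      sync.Mutex
	paused  bool
	pending map[string]bool
	push    func(keys []string)
}

func (m *pushManager) Pause() {
	m.mu.Lock()
	m.paused = true
	m.mu.Unlock()
}

func (m *pushManager) Change(key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.paused {
		m.pending[key] = true
		return
	}
	m.push([]string{key}) // normal path: one (debounced) push per change
}

func (m *pushManager) Resume() {
	m.mu.Lock()
	keys := make([]string, 0, len(m.pending))
	for k := range m.pending {
		keys = append(keys, k)
	}
	m.pending = map[string]bool{}
	m.paused = false
	m.mu.Unlock()
	m.push(keys) // one merged push for the whole initial load
}

func main() {
	m := &pushManager{
		pending: map[string]bool{},
		push:    func(keys []string) { fmt.Println("push", len(keys), "changes") },
	}
	m.Pause()
	for i := 0; i < 10000; i++ {
		m.Change(fmt.Sprintf("service-%d", i))
	}
	m.Resume() // one push carrying 10000 changes, instead of 10000 pushes
}
```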
Numerous service/configuration changes at runtime
A related question: what if there are a large number of service changes at runtime? Although this is less likely at runtime, the conditions are different: by then Pilot is ready and Envoy is connected, so the challenge is greater. Large-scale changes generally come from three scenarios. The first is non-standard releases by the business, such as adding a large number of services or configurations in a short period of time; here we need to ensure our own robustness. The second is frequent changes and pushes caused by bugs in the upstream configuration source, which is relatively rare. The third is a restart of the upstream MCP server. Resources in Kubernetes carry a version number, so an incremental push can be based on a version check, but converted ServiceEntry resources have no persistent version number, which can lead to a large number of spurious updates.
The corresponding idea is to deal with these one by one, and NetEase Shufan introduced several optimizations. One is conditional batching. For scenarios like MCP, where the interaction pushes all data of a type at once and the change volume is very large, we simulate a transaction commit mechanism and disable push during the transaction. For non-batch scenarios, we add a debouncing mechanism that converts continuous changes into batches. Another point is to version resources: the generated resources get a persistent version number in some way, which reduces unnecessary changes, as sketched below.
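For the versioning point, a minimal sketch under the assumption that the version is derived from content: a deterministic version (here just a truncated SHA-256 of the spec) gives converted ServiceEntry-like resources a persistent version, so an upstream restart that re-sends identical data does not look like a change.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentVersion derives a stable version from the resource content, so a
// restart of the upstream that re-sends identical data produces identical
// versions and can be skipped, avoiding spurious full updates.
func contentVersion(spec []byte) string {
	sum := sha256.Sum256(spec)
	return hex.EncodeToString(sum[:8])
}

func main() {
	spec := []byte(`hosts: ["demo.external"], ports: [8080]`)
	v1 := contentVersion(spec)
	v2 := contentVersion(spec) // same data re-sent after an MCP server restart
	fmt.Println(v1 == v2)      // true: no spurious update
}
```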
Proxies connecting before the system is ready
There is another scenario, mentioned above, where sidecars connect while our system is not yet ready. This happens because Istio's current internal design for judging whether a component is ready may not be accurate enough, and the inaccuracy is magnified under large-scale data and high load. In one sentence: there is an asynchronous processing step in the middle; under normal conditions it is fast and the timing hole does not show, but under high load, with severe CPU contention, the time gap of this asynchronous step is magnified, causing Istio to mistakenly consider the whole system ready ahead of time.
The optimization is twofold: improve performance to reduce such high-load scenarios, and introduce more readiness checks for components. For example, our upstream MCP had a similar problem before, sending data early without strictly judging readiness; we added extra checks there as well.
Intensive change problem
The last case: during intensive changes such as initial loading, the high CPU water level causes the HTTPS health check to fail, so the liveness probe fails and the pod may even be restarted. The probe's default timeout is only 1 second, and the HTTPS negotiation phase itself needs some CPU; if the process keeps failing to get a time slice, or scheduling is delayed, the probe fails. The optimization is fairly crude: we simply raise the timeout to more than 10 seconds.
"Automatic" service dependency management
Now let's talk about an optimization that is independent of specific scenarios: the old topic of automated service dependency management. As mentioned earlier, reducing the amount of data pushed to a proxy is very important. One aspect is pushing less data to it, and the other is not pushing to it at all when irrelevant data changes, so dependencies matter a great deal.
Istio currently provides the Sidecar API, but maintaining it manually is unrealistic; in my view that turns a shortcoming of the infrastructure into an operational risk for users, so automation is a hard requirement, and I consider it a must-have component for a mesh. Its core idea is to generate and update dependency relationships from actual call data, while correctly handling traffic before the dependency relationships are complete.
For this, you can look at our open source Slime lazyload. On top of its dynamic dependency maintenance, it now also supports a more flexible semi-static dependency description, similar to conditional matching, which can be used to implement some higher-level features, so we prefer to call it servicefence. We have accumulated a lot of production experience in this component.
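A minimal sketch of the core idea behind this kind of component (not Slime's actual implementation): observed calls accumulate into a per-workload dependency set, from which a Sidecar-style egress allowlist can be generated; traffic to not-yet-known destinations has to be handled separately (for example via a fallback route) until the dependencies converge.

```go
package main

import "fmt"

// dependencyRecorder accumulates which destinations each source workload
// actually calls, e.g. fed from access logs or metrics.
type dependencyRecorder struct {
	deps map[string]map[string]bool // source -> set of destination hosts
}

func newDependencyRecorder() *dependencyRecorder {
	return &dependencyRecorder{deps: map[string]map[string]bool{}}
}

func (r *dependencyRecorder) Observe(source, destHost string) {
	if r.deps[source] == nil {
		r.deps[source] = map[string]bool{}
	}
	r.deps[source][destHost] = true
}

// EgressHosts produces the per-workload allowlist that would back a
// generated Sidecar resource for that workload.
func (r *dependencyRecorder) EgressHosts(source string) []string {
	var hosts []string
	for h := range r.deps[source] {
		hosts = append(hosts, h)
	}
	return hosts
}

func main() {
	r := newDependencyRecorder()
	r.Observe("cart", "payment.shop.svc.cluster.local")
	r.Observe("cart", "stock.shop.svc.cluster.local")
	fmt.Println(r.EgressHosts("cart")) // basis for cart's generated Sidecar
}
```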
Some tips
Finally, a few tips. Not everyone will run into every scenario, but some of the ideas may be useful references. For example, we have scenarios with unbalanced connections. Natively, Istio only has rate limiting for self-protection: if a large number of connections or requests hit a single node or a single control plane pod, it rate-limits to protect itself, but that does not rebalance the imbalance. So we built a component that rebalances the connections, and it has also been open sourced.
Another scenario is rather special: with a huge number of endpoints, memory usage becomes a big problem. In that case, consider string-pool optimization for enumerable content, especially labels with high repetition.
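A minimal sketch of the string-pool idea: highly repetitive label keys and values are deduplicated through an interning map, so that endpoints carrying the same label can share one backing string instead of each holding its own copy (the benefit applies when the values arrive as separate allocations, e.g. decoded from the wire).

```go
package main

import "fmt"

// stringPool interns strings so that identical label keys/values across a
// huge number of endpoints reference one shared instance.
type stringPool struct {
	pool map[string]string
}

func newStringPool() *stringPool { return &stringPool{pool: map[string]string{}} }

func (p *stringPool) Intern(s string) string {
	if v, ok := p.pool[s]; ok {
		return v
	}
	p.pool[s] = s
	return s
}

func main() {
	p := newStringPool()
	labels := make([]string, 0, 100000)
	for i := 0; i < 100000; i++ {
		// In real usage the input would be freshly decoded per endpoint;
		// interning lets all of them share one stored value.
		labels = append(labels, p.Intern("topology.kubernetes.io/zone=cn-east-1a"))
	}
	fmt.Println(len(p.pool), "unique strings for", len(labels), "labels")
}
```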
As mentioned earlier, with a very large cluster, even EDS push becomes a problem. The solution is to split large resources into smaller ones. There was once a proposal called EGDS with some discussion in the community; although it did not make it into the mainline in the end, it is a useful reference for solving the problem, and Kubernetes has a similar idea (EndpointSlice).
Finally, with a super-large service registry, the control plane may no longer be able to hold all the configuration and service data. In that case, consider sharding the control plane into groups and connecting sidecars according to the affinity of their dependencies, so that each control plane handles only part of the configuration.
Epilogue
That's all I want to share today. To close with a personal takeaway from several years of service mesh optimization: many of Istio's designs do follow the pragmatic path of today's software engineering, achieving rapid evolution first with a workable approach; of course, the debts incurred along the way still have to be repaid in later practice. Thanks!
Related Links
How NetEase Shufan realizes the evolution of microservice architecture based on Istio
NetEase's Service Mesh Road: Will Istio be the next K8s?
NetEase Open Source Slime: Make Istio Service Mesh More Efficient and Smart
Slime 2022 Outlook: Packing Istio's Complexity into a Smart Black Box
Slime open source address: https://github.com/slime-io/slime
About the Author
Fang Zhiheng, senior architect at NetEase Shufan, is responsible for the Qingzhou Service Mesh. He has participated in the service mesh construction of several technology companies and the evolution of related products, has worked on infrastructure and middleware R&D for many years, and has rich experience in Istio operation and maintenance, feature extension, and performance optimization.
From May 13 to June 15, 2022, the Loggie community is running the Loggie Geek Camp open source collaboration event for cloud native, observability, and log technology enthusiasts, to experience the essence of open source culture and the creativity of the open source community and build the future of cloud-native observability. It includes four types of tasks, including providing user cases, catching bugs, making improvements, and submitting features. A submission is considered successful once it passes community review, and outstanding participants will be recognized by NetEase Shufan and the Loggie community. Welcome to learn more and participate: https://sf.163.com/loggie