Author | Yu Yu
I was fortunate to serve as the producer and a lecturer for the cloud-native session of the 2021 GIAC conference, where I organized four talks. During the process I also learned a great deal from my peers' presentations as a listener. This article can be read as a sidelight on the 2021 GIAC cloud-native session: a glimpse of where cloud-native technology stands in 2021 and where it is heading.
The term "cloud native" covers a wide range of meanings, touching every aspect of using, delivering, deploying, and operating resources efficiently.
At the system level, we can distinguish cloud-native infrastructure [such as storage, networking, and management platforms like K8s], cloud-native middleware, cloud-native application architecture, and cloud-native delivery and operations systems. The four topics of this session roughly cover these four major directions:
- "Blast Radius Governance of a Cloud Native Service" by Huang Shuai, senior technical expert at Amazon
- "Kuaishou Middleware Mesh Practice" by Jiang Tao, director of service mesh at Kuaishou's Infrastructure Center
- "Use SkyWalking to Monitor Kubernetes Events" by Ke Zhenxu, observability engineer at Tetrate
- "Dubbogo 3.0: The Cornerstone of Dubbo in the Cloud Native Era" by the Dubbogo community lead [the author]
The following summarizes the main points of each talk, based on my on-site notes and memory. Given limited time and personal ability, some mistakes are inevitable; corrections from experts are welcome.
The blast radius of cloud-native services
In my view, Huang's topic belongs to the category of cloud-native application architecture.
His talk began with an AWS outage from roughly ten years ago: the configuration center of a certain AWS service was a CP system, and a manual network change knocked out its redundant backup nodes. Even after the operators urgently rolled back the change, the configuration center remained unavailable [the number of live replicas was below half], so the other data nodes of the entire storage system concluded that the configuration data was inconsistent and refused to serve, ultimately bringing down the whole service.
The direct takeaway from the post-mortem is that the CAP theorem's definitions of availability and consistency are very strict and not well suited to real production systems. The configuration data of an online control plane should therefore guarantee availability first, on the premise of eventual consistency.
More broadly, in modern distributed systems, human operation errors, network anomalies, software bugs, and exhaustion of network/storage/compute resources are all unavoidable. Designers in the distributed era generally rely on various forms of redundancy [such as multiple storage partitions and multiple service replicas] to ensure reliability, building reliable services on top of unreliable software and hardware.
But there is a common misconception: some redundancy mechanisms can actually reduce system reliability by triggering avalanche effects.
As in the accident above, a human configuration error set off a series of highly correlated software failures that snowballed into an avalanche, which might be called the "horizontal-scaling poison effect". At this point the dimension of thinking expands further, from "providing reliable services on unreliable software and hardware" to "reducing the blast radius of accidents through various isolation measures": when an unavoidable failure occurs, keep the loss as small as possible and keep the service available within an acceptable range.
In response to this idea, Huang gave the following fault isolation methods:
- Moderate service granularity. Microservice granularity is not "the finer the better". If services are too fine-grained, there will be too many of them. The first consequence is that almost no one in the organization can trace the ins and outs of the overall service logic, which burdens maintainers: everyone only dares to make minor patches and nobody dares to attempt a substantial optimization or refactoring. The second consequence is that the number of microservice units grows multiplicatively, driving up the cost of container orchestration and deployment. Moderate granularity should balance architectural evolution against deployment cost.
- Full physical isolation. When orchestrating services, use the data center's power and network topology information to ensure that strongly related systems are deployed "not too far" and "not too close". "Not too close" means replicas of the same service are neither placed in the same rack on the same power supply nor in the same availability zone on the same network plane. "Not too far" means the deployment distance cannot be excessive; for example, replicas can be spread across multiple IDCs in the same city. These two principles balance performance against reliability.
- Random partitioning. The essence of random partitioning is to mix service requests so that requests for a given service can flow through multiple channels [queues], ensuring the service keeps being processed even when some channels go down. With random partitioning, users are scattered across multiple cells and the blast radius shrinks dramatically. It is quite similar to the Shuffle Sharding in the K8s APF (API Priority and Fairness) algorithm.
- Chaos engineering. Continuously internalize the practice of chaos engineering to trip the landmines in advance, minimizing "failure points" and improving reliability.
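The random-partitioning idea above can be made concrete with a minimal shuffle-sharding sketch. This is not Amazon's or K8s' actual implementation; the function name `shardsFor` and the parameters are illustrative. The key property is that each tenant is deterministically mapped to a small random subset of shards, so two tenants rarely share their whole subset and one tenant's failure rarely takes out another:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// shardsFor deterministically picks k of n shards for a tenant by seeding
// a PRNG with a hash of the tenant name: the same tenant always gets the
// same shard subset, and different tenants get mostly disjoint subsets.
func shardsFor(tenant string, n, k int) []int {
	h := fnv.New64a()
	h.Write([]byte(tenant))
	r := rand.New(rand.NewSource(int64(h.Sum64())))
	perm := r.Perm(n) // random but reproducible permutation of 0..n-1
	return perm[:k]
}

func main() {
	// With 8 shards and 2 per tenant there are C(8,2)=28 distinct pairs,
	// so the blast radius of any one pair of shards covers only a small
	// fraction of tenants.
	for _, t := range []string{"tenant-a", "tenant-b", "tenant-c"} {
		fmt.Println(t, shardsFor(t, 8, 2))
	}
}
```
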
Use SkyWalking to monitor Kubernetes events
Although this talk was scheduled third and belongs to the cloud-native delivery and operations category, it is closely related to the previous topic, so I describe it here.
Improving the observability of K8s systems has long been a central technical problem for the major cloud platforms. The basic observability data of a K8s system is the K8s event, which contains the full-link information of a Pod and other resources from request through scheduling to resource allocation.
SkyWalking provides multi-dimensional observability across logging/metrics/tracing. Originally it was used only to observe microservice systems; this year it added skywalking-kubernetes-event-exporter, a component dedicated to watching K8s events, then filtering and collecting them and sending them to the SkyWalking backend for analysis and storage.
During the talk, Ke spent considerable effort showing how rich the system's visualization is. What interested me personally is shown in the figure below: filtering and analyzing events in a style similar to big-data stream programming.
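That streaming style of event filtering can be sketched in a few lines. This is not the exporter's real code; the `Event` struct is a stripped-down stand-in for a `corev1.Event` and the `filter` stage stands in for the exporter's declarative filter rules:

```go
package main

import "fmt"

// Event is a simplified stand-in for a K8s event, keeping only the
// fields this sketch filters on.
type Event struct {
	Type   string // "Normal" or "Warning"
	Reason string // e.g. "FailedScheduling", "BackOff"
	Object string // involved object, e.g. "pod/api-0"
}

// filter is one stage of a streaming pipeline: it passes through only
// the events matching pred, mirroring how the exporter selects events
// before shipping them to the SkyWalking backend.
func filter(in <-chan Event, pred func(Event) bool) <-chan Event {
	out := make(chan Event)
	go func() {
		defer close(out)
		for e := range in {
			if pred(e) {
				out <- e
			}
		}
	}()
	return out
}

func main() {
	in := make(chan Event, 3)
	in <- Event{"Warning", "FailedScheduling", "pod/api-0"}
	in <- Event{"Normal", "Pulled", "pod/api-0"}
	in <- Event{"Warning", "BackOff", "pod/web-1"}
	close(in)

	// Keep only Warning events, as an alerting pipeline would.
	for e := range filter(in, func(e Event) bool { return e.Type == "Warning" }) {
		fmt.Println(e.Reason, e.Object)
	}
}
```

Stages like this compose naturally: additional `filter` or aggregation stages can be chained on the output channel, which is what makes the approach feel like stream programming.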
Its visualization and streaming analysis methods are worth borrowing for Ant's Kubernetes platform.
Kuaishou's middleware Mesh practice
In this talk, Kuaishou's Jiang Tao mainly explained Kuaishou's Service Mesh practice.
Jiang divides Service Mesh into three generations. There are of course many possible classification standards, each with its own logic. Notably, Jiang places Dapr in the third generation.
The picture above shows Kuaishou's Service Mesh architecture, which clearly borrows from Dapr's idea: sink the capabilities of basic components into the data plane, and standardize request protocols and interfaces. Some of the concrete work includes:
- Unified operations to improve observability and stability, with fault injection, traffic recording, and so on;
- Secondary development of Envoy and other components, pushing only changed data and fetching on demand, to solve the problem of a single instance carrying too many services;
- Extensive optimizations to the protocol stack and serialization protocols;
- Failure-oriented design: the Service Mesh can fall back to direct-connection mode.
What interested me personally is that Jiang mentioned three challenges Service Mesh faced when landing at Kuaishou:
- Cost: unified deployment and operations in a complex environment.
- Complexity: large scale, high performance requirements, and complex policies.
- Adoption: it is not a strong demand of the business side.
The third challenge in particular: the direct beneficiary of Service Mesh is usually not the business side but the infrastructure team, so the business side has no strong demand for it. Moreover, a real-time business platform like Kuaishou is very performance-sensitive, while Service Mesh inevitably adds latency.
To drive the adoption of Service Mesh, Kuaishou's approach was:
- First ensure system stability, rather than rushing to roll it out across the business;
- Piggyback on the company's major projects and actively participate in business architecture upgrades;
- Build extensibility on WASM and co-develop with the business side;
- Select typical landing scenarios and establish benchmark projects.
Finally, Jiang presented Kuaishou's Service Mesh plan for the second half of the year:
Clearly this roadmap is also deeply influenced by Dapr. It is not especially innovative in theory or architecture; it focuses more on standardizing open-source products and landing them at Kuaishou.
In his talk, Jiang cited two benchmarks for Service Mesh adoption: Ant Group and ByteDance. In fact, a very important reason for their success is top-level attention to advanced technology, plus vigorous cooperation from the business side.
Dubbogo 3.0: The cornerstone of Dubbo in the cloud-native era
As the lecturer of this topic, I did not dwell on the existing features of Dubbo 3.0 in my talk, but focused on two things: its Service Mesh form, and flexible services.
One of the more important points of Dubbo 3.0 is the Proxyless Service Mesh. This concept actually originated with gRPC and has been a recent focus of the gRPC ecosystem; its advantages are zero performance loss and easy microservice upgrades. gRPC already has a very rich multi-language ecosystem, but another reason it advocates this form is that, as a conservative framework emphasizing stability, its performance is not great, and in Proxy Service Mesh form its performance would be even more worrying.
The biggest weakness of the Dubbo ecosystem is that, apart from Java and Go, its multi-language support is not strong. I personally feel that blindly imitating gRPC and reimplementing full capabilities in every language is not a good idea. The dubbo-go-pixiu project produced by the Dubbogo community addresses the multi-language problem of the Dubbo ecosystem in two forms, gateway and sidecar, unifying north-south and east-west traffic in Pixiu.
Whatever form Service Mesh takes, its development in China has passed the first wave of enthusiasm. After the two benchmarks of Ant Group and ByteDance, it has entered a trough; only when deeper integration with business lets small and mid-sized vendors see its business value will it usher in a second wave of adoption.
Service Mesh itself is particularly suited to hybrid-cloud or multi-cloud environments, helping small and mid-sized vendors migrate their services onto K8s. Most such environments use large numbers of open-source software systems, which helps them avoid lock-in to any particular cloud vendor.
Dubbo 3.0's flexible services can basically be understood as back-pressure techniques. The reason Dubbo and Dubbogo provide flexible services is that in the cloud-native era abnormal nodes are the norm, and service capacity cannot be assessed accurately in advance:
- Machine specs: at large scale, heterogeneous machine specs are inevitable [e.g. due to overselling], and even machines of the same spec age at different rates;
- Complex service topology: the distributed service topology evolves constantly;
- Unbalanced service traffic: there are peaks and troughs;
- Uncertain capacity of upstream dependencies: cache/DB capacity changes in real time.
The answer: adaptive rate limiting on the server side, and adaptive load balancing on the calling side [client].
The basic idea of adaptive rate limiting is an improvement based on Little's Law from queuing theory: queue_size = limit * (1 - rt_noload/rt), where the fields mean:
- limit: the QPS ceiling over a time window.
- rt_noload: the minimum RT observed over the window.
- rt: the average RT over the window, or simply the P50 RT.
That is, two forms of RT are used to assess the appropriate capacity of a method-level service: a rising RT reflects rising overall load {cpu/memory/network/goroutine} and falling capacity, while a falling RT means the server can handle more requests.
Adaptive rate limiting: the server computes queue_size at the method level and tracks the number of goroutines currently used by the method as inflight [assuming each client request costs one goroutine]. Each time the server receives a new request for the method, it recomputes queue_size in real time; if inflight > queue_size, the request is rejected, and the difference queue_size - inflight is fed back to the client in the response packet.
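The formula and admission check above can be sketched directly. Note this is a simplified illustration of the idea still under community discussion, not Dubbogo's final implementation; the window statistics (limit, rt_noload, rt) are assumed to be collected elsewhere:

```go
package main

import "fmt"

// queueSize implements the article's formula
//   queue_size = limit * (1 - rt_noload/rt)
// limit:    QPS ceiling observed over the window
// rtNoload: minimum RT in the window (ms)
// rt:       average (or P50) RT in the window (ms)
func queueSize(limit, rtNoload, rt float64) float64 {
	if rt <= 0 || rtNoload > rt {
		return 0
	}
	return limit * (1 - rtNoload/rt)
}

// admit is the server-side check: a new request is accepted only while
// the method's inflight goroutine count stays within queue_size.
func admit(inflight int, limit, rtNoload, rt float64) bool {
	return float64(inflight) <= queueSize(limit, rtNoload, rt)
}

func main() {
	// Unloaded: rt == rt_noload, so no queueing headroom is granted.
	fmt.Println(queueSize(1000, 10, 10)) // 0
	// Under load RT doubles: half the limit may sit in the queue.
	fmt.Println(queueSize(1000, 10, 20)) // 500
	fmt.Println(admit(400, 1000, 10, 20), admit(600, 1000, 10, 20))
}
```

The intuition matches the prose above: when rt grows relative to rt_noload, the fraction (1 - rt_noload/rt) grows, signalling queuing, and inflight requests beyond that estimated queue depth are shed.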
Adaptive load balancing: the client receives a method's load headroom queue_size - inflight from the server via heartbeat packets or responses, and can use a weight-based load balancing algorithm for its calls. Of course, to avoid the herd effect putting instantaneous pressure on a single service node, a P2C algorithm can also be provided; Dubbogo may implement both for users to choose from. All of the above is still under community discussion and not final.
Summary
From 2017 to now, I have taken part in more than a dozen domestic technical conferences of various sizes, as both producer and lecturer. My speaking skills are not outstanding, but my basic time control is fine, so I at least never run over. The audience rating for the GIAC cloud-native session I hosted this time was 9.65 [rated on the same scale as all sessions], an acceptable overall result.
It is fortunate to live in this era and witness the ebb and flow of the cloud-native technology wave, and equally fortunate to work on Alibaba's platform and witness the gradual landing of Dubbogo 3.0 in various scenarios within Alibaba Cloud DingTalk.
New book recommendation
"Alibaba Cloud Cloud Native Architecture Practice" is officially produced by Alibaba Cloud and recommended by Zhang Jianfeng, president of Alibaba Cloud Intelligence, and Cheng Li, CTO of Alibaba. It comprehensively summarizes Alibaba Cloud's cloud-native architecture methodology and practical experience across dimensions such as design principles, patterns/anti-patterns, technology selection, design methods, and industry cases.
It is now 50% off for a limited time; click to buy directly 👇
http://product.m.dangdang.com/29250860.html?unionid=P-113341856m-:-dd\_1
Copyright Notice: content of this article is contributed spontaneously by Alibaba Cloud real-name registered users. The copyright belongs to the original author. The Alibaba Cloud Developer Community does not own its copyright and does not assume corresponding legal responsibilities. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it. Once verified, the community will immediately delete the suspected infringing content.