Author: Mo Yuan
Foreword
China Property & Casualty Insurance is a leader in the domestic Internet finance industry. In the course of its cloud-native migration, it has moved a large number of multi-tenant SaaS businesses to microservices and containers. Its business has typical financial characteristics, which place higher demands on architecture stability, resource cost efficiency, and data security: costs must come down and efficiency must go up without compromising business stability. During the migration, the team ran into several challenges: costs were hard to attribute clearly across multi-tenant services, idle or wasted resources were hard to find, and optimization strategies were hard to balance against business stability. Based on Alibaba Cloud's enterprise cloud-native IT cost management solution, China Property & Casualty's engineering team built a mature enterprise IT cost management process and system. Using out-of-the-box business cost splitting, visual discovery of idle resources, and optimization strategies such as elastic scaling and co-location, the team reduced the cluster's idle resource rate from 30% before cloud migration to less than 10%.
China Property & Casualty's cloud IT cost management work was also named a 2022 Excellent Case of Cloud Management and Cloud Network by CAICT: https://mp.weixin.qq.com/s/XBOcLcW9C0TO9mKhH7svbw
China Property & Casualty's cloud-native journey
A cloud-native migration is currently the best path for enterprises moving to the cloud. As a leader in the domestic Internet finance industry, China Property & Casualty Insurance is also driving its digital transformation through microservices and cloud-native approaches. Before going cloud native, its business had the following problems:
- Management authority over business resources was scattered across teams. Production and test environments were owned by the business teams, and those teams kept redundant resources for temporarily verified versions.
- Some services are clearly periodic, and capacity differs greatly between peak and trough, so resources ran at low load for long periods.
- The stress testing environment needs a large number of temporary machines within a given period of time. Reusing idle resources meant moving machines and coordinating resources across teams, which carried high process overhead and cost.
- There were no quantifiable indicators for finding business waste, and a simple utilization figure cannot serve as a standard for evaluating waste.
To solve these problems, the engineering team of China Property & Casualty Insurance moved the business to microservices and containers and migrated it to Alibaba Cloud Container Service. Based on the Alibaba Cloud enterprise cloud-native IT cost management solution, the team built a mature enterprise IT cost management process and system that shortens the IT cost management cycle from quarterly and monthly to weekly and daily. With out-of-the-box cost visualization and allocation capabilities, each team's resource waste can be measured in real time, so that cost reduction and efficiency gains are driven by data.
The key steps in the optimization process were:
- Logical management, asset splitting, and waste measurement of multi-tenant business through namespaces
The engineering team of China Property & Casualty Insurance manages the multi-tenant SaaS businesses that share a cluster by using the namespace as the logical unit. By adjusting the ratio between Request and Limit, the team turned the original model, in which each business managed its own capacity, into pooled, unified management that improves resource utilization. With the namespace cost accounting capability provided by Alibaba Cloud's enterprise cloud-native IT cost management solution, expenses can easily be allocated to the different businesses within a cluster, which supports both capacity management and financial management.
Discover cluster waste and cost distribution of each application through ACK cost analysis
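To make the idea of namespace-level cost allocation concrete, here is a minimal sketch. It is not the ACK cost analysis implementation; the function name, the input fields, and the 50/50 weighting between CPU and memory are all assumptions chosen for illustration. It simply splits a cluster bill across namespaces in proportion to the CPU and memory their pods request.

```python
from collections import defaultdict


def allocate_cost_by_namespace(pods, cluster_cost, cpu_weight=0.5):
    """Split cluster_cost across namespaces by requested CPU and memory.

    pods: list of dicts with hypothetical keys "namespace",
          "cpu_request" (cores) and "mem_request_gib".
    cpu_weight: assumed share of the bill attributed to CPU;
                the rest is attributed to memory.
    """
    total_cpu = sum(p["cpu_request"] for p in pods)
    total_mem = sum(p["mem_request_gib"] for p in pods)
    if total_cpu <= 0 or total_mem <= 0:
        raise ValueError("pods must request a positive amount of CPU and memory")

    shares = defaultdict(float)
    for p in pods:
        # Each pod's share is a weighted mix of its CPU and memory fractions.
        shares[p["namespace"]] += (
            cpu_weight * p["cpu_request"] / total_cpu
            + (1 - cpu_weight) * p["mem_request_gib"] / total_mem
        )
    return {ns: cluster_cost * share for ns, share in shares.items()}


# Illustrative data only: two tenants sharing one cluster.
pods = [
    {"namespace": "tenant-a", "cpu_request": 4, "mem_request_gib": 8},
    {"namespace": "tenant-a", "cpu_request": 2, "mem_request_gib": 4},
    {"namespace": "tenant-b", "cpu_request": 2, "mem_request_gib": 4},
]
costs = allocate_cost_by_namespace(pods, cluster_cost=1000.0)
print(costs)
```

In this example tenant-a requests three quarters of both CPU and memory, so it is charged three quarters of the bill. Allocating by request rather than by usage is a design choice: it charges tenants for what they reserve, which is what drives cluster size.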
- Full-link stress testing for capacity estimation and reliability verification
During the cloud-native migration, the engineering team of China Property & Casualty Insurance found that the capacity estimates submitted by business teams deviated greatly from actual resource usage. The team therefore used PTS (Alibaba Cloud's full-link stress testing service) to run full-link stress tests in a high-fidelity simulated environment, determine the system's load level and bottlenecks, and estimate resource requirements reasonably. The cost scale is then modeled with quantitative indicators, so that it can be controlled on the premise that cluster capacity remains reliable.
- Establish cost and waste metrics to identify waste
Resource utilization alone is not a convincing measure of whether a business is wasting resources. A business team's redundant capacity is generally sized according to the business peak, the utilization range in which the program runs efficiently, future business growth, and other factors. When the traditional cost governance cycle is as long as a month, a quarter, or even a year, redundancy is the best way to ensure stability. To solve this problem, the engineering team of China Property & Casualty Insurance proposed an application waste degree model. It combines several factors, including resource utilization, peak-to-trough amplitude, whether business circuit breaking has been introduced, and changes in the business cost trend, and quantifies the waste ratio numerically. This exposed the real waste within the cluster.
Discover the waste of cluster applications through ACK cost analysis
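The source does not give the formula behind the application waste degree model, so the following is only a hypothetical sketch of the general idea: judge waste against the capacity needed to serve the peak at a safe utilization, rather than against average utilization. The function name, the 0.7 target peak utilization, and the sample data are assumptions, and the circuit breaker and cost trend factors mentioned above are left out.

```python
from statistics import mean


def waste_report(usage_cores, requested_cores, target_peak_utilization=0.7):
    """Summarize how much of a request looks wasted (illustrative only).

    usage_cores: sampled CPU usage in cores over the observation window.
    requested_cores: CPU cores requested by the application.
    target_peak_utilization: assumed utilization the application may
        safely reach at its peak.
    """
    if requested_cores <= 0 or not usage_cores:
        raise ValueError("need a positive request and at least one sample")

    peak = max(usage_cores)
    trough = min(usage_cores)
    # Capacity that would serve the observed peak at the target utilization.
    needed = peak / target_peak_utilization
    return {
        "avg_utilization": mean(usage_cores) / requested_cores,
        "peak_to_trough": peak / trough if trough > 0 else float("inf"),
        "waste_ratio": max(0.0, 1.0 - needed / requested_cores),
    }


# Illustrative data: average utilization is low, but so is the peak,
# so roughly half of the 10 requested cores cannot be justified.
report = waste_report([1.0, 2.0, 3.5], requested_cores=10)
print(report)
```

An application with the same low average but a peak near its request would score close to zero waste here, which is the distinction a plain utilization figure cannot make.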
- Time-sharing between online business and off-peak temporary jobs
China Property & Casualty Insurance runs a large number of temporary tasks and simulation tasks, which are short-lived and resource-intensive. The engineering team found that the real utilization of the cluster stays at a relatively low level during the day, and that this idle time is sufficient for running simulation tasks and temporary tasks. Time-sharing is combined with a preemption strategy that brings these tasks up and down quickly. This raises overall cluster utilization and also allows temporary jobs to be taken offline when burst traffic arrives, so that overall service stability is preserved.
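The time-sharing and preemption idea can be sketched as a simple planning function. This is an illustration under stated assumptions, not the scheduler the team used: the function name, the job list, and the reserved headroom are hypothetical, and jobs are listed in priority order. Temporary jobs only receive whatever capacity online traffic leaves free, and they are evicted when online demand grows.

```python
def plan_offline_jobs(capacity_cores, online_cores, offline_jobs, reserve_cores=0):
    """Decide which temporary jobs may run next to online traffic.

    capacity_cores: total schedulable cores in the cluster.
    online_cores: cores currently needed by online services.
    offline_jobs: list of (name, cores) tuples in priority order.
    reserve_cores: assumed headroom always kept free for bursts.
    Returns (keep, evict) lists of job names.
    """
    budget = capacity_cores - reserve_cores - online_cores
    keep, evict = [], []
    for name, cores in offline_jobs:
        if cores <= budget:
            keep.append(name)
            budget -= cores
        else:
            # Not enough idle capacity left: this job is preempted.
            evict.append(name)
    return keep, evict


jobs = [("sim-1", 30), ("sim-2", 30), ("tmp-1", 20)]

# Quiet period: online services need little, so every job fits.
quiet = plan_offline_jobs(100, 10, jobs, reserve_cores=10)

# Burst traffic: online demand jumps, so lower-priority jobs are evicted.
burst = plan_offline_jobs(100, 50, jobs, reserve_cores=10)

print(quiet)
print(burst)
```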
- Scheduled scaling to provision core business resources in advance
Some of China Property & Casualty's businesses have clear periodicity with distinct peaks and troughs, and their resource needs differ several times over between the two. While keeping a certain amount of redundancy, scheduled scaling releases more schedulable cluster resources, so that other temporary jobs can run faster.
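As an illustration of scheduled scaling, the sketch below maps the time of day to a desired replica count. The source does not say which mechanism or which schedule was used, so the time window, the replica counts, and the function name are all hypothetical. The point is that the window starts before the expected peak, so capacity is in place in advance, and that replicas fall back to a baseline afterwards so the freed resources can serve other jobs.

```python
from datetime import time

# Hypothetical schedule: (window start, window end, replicas).
# The window opens ahead of the assumed business peak.
SCHEDULE = [
    (time(8, 30), time(20, 0), 12),
]
BASELINE_REPLICAS = 3  # assumed off-peak size, including some redundancy


def desired_replicas(now, schedule=SCHEDULE, baseline=BASELINE_REPLICAS):
    """Return the replica count that should be running at time-of-day `now`."""
    for start, end, replicas in schedule:
        if start <= now < end:
            return replicas
    return baseline


print(desired_replicas(time(9, 0)))
print(desired_replicas(time(23, 0)))
```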
- Idle resource recovery and elastic delivery for the business
Once resources are pooled, some nodes end up with a low scheduling level because their scheduling policies are not labeled or constrained. Identifying nodes whose level has been low for a long time reveals the idle resources in the cluster and reduces waste. Resources that are needed only infrequently are delivered elastically instead, which further improves cost efficiency.
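Finding nodes that have been at a low level for a long time can be sketched as follows. This is a minimal illustration, not the method used by the team or by ACK: the function name, the 20% threshold, and the requirement that 90% of samples fall below it are assumptions. Each node carries a series of sampled allocation ratios (requested resources divided by allocatable resources).

```python
def find_idle_nodes(node_samples, threshold=0.2, min_fraction=0.9):
    """Return nodes whose allocation ratio stayed low for most of the window.

    node_samples: dict mapping node name to a list of allocation ratios
        in [0, 1], sampled over the observation window.
    threshold: assumed ratio below which a sample counts as "low".
    min_fraction: assumed share of samples that must be low.
    """
    idle = []
    for node, samples in node_samples.items():
        if not samples:
            continue  # no data, so make no claim about this node
        low = sum(1 for s in samples if s < threshold)
        if low / len(samples) >= min_fraction:
            idle.append(node)
    return sorted(idle)


# Illustrative data: node-2 is busy only briefly, node-3 is always busy.
samples = {
    "node-1": [0.05, 0.10, 0.08, 0.12, 0.07, 0.09, 0.11, 0.06, 0.10, 0.08],
    "node-2": [0.10, 0.15, 0.80, 0.85, 0.12, 0.10, 0.09, 0.11, 0.13, 0.10],
    "node-3": [0.70, 0.75, 0.72, 0.80, 0.78, 0.74, 0.71, 0.76, 0.79, 0.73],
}
idle_nodes = find_idle_nodes(samples)
print(idle_nodes)
```

Requiring most samples, rather than the average, to be low avoids flagging a node that is briefly but heavily used, such as node-2 above.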
The infrastructure team of China Property & Casualty Insurance has taken its online production business from a traditional IT architecture to cloud native, and the business volume has doubled along the way. After a series of cloud cost optimizations, containerizing one business reduced its total configuration by 232C 400G, a saving equivalent to about seven 32C 64G ECS instances, and cut server costs by about 20%. After optimizations such as co-location and elastic scaling across business peaks and troughs, the average cost optimization rate can reach about 15%.
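The "about seven ECS instances" figure can be checked with simple arithmetic on the numbers quoted above. The variable names are only for illustration.

```python
# Reduction reported for one containerized business: 232 cores, 400 GiB.
saved_cores, saved_mem_gib = 232, 400
# Size of the reference ECS instance: 32 cores, 64 GiB.
ecs_cores, ecs_mem_gib = 32, 64

by_cpu = saved_cores / ecs_cores          # 7.25 instances by CPU
by_mem = saved_mem_gib / ecs_mem_gib      # 6.25 instances by memory

print(by_cpu, by_mem)
```

By CPU the saving corresponds to 7.25 instances and by memory to 6.25, which is consistent with the stated "about seven".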
In closing
In some respects, the architecture optimization strategy of the China Property & Casualty Insurance team is simple and practical, yet it reduced the cluster's idle resource rate from 30% before cloud migration to less than 10%. Enterprise IT cost management has never been a testing ground for new technologies. Choose a plan that suits your own situation, quantify the results with data, and let that evidence drive cost reduction and efficiency gains across the enterprise.