Author: Mo Yuan
Cloud native technology, cost reduction, and efficiency gains
In 2020, the COVID-19 pandemic swept the world. Many enterprises and factories shut down and supply chains were interrupted, dealing a heavy blow to the global economy. 65% of enterprises began to consider moving to the cloud to strengthen their IT capabilities and prepare for other systemic risks that may appear in the future. Cloud native technology, as the most advanced way to adopt the cloud, has become the preferred choice for most enterprises carrying out IT transformation.
According to the 2020 "Cloud Native Comes of Age" survey by well-known consulting firm Capgemini, only 15% of enterprises have built new applications in a cloud-native environment, but this proportion will increase to 52% in the next three years. In the report, companies that deploy more than 20% of their applications in cloud-native environments are defined as leaders. How do they view cloud-native technologies?
87% of the surveyed companies said cloud native improves efficiency and reduces costs. 84% said cloud native drives a better customer experience. 80% said the time needed to bring new products and services to market has dropped significantly.
Yet according to the 2021 CNCF "FinOps Kubernetes Report", 68% of respondents said their computing resource costs increased after migrating to Kubernetes, and 36% said costs soared by more than 20%. Although leading enterprises broadly agree that cloud native reduces costs and increases efficiency, many enterprises still run into obstacles during cloud-native transformation and even end up paying more. Why do the results remain so far from the ideal after cloud-native technology has been adopted?
Starting with a real case
Raymond leads the IT platform team of an Internet e-commerce business. Over the past two years, he has led the team in moving all of the company's business to cloud native. His reasons for choosing cloud-native technology as the platform architecture were simple. Cloud-native technologies, represented by microservices, containers, and DevOps, can deliver and operate different types of applications in a unified way, reducing management costs. Pipelines automate building and delivery, speeding up research and development. Container technology enables resource sharing and elasticity between applications, reducing resource waste. Co-location and preemption between different types of applications squeeze even more utilization out of cluster resources.
Business platform | Business description |
---|---|
E-commerce main site | Cyclical business. Traffic is low during weekday daytime and peaks on weekday evenings and holidays. Big promotions bring traffic spikes. |
Big data platform | Covers data lake ad hoc queries and report/ETL jobs. Ad hoc queries mainly run on Presto and are mainly submitted by data developers through workflows; ETL jobs are mainly offline Spark jobs. |
Micro Merchant Platform | Multi-tenant SaaS business. Each tenant has an independent quota and usage. |
Live platform | Cyclical business. Traffic is low during weekday daytime and peaks on weekday evenings and holidays, with unpredictable traffic spikes. |
Transcoding/Training Platform | Temporary tasks and fragmented jobs that run for a short time. |
Raymond's team is responsible for the stable operation of the company's five major platforms. According to the characteristics of the business, the convenience of operation and maintenance, the level of security, and the consideration of cost, Raymond has divided the business into three clusters:
- Cluster A - Main Site/Transcoding Cluster
The main site has high stability requirements. The whole cluster is planned as a static node pool, and scheduled scaling expands capacity ahead of business peaks. When load is low during the day, transcoding workloads are co-located in the cluster to reuse the idle capacity on a time-shared basis, improving resource efficiency.
- Cluster B - Live/Big Data Cluster
Live streaming and big data share a cluster because data lake ad hoc queries, live streaming, and big data ETL jobs all consume a great deal of computing resources per unit of time, while their capacity needs vary quite randomly. A highly elastic setup suits both businesses.
- Cluster C - Micro Merchant Cluster
The micro-merchant business runs in its own cluster, mainly for security reasons: tenant data and business data are isolated. A separate cluster also makes cost accounting easier.
As a very senior cloud native expert, Raymond made technology selection, cluster splitting, and optimization strategy decisions that all seemed impeccable. In the first month after the business went cloud native, everything ran stably and efficiently and seemed to be heading toward the expected results.
"Last month's bill increased by 70%?" Raymond muttered to himself after getting the latest bill. What was the problem?
Difficulties in Enterprise Cloud Native IT Cost Governance
Previously, Raymond's team used a more traditional, mature model of static enterprise IT cost governance. The cycle of this model is usually monthly or quarterly. Through the implementation of four stages: resource planning, cost estimation, cost budgeting, and cost control, IT assets are purchased to achieve the goal of enterprise IT cost governance.
The advantage of this model is that the cost budget obtained from each IT cost governance is fixed, which is very friendly from the perspective of IT asset management. However, the disadvantages are also obvious. When the business has frequent changes in capacity, it may cause a large deviation in the cost estimation stage, resulting in a lot of waste.
Cloud native techniques such as intelligent scheduling, elastic scaling, co-location, and time-shared preemption are often used to reduce costs and increase efficiency. Adopting any new technology inevitably means transforming and optimizing the existing system architecture, and the dynamic changes introduced by a cloud-native architecture often break the enterprise's traditional IT cost management system, so that IT costs slip out of control. Once IT cost management is out of control, the various optimization strategies have nothing to stand on.
When Raymond tried to find clues in the bill, what he got was hundreds of pages of monthly bill details. Tracing the abnormal expenses back to a specific application or department from those details is almost impossible. The problem Raymond ran into is one that almost everyone in charge of a cloud-native architecture has to overcome.
So, what makes enterprise cloud-native IT cost governance difficult?
- Mismatch between the life cycles of business units and billing units
In the traditional enterprise IT cost governance model, business units map clearly onto billing units. For example, a portal website consists of two ECS instances, an SLB gateway at the access layer, and an RDS database. Business units and billing units correspond one-to-one, and the bill is the cost.
However, in a cloud-native scenario, when an application is deployed in a container cluster such as Kubernetes, all resources are pooled. The smallest unit of measurement for a business is a Pod, and the life cycle of a Pod does not match the life cycle of the node that actually generates the bill. In most scenarios, when the application is redeployed, its Pods are rescheduled to other nodes, so the business unit and the billing unit can no longer be matched one-to-one in terms of logic, space, or time.
As a result, business departments find it difficult to get concrete figures when they want to measure, plan, or estimate the budget of a business.
- The contradiction between dynamic resource delivery and static capacity planning
In traditional enterprise IT governance models, the relationship between planning/budgeting and resource delivery is static. Business departments submit budgets on a monthly, quarterly, or annual cycle, and the IT department then handles procurement and allocation centrally. To address the resource waste of static capacity planning, container platforms adopt technologies and solutions such as elastic scaling, controlling capacity costs through dynamic resource delivery.
However, dynamic resource delivery can introduce other cost traps in production. Typically, the traditional static planning model relies mostly on annual or monthly subscription billing, while the dynamic resource delivery model mixes annual/monthly subscription with pay-as-you-go. Some scenarios also bring in special payment options such as savings plans, reserved instance coupons, and spot instances. The unit price of annual or monthly subscription is about 30-50% of the pay-as-you-go price, so when the proportion of dynamically delivered resources is unreasonable, a lot of IT spending can be wasted.
In addition, budgeting and procurement in the traditional static capacity planning model are completed in a single phase, so IT cost governance does not need to track cost trends. Once dynamic resource delivery is used at scale, enterprise IT administrators need to watch not only the total cost but also the cost trend, or unexpected large overruns can occur.
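The pricing trade-off between subscription and pay-as-you-go capacity can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the subscription ratio of 0.4 is picked from the 30-50% range mentioned above, and the demand profile, the price of 1.0 per core-hour, and the function name `blended_cost` are hypothetical.

```python
from typing import Sequence


def blended_cost(demand_cores: Sequence[float], reserved_cores: float,
                 payg_price: float, subscription_ratio: float = 0.4) -> float:
    """Cost of serving an hourly demand profile with a fixed subscription
    baseline plus pay-as-you-go capacity for anything above the baseline."""
    subscription_price = payg_price * subscription_ratio
    # Subscribed cores are billed every hour, whether they are used or not.
    fixed = reserved_cores * subscription_price * len(demand_cores)
    # Demand above the baseline is bought on demand at the full price.
    burst = sum(max(d - reserved_cores, 0.0) for d in demand_cores) * payg_price
    return fixed + burst


# Hypothetical day: 18 quiet hours at 20 cores, 6 peak hours at 100 cores.
demand = [20.0] * 18 + [100.0] * 6
costs = {r: blended_cost(demand, r, payg_price=1.0) for r in (0, 20, 100)}
best = min(costs, key=costs.get)
print(costs, "cheapest baseline:", best)
```

In this made-up profile, subscribing to the quiet-hour baseline and bursting with pay-as-you-go is cheaper than either extreme, which is why the ratio between the two delivery models matters.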
- Adapting the enterprise IT cost governance model to cloud native architecture
In terms of cost control, the traditional IT cost governance model focuses on efficiency: raising machine utilization reduces the cost of the next capacity planning cycle. In cloud-native IT cost management, efficiency improvement and cost reduction happen at the same time. Enterprises can adjust resource quotas through monitoring and intelligent recommendations to improve resource utilization, and can reduce resource costs through elastic scaling and dynamic resource delivery. Doing both at once greatly shortens the cycle of the enterprise IT cost governance model and places more demands on budget management, quota management, cost trend forecasting, and cost trend alerting.
- Side effects of misusing cost optimization schemes
The traditional IT cost governance model has relatively simple optimization methods, usually guided by indicators such as resource utilization. In cloud-native scenarios, new optimization methods keep emerging. However, any optimization scheme challenges the stability of the existing architecture. For example:
- When using elastic scaling, consider how well the scaling sensitivity matches business traffic peaks, whether the business goes offline gracefully during scaling, and whether scaling could create a cost black hole (large resource waste caused by abnormal events, such as CDN resource overuse triggered by a DDoS attack).
- When using elastic supply for big data, consider whether the cluster has idle resources that can be reused, whether temporary data jobs run so long that the resource billing model becomes unreasonable, and whether resource utilization during elastic supply falls short of expectations.
Essentially, optimization in cloud-native scenarios centers on the dynamic nature of scheduling and resources. Moving workloads, time-sharing, preemption, and scaling improve resource utilization and reduce the overall cluster water level or the total core-hour cost. Most optimizations are tied to a specific domain scenario. Before implementing cloud-native IT cost optimization solutions, enterprises need to measure and evaluate both the risks brought by architectural changes and the expected benefits of the optimization.
These four problems are unavoidable obstacles to IT cost management during every enterprise's cloud-native transformation. They slow the pace of transformation and trouble many cloud-native technology leaders like Raymond. Cloud-native IT cost management solutions emerged to solve them.
Alibaba Cloud Enterprise Cloud Native IT Cost Governance
According to a Gartner report [1], Alibaba Cloud Container Service ranks first alongside AWS, making Alibaba Cloud the cloud service provider with the most complete container product portfolio in the world. Alibaba Group began promoting cloud-native technology internally as early as 2006. Sixteen years of cloud-native practice allow Alibaba Cloud to share its thinking and understanding of cloud native with enterprises and help them carry out IT transformation.
In recent years, as enterprises have accelerated cloud adoption, the concept of cloud financial management (FinOps) has been raised and adopted by more and more of them. FinOps brings best practices and culture together to improve an organization's ability to understand cloud costs. It is a practice that brings financial accountability to cloud spending, enabling teams to make informed business decisions. FinOps enhances collaboration between IT, engineering, finance, procurement, and the business, and enables IT to evolve into a service organization focused on using cloud technology to add value to the business. When cloud native technology meets FinOps, the concept of cloud native IT cost governance (Cloud Native FinOps) is born: an evolution of FinOps for cloud native scenarios.
Alibaba Cloud Container Service has launched an enterprise cloud-native IT cost management solution that provides enterprise IT cost management, cost visualization, and cost optimization in cloud-native scenarios. The solution has five core functions:
Core function 1: unique cost allocation and estimation model for cloud-native container scenarios
To solve the mismatch between the life cycles of business units and billing units in container scenarios, Container Service proposes a cost estimation model that combines billing and metering. The model takes into account cost policies (payment type, savings plans, vouchers, user discounts, spot price fluctuations), allocation factors (CPU, memory, GPU cards, GPU memory, and so on), and resource forms (ECS/ECI/HPC) to estimate cost at the Pod level and allocate cluster cost proportionally. Bill analysis aggregates all resource costs of the cluster over a period, and combining that with Pod-level cost allocation yields a complete cost allocation and estimation model for cloud-native container scenarios.
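To make the idea of Pod-level allocation concrete, here is a minimal Python sketch. It is not the actual model used by Container Service: the equal CPU/memory weights, the class names `Node` and `PodUsage`, and all numbers are assumptions chosen only for illustration. A Pod that is rescheduled across nodes would simply contribute one `PodUsage` record per node it ran on.

```python
from dataclasses import dataclass


@dataclass
class Node:
    hourly_cost: float   # taken from the bill, after discounts (hypothetical value)
    cpu_cores: float
    memory_gib: float


@dataclass
class PodUsage:
    name: str
    cpu_request: float
    memory_request_gib: float
    hours_on_node: float  # how long the Pod actually ran on this node


def estimate_pod_cost(pod: PodUsage, node: Node,
                      cpu_weight: float = 0.5,
                      memory_weight: float = 0.5) -> float:
    """Charge the Pod a weighted share of the node's cost for the time it ran.
    The weights are the allocation factors; 0.5/0.5 is an assumed split."""
    cpu_share = pod.cpu_request / node.cpu_cores
    memory_share = pod.memory_request_gib / node.memory_gib
    share = cpu_weight * cpu_share + memory_weight * memory_share
    return share * node.hourly_cost * pod.hours_on_node


node = Node(hourly_cost=2.0, cpu_cores=16, memory_gib=64)
pods = [PodUsage("web", 4, 16, 10), PodUsage("job", 8, 16, 2)]
pod_costs = {p.name: estimate_pod_cost(p, node) for p in pods}
print(pod_costs)
```

Node cost that no Pod claims (idle capacity) is left unallocated in this sketch; a real model has to decide whether to spread it across workloads or report it separately as waste.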
Core function 2: multi-dimensional cost insight, trend prediction, root cause drill-down
Cost insight is supported in four dimensions: cluster, namespace, node pool, and application (label wildcard matching).
- The cluster dimension focuses on the distribution of cloud resources, the trend of resource costs, the ratio of cluster water level to waste, and the trend and forecast of cluster costs. It helps IT administrators judge the trend of cost consumption accurately and prevent budget overruns.
- The namespace dimension uses trend correlation analysis to help department administrators estimate costs, drill down into cost waste, and improve departmental resource utilization.
- The node pool dimension focuses on resource cost planning and governance. Correlation analysis of instance type, core-hour unit price, scheduling water level, and utilization water level helps IT asset managers optimize the resource mix and payment strategy.
- The application (label wildcard matching) dimension focuses on cost optimization in domain scenarios. Upper-layer workloads such as big data, AI, offline jobs, and online applications can use application-level cost insight for real-time cost estimation and task-level cost accounting.
With cost insight in these four dimensions, cost optimization functions and solutions in every scenario are backed by data, so that cost reduction and efficiency improvement rest on evidence.
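Trend prediction can be sketched in many ways; the Python example below uses a plain least-squares line over the daily costs seen so far and projects it to the end of the month. This is an assumed, simplified technique for illustration, not the forecasting method of the product, and the daily figures and budget are hypothetical.

```python
from typing import Sequence


def forecast_month_total(daily_costs: Sequence[float], days_in_month: int) -> float:
    """Project the month-end total by fitting a straight line to daily costs."""
    n = len(daily_costs)
    if n < 2:
        raise ValueError("need at least two days of cost data")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_costs) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_costs))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Sum the fitted line over the remaining days; never project a negative cost.
    projected = sum(max(intercept + slope * day, 0.0)
                    for day in range(n, days_in_month))
    return sum(daily_costs) + projected


def will_exceed_budget(daily_costs: Sequence[float], days_in_month: int,
                       budget: float) -> bool:
    return forecast_month_total(daily_costs, days_in_month) > budget


# Hypothetical first five days of a 30-day month, growing by 10 per day.
first_days = [100.0, 110.0, 120.0, 130.0, 140.0]
projected_total = forecast_month_total(first_days, days_in_month=30)
alarm = will_exceed_budget(first_days, days_in_month=30, budget=5000.0)
print(projected_total, alarm)
```

A flat run rate of 120 per day would suggest a month of about 3600, while the trend line projects far more, which is the kind of gap a trend alert is meant to surface early.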
Core function 3: cost optimization capabilities and solutions covering all scenarios
For the actual business scenarios of different enterprises, Alibaba Cloud Container Service provides cost optimization capabilities and solutions for all scenarios, including resource profiling (see the links at the end of the article for details):
- Elastic scaling
- Co-location
- Intelligent resource profiling
- Cloud native big data/AI
- Cloud-native workflow
In addition, most enterprise cost optimization strategies depend on the business scenario, and many scenarios involve customization and secondary development. For this reason, the cost insight capability of Alibaba Cloud Container Service's enterprise cloud native IT cost management solution is fully decoupled from the upper-layer optimization solutions, and cost insight in the four dimensions can be used to measure and evaluate cost optimization methods in any scenario.
Core function 4: multi-cluster, multi-cloud, and hybrid cloud cost management
Multi-cloud is a new trend in enterprise cloud adoption, and billing models differ considerably between cloud vendors. Examples include the annual/monthly subscription common among domestic cloud service providers, the credit card charging and postpaid billing common among international providers, and the savings plans and reserved instances supported by some providers. All of these make cost analysis harder for a multi-cloud management plane. The enterprise cloud-native IT cost management solution of Alibaba Cloud Container Service provides a unified interface for bill and price queries, with default implementations for cloud service providers, so that cost data from mainstream cloud service providers and self-built IDC data centers can be connected. Costs are then managed through a consistent cost allocation and estimation model for cloud-native container scenarios. Together with ACK One (Alibaba Cloud Distributed Cloud Container Platform), the enterprise-grade cloud-native distributed cloud container platform, this provides a unified control plane for multi-cloud management and asset management.
Core function 5: expert services for enterprise cloud native IT cost governance
Enterprise cloud-native IT cost governance is not only a product capability or a solution. It is also an evolution of enterprise IT management, organizational processes, and culture in the cloud-native era. Through Alibaba Cloud Cloud Asset Manager, the Alibaba Cloud Container Service team and the Alibaba Cloud Tianji team provide complete products and expert services that cover the FinOps concept.
Alibaba Cloud Cloud Asset Manager is a domestic cloud product that has been evaluated under the "General Maturity Model of Financial Operation Capability for Cloud Resources". It helps enterprises implement cost process governance, cost insight, cost optimization, and cost operations, build an overall cloud-native IT cost platform, and accelerate IT innovation and IT decision-making once the enterprise has fully moved to the cloud.
Back to the real case
Faced with Raymond's dilemma, how can costs be optimized with the enterprise cloud native IT cost management solution provided by Alibaba Cloud Container Service?
Step 1: Raymond first uses the cluster cost analysis capability to compare each cluster's cost trend with its cost budget, which gives a preliminary conclusion about the abnormal cost.
Cluster name | Over budget? | Budget overrun |
---|---|---|
Cluster A - Main Site/Transcoding Cluster | Yes | 5% |
Cluster B - Live/Big Data Cluster | Yes | 140% |
Cluster C - Micro Merchant Cluster | No | -9% |
The cluster costs show that most of the waste is in Cluster B, so the drill-down analysis focuses on Cluster B.
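The comparison in Step 1 amounts to a small calculation, sketched below in Python. The overrun percentages are the ones from the table; the function names and the idea of normalizing each budget to 100 are assumptions made only for illustration.

```python
from typing import Dict, List


def overrun_percent(actual: float, budget: float) -> float:
    """Budget overrun as a percentage of the budget (negative means under budget)."""
    return (actual - budget) / budget * 100.0


def drill_down_order(overruns: Dict[str, float],
                     tolerance_pct: float = 0.0) -> List[str]:
    """Clusters that exceed the tolerance, worst overrun first."""
    over = {name: pct for name, pct in overruns.items() if pct > tolerance_pct}
    return sorted(over, key=over.get, reverse=True)


# Percentages from the table above.
overruns = {"Cluster A": 5.0, "Cluster B": 140.0, "Cluster C": -9.0}
order = drill_down_order(overruns)
print(order)

# With a budget normalized to 100, an actual cost of 240 is a 140% overrun.
print(overrun_percent(actual=240.0, budget=100.0))
```

Sorting by overrun rather than by absolute cost keeps the focus on where spending deviates most from the plan, which is what directs the drill-down to Cluster B.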
Step 2: Check the cost structure of the cluster to determine the optimization direction and drill-down strategy.
In this cluster, computing resources make up most of the cost, so the analysis drills down into resource utilization and the core-hour unit price.
Step 3: Check the resource utilization of the cluster and the core-hour unit price
The scheduling water level of the cluster reaches 78%, which is a healthy level: there is still room to schedule more workloads without much waste. Actual resource utilization, however, is only 3%, which means resources are being allocated but barely used. In addition, the core-hour unit price of one node pool containing spot instances is close to the pay-as-you-go unit price, which indicates that the selected spot instance specifications are unreasonable and are driving the core-hour price too high.
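The three indicators used in Step 3 can be written down as simple ratios. The Python sketch below assumes that the scheduling water level is requested capacity divided by allocatable capacity and that utilization is actual usage divided by allocatable capacity; the cluster size of 1000 cores, the node prices, and the 0.9 threshold are hypothetical values chosen to reproduce the 78% and 3% figures.

```python
def scheduling_water_level(requested_cores: float, allocatable_cores: float) -> float:
    """Share of allocatable capacity that has been reserved by Pod requests."""
    return requested_cores / allocatable_cores


def actual_utilization(used_cores: float, allocatable_cores: float) -> float:
    """Share of allocatable capacity that is really being consumed."""
    return used_cores / allocatable_cores


def core_hour_price(node_hourly_cost: float, node_cores: int) -> float:
    """Unit price of one core for one hour on a given node type."""
    return node_hourly_cost / node_cores


def spot_pool_is_overpriced(spot_price: float, payg_price: float,
                            threshold: float = 0.9) -> bool:
    """Flag a spot node pool whose core-hour price is close to pay-as-you-go."""
    return spot_price >= threshold * payg_price


allocatable, requested, used = 1000.0, 780.0, 30.0
water_level = scheduling_water_level(requested, allocatable)
utilization = actual_utilization(used, allocatable)
reserved_but_idle = requested - used  # cores paid for and reserved, yet unused

payg = core_hour_price(node_hourly_cost=1.60, node_cores=16)
spot = core_hour_price(node_hourly_cost=1.52, node_cores=16)
print(water_level, utilization, reserved_but_idle, spot_pool_is_overpriced(spot, payg))
```

A wide gap between the two ratios points at over-sized requests or idle replicas rather than at a lack of capacity, which is why the next step drills down to individual applications.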
Step 4: Drill down to the application dimension and locate the problem application
The namespace dimension shows that some namespaces have obvious peaks and valleys in capacity, yet resource utilization barely changes after capacity is expanded. The scheduled scaling is bringing no business benefit.
The resource waste list provided for the namespace shows the names of the applications with heavy waste. Filtering by the application's label shows that the application is almost idle, yet it accounts for 34.74% of the overall consumption of the cluster.
After checking with R&D colleagues, Raymond found that scheduled scaling had been configured on a test service that had not yet been launched, and that the replica count configured for scale-out was large, resulting in heavy resource waste. In addition, the cost of the spot instance combination in the cluster had soared because of inventory problems, so the availability zones and specifications of the spot instances needed to be reconfigured. Raymond reconfigured the scheduled scaling rules and corrected the spot instance configuration, and the problem that had troubled him for so long was solved.
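A resource waste list like the one used in Step 4 can be approximated by ranking applications that hold a large share of requested capacity while using very little of it. The Python sketch below is a simplified illustration, not the product's implementation: the application names, the core-hour figures, and both thresholds are hypothetical, with the test service sized so that its share matches the 34.74% mentioned above.

```python
from typing import Dict, List, Tuple


def waste_list(apps: Dict[str, Tuple[float, float]],
               min_share: float = 0.05,
               max_utilization: float = 0.10) -> List[Tuple[str, float, float]]:
    """apps maps an application label to (requested core-hours, used core-hours).
    Returns (name, share of cluster requests, utilization), largest share first."""
    total_requested = sum(requested for requested, _ in apps.values())
    if total_requested == 0:
        return []
    flagged = []
    for name, (requested, used) in apps.items():
        share = requested / total_requested
        utilization = used / requested if requested else 0.0
        if share >= min_share and utilization <= max_utilization:
            flagged.append((name, share, utilization))
    return sorted(flagged, key=lambda row: row[1], reverse=True)


# Hypothetical monthly core-hours per application label.
apps = {
    "live-gateway": (4000.0, 2400.0),
    "presto-adhoc": (2526.0, 1500.0),
    "test-service": (3474.0, 10.0),
}
suspects = waste_list(apps)
print(suspects)
```

The list only narrows down where to look; confirming that the test service was scaled by a scheduled rule still required a conversation with the team that owns it.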
Looking back, Raymond's problems were all small things that can easily happen in production, and it is exactly these inconspicuous details that can cause huge financial losses in enterprise IT cost management. The more complex the IT system, the more automated its operation and maintenance needs to be. Likewise, the more cloud-native methods of reducing costs and increasing efficiency are in use, the more data-driven and transparent the IT cost management solution must be. Reducing costs and increasing efficiency is a goal that emphasizes results rather than process. With an enterprise cloud-native IT cost management solution, IT cost optimization can be achieved in a transparent, data-driven, and automated way.
The future of cloud-native enterprise IT cost governance
It is foreseeable that the concept of cloud financial management (FinOps) will be raised and adopted by more and more enterprises, and that capabilities and solutions for reducing costs and increasing efficiency will spring up like mushrooms after rain. In practice, however, the IT cost management concepts of most enterprises have not kept up with the evolution of their architecture, which places a greater burden on their cloud-native transformation. To fully drive and implement a cloud-native IT cost optimization strategy, the concepts, tools, and processes of cloud-native IT cost governance must come first. Only optimization solutions that are observable, quantifiable, and measurable can truly prove their value.
Alibaba Cloud's enterprise cloud-native IT cost management solution helps enterprises put the concepts, tools, and processes of enterprise IT cost management into practice, so that they can manage and optimize IT costs in a data-driven way during cloud-native transformation and become practitioners and leaders in the FinOps field.
Related Links
[1] Gartner Report: Alibaba Cloud Becomes the Most Complete Cloud Service Provider for Global Container Products
https://developer.aliyun.com/article/763157
[2] Elastic scaling
https://help.aliyun.com/document_detail/119099.html
[3] Intelligent resource profiling
https://help.aliyun.com/document_detail/413944.html
[4] Cloud native big data/AI
https://help.aliyun.com/document_detail/201994.html
[5] Cloud native workflow
https://help.aliyun.com/document_detail/157124.html