Introduction: In the current wave of middle platform (zhongtai) construction in the financial industry, many financial institutions still hold misconceptions about it. Where is middle platform construction heading? How should data assets be managed? Alibaba's own path to building a middle platform can serve as a reference for financial institutions. At the recent 2021 Alibaba Cloud Financial Data Intelligence Summit, Guan Tao, a researcher at the Alibaba Cloud Intelligent Computing Platform Division, shared the platform technology part of how Alibaba builds the three core elements of the data middle platform. The talk covered the four typical stages of data platform development, the four major technical challenges of supporting the business as a middle platform, and the four major technology trends of the data platform.
Author: Guan Tao, researcher at the Alibaba Cloud Intelligent Computing Platform Division
Four major stages of the development of Alibaba's data platform
To build a data middle platform, a powerful data platform is indispensable as its base. The four stages of Alibaba's data platform development are, to a certain extent, also the four stages of the development of Alibaba's data middle platform. Across these four stages you can see how Alibaba extracted commercial value from its own data, consolidated data systems that were originally built separately, developed new ideas for turning data into assets and applying it efficiently, and dealt with the organizational changes that data platform governance brings.
Phase 1: Business blossoms, discovering the value of data
From 2009 to 2012, Alibaba's e-commerce business entered an explosive period, and many well-known business teams emerged, such as Taobao, 1688, AliExpress, and Yitao. Each of these businesses was data-driven across all of its scenarios, so the business side had a strong demand for data. At that time, almost all of Alibaba's technology ran on the IOE architecture, and the core data system was Oracle. Within two years, Alibaba had built the largest Oracle cluster in Asia. By 2010, however, Oracle could no longer meet the computing requirements: data was often delayed, many requirements could not be met, and the high cost made it impossible to keep supporting business growth. Alibaba began to take the construction of a next-generation data platform seriously and launched two projects in parallel. One was "Yunti 1" (Cloud Ladder 1), based on the open source Hadoop technology stack, in which several business teams built multiple Hadoop clusters that reached a total of 4,000 servers. The other was "Yunti 2" (Cloud Ladder 2; ODPS, now MaxCompute), developed in-house by Alibaba, with a cluster of about 1,200 servers. "Shepherd Dog", the small and micro loan business of Ant, was the first business bold enough to try it. The launch of "Yunti 2" was described at the time as "human-powered cloud computing" and "step-by-step trial calculation". When Academician Wang Jian read from "Into Thin Air" on CCTV's "The Reader" program in 2018, he was describing the situation and the conviction of the in-house data platform team in those days. The two projects both competed and cooperated within Alibaba, exploring the development path of Alibaba's data platform in parallel. During this period, almost every business unit built its data stack vertically, each rushing ahead inside its own small, independent closed loop.
Phase 2: Businesses form small vertical closed loops, data islands appear
From 2012 to 2015, while Alibaba's e-commerce business grew rapidly, more new businesses emerged: in 2013 Cainiao was founded and the "all-in wireless" strategy was launched; in 2014 Alibaba invested in AutoNavi, formed a joint venture with Yintai, and established Alibaba Travel; in 2015 DingTalk and Lingshoutong (Retail Pass) were launched, Koubei was established, and Alibaba took a controlling stake in Alibaba Health. During this period Alibaba's business flourished, forming 12 business departments and 9 different platform systems, each with a different system architecture. Digitizing a single user journey required data systems spanning multiple BUs. Data islands became more and more serious, data costs kept rising, and building a unified data platform became urgent. This was also the starting point of Alibaba's data middle platform. At the same time, "Yunti 1" and "Yunti 2" were undergoing major changes. On March 28, 2013, Yun Zheng, an architect in Alibaba Group's technical support department, sent an email directly to the group's senior executives: "Based on current data growth and projected business growth, the storage and computing capacity of the Yunti 1 and Yunti 2 systems will hit a bottleneck on June 21 this year." At that point, many businesses would be unable to grow because of technical limits. This meant the data platform could no longer run "Yunti 1" and "Yunti 2" in parallel, and one of them had to be chosen. If "Yunti 1" were chosen, how could it break through Hadoop's 5,000-node limit? For financial services, how could an open source system guarantee the security and availability of big data? How could cross-data-center operation be solved when the industry offered no reference solution? With frequent interaction between businesses, how could stable data exchange across data centers be guaranteed?
This series of technical problems gradually pushed the data platform onto the path of in-house development. In the end, several technical departments of Alibaba Group merged and decided to choose "Yunti 2" and take on the 5K challenge. In just a few months, "Yunti 2" grew from 1,500 to 5,000 servers, broke the limit of a single physical data center, passed a 10x stress test, and supported cross-cluster computing and high availability, laying a solid technical foundation for Alibaba's big data development for years to come. After the 5K project achieved its technical breakthrough, new pressures followed. Rapid business growth led to rapid growth in data volume. How to manage data in a unified way, secure it in a unified way, and open it up through unified capabilities became the core questions for the data platform. To this end, Alibaba launched a well-known internal project to synchronize the data of all business departments to a unified big data platform for unified management. The project took two years and involved every business unit of Alibaba. In the process, it gradually turned general data platform capabilities into products and gave the platform financial-grade capabilities. Seen from that time, Alibaba's process of building a data platform was a process of comprehensively unifying data, and it was also China's first construction and migration of a data middle platform at ultra-large scale.
Phase 3: The data middle platform supports sustainable business development
From 2015 to 2018, the methodology of Alibaba's data middle platform began to take shape, raising the curtain on data middle platform construction. In 2015, after Alibaba Group announced its "middle platform strategy", it began to build a more flexible "large middle platform, small front end" organizational and business mechanism suited to the DT (data technology) era. Every member of Alibaba's operations staff could use data to formulate operating strategies covering the whole user life cycle. Business staff began to explore data-driven business, and more businesses began to move towards real-time processing. However, the rapid growth of data and computation and the rapid consumption of resources brought the problem of data governance. The Alibaba team began to think about how to implement the data middle platform methodology at the platform layer, so that the data platform could support the construction of the data middle platform. The questions included:
· Who owns the data? Who uses it? Who controls it? Who is responsible for data quality?
· The platform team and the business team are two separate teams. How are costs allocated between them?
· How does the middle platform methodology apply to the data platform? How should it be managed?
· Data is growing faster than the business. What should we do?
· A 12 PB core table is copied once by every department, costing tens of millions a year. What should we do?
· I know I want to delete half of the data, but which half?
Behind these problems lie the governance of data and its management as an asset. A platform system is needed to carry the methodology and achieve real unification. On the data platform side, DataWorks provides one-stop capabilities for large-scale collaborative data development and governance. MaxCompute supports clusters of up to 100,000 servers, serves the daily operations of all BUs and more than 200,000 employees of Alibaba Group, and supports the sustainable development of each business.
Phase 4: The data middle platform and the business grow together on the cloud
After 2018, the Alibaba data platform system as a whole became very mature, and the platform side and the business side reached a very good state of cooperation. The business side recognizes the value of the data platform, business and technical departments depend on each other, and the data middle platform's service to the business has entered a positive cycle, which is a sign that the data middle platform has been built successfully. Alibaba began moving all of its internal systems to the cloud in 2018, and by 2021 the data middle platform and the business were growing together on the cloud: 100% of the core systems for Double 11 run on the cloud, and Alibaba is fully cloud native; at 538,000 transactions per second, Alibaba Cloud withstood the world's largest traffic peak; the data middle platform covers all BUs of Alibaba Group; operations staff find and analyze problems in time to make operational decisions in real time; new businesses such as short video and live streaming keep emerging. Alibaba's data middle platform construction has succeeded and is still developing at high speed.
The MaxCompute intelligent data warehouse makes Double 11 scale a routine matter, and the integrated lake and warehouse ("lakehouse") is gradually becoming the next-generation big data platform architecture. The data middle platform built on DataWorks supports hundreds of data applications in the group, and cost growth is kept in check while the group's business grows rapidly.
Four core challenges of data platform construction
The core indicator of a successful data middle platform is not system efficiency or platform efficiency, but "data efficiency". Alibaba measures "data efficiency" mainly in four dimensions: scale and elasticity, data cost, data accuracy and maintainability, and data utilization.
Under this core indicator, methodology, organization, and platform capabilities are the three core elements of a successful data middle platform. So if you want to build a data platform, what methods lie behind it, and which difficulties need attention during construction? There is a great deal of work behind the scenes. Here I will only introduce four business-facing aspects; the challenges of the storage and computing engines are not covered.
Challenge 1: Data Asset Management System
For data assets, the first question to answer is: what are an enterprise's data assets? Each BU of Alibaba has a panoramic view of the data assets of its own business department. A single map manages 99.9% of Alibaba's computing data assets, and the storage and computing costs of each department are quantified and shown directly to managers. The second question: how should assets be viewed? For an enterprise, are assets just a cost figure? Through the data asset view, Alibaba lets managers know where their data comes from, whom it serves, and who their closest partners are, while also meeting the needs of data flow auditing. The third question: how can assets be scaled? When new businesses are merged, acquired, or incubated, how can this asset system be replicated quickly? DataWorks and related tools provide data middle platform modeling tools, which supply standardized blueprints for data middle platform construction, divide the business into domains, and perform intelligent modeling, so that new businesses can quickly reuse a mature data architecture and the asset system can scale.
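The second question above (where data comes from and whom it serves) is essentially a lineage query. The following is a minimal sketch, assuming a hypothetical in-memory lineage graph; the class `LineageGraph`, the table names, and the team names are all illustrative and are not the DataWorks API or Alibaba's implementation.

```python
from collections import deque


class LineageGraph:
    """Toy lineage graph: an edge source -> target means target is built from source."""

    def __init__(self):
        self._downstream = {}  # table -> set of tables derived from it
        self._upstream = {}    # table -> set of tables it is derived from
        self._owner = {}       # table -> owning team (hypothetical metadata)

    def add_table(self, table, owner):
        self._owner[table] = owner

    def add_dependency(self, source, target):
        self._downstream.setdefault(source, set()).add(target)
        self._upstream.setdefault(target, set()).add(source)

    @staticmethod
    def _walk(start, edges):
        # Breadth-first traversal that collects every table reachable from start.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def sources_of(self, table):
        """All tables this table is directly or indirectly derived from."""
        return self._walk(table, self._upstream)

    def consumers_of(self, table):
        """All tables directly or indirectly derived from this table."""
        return self._walk(table, self._downstream)

    def consumer_teams(self, table):
        """Teams that own at least one table derived from this table."""
        return {self._owner[t] for t in self.consumers_of(table) if t in self._owner}


graph = LineageGraph()
graph.add_table("ods_orders", "trade")
graph.add_table("dwd_orders", "trade")
graph.add_table("ads_logistics_report", "logistics")
graph.add_table("ads_marketing_report", "marketing")
graph.add_dependency("ods_orders", "dwd_orders")
graph.add_dependency("dwd_orders", "ads_logistics_report")
graph.add_dependency("dwd_orders", "ads_marketing_report")

print(graph.sources_of("ads_marketing_report"))
print(graph.consumer_teams("ods_orders"))
```

The same traversal also supports the auditing use case: walking downstream from a sensitive table lists every table and team that its data flows into.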
Challenge 2: Data Quality System
For data quality, the first problem to solve is how to define quality before the fact. The financial industry often speaks of reconciliation, and Alibaba's data needs to be reconciled too. To reconcile data tables numbering in the tens of millions, we introduced the concept of "quality rules". There are more than 7 million quality rules, and more than 10,000 new ones are added every day. How could they possibly be matched by hand? Alibaba built 37 rule templates and recommends matching rules intelligently, with an adoption rate of 75%. The second question: how is quality enforced during processing? More than 7 million quality rules need a lot of computing resources, so how can the cost be reduced? Using intelligent techniques, we built a data quality scheduling engine and an ETL engine. When data changes, quality monitoring is triggered in real time, and checks are scheduled by priority and run when resources are idle. The third question: how can quality be automated after the fact? Rules are static, but data is alive. What happens when the data shows periodic fluctuations and changes? In building data quality we incorporated many artificial intelligence techniques: machine learning learns how the data is generated, predicts dynamic thresholds intelligently, and matches periodic fluctuations algorithmically.
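To illustrate why a dynamic threshold handles periodic fluctuation better than a fixed one, here is a minimal sketch. It swaps in a simple seasonal statistic (mean and standard deviation of past points at the same phase of the cycle) for the machine learning models the talk mentions, which are not described in detail. The function names, the weekly period, and the sample row counts are all assumptions made for illustration.

```python
from statistics import mean, stdev


def dynamic_threshold(history, period, k=3.0):
    """Return (low, high) bounds for the next point of a periodic metric.

    history: past values in time order (for example daily row counts).
    period:  assumed cycle length in points (for example 7 for a weekly cycle).
    k:       half-width of the band, in standard deviations.
    """
    next_index = len(history)
    # Keep only past points at the same phase of the cycle as the next point.
    same_phase = [v for i, v in enumerate(history) if i % period == next_index % period]
    if len(same_phase) < 2:
        raise ValueError("need at least two past points at this phase of the cycle")
    center = mean(same_phase)
    spread = stdev(same_phase)
    return center - k * spread, center + k * spread


def check_row_count(history, observed, period=7, k=3.0):
    """True if the observed value falls inside the predicted band."""
    low, high = dynamic_threshold(history, period, k)
    return low <= observed <= high


# Three weeks of hypothetical daily row counts: about 100 on weekdays, about 300 at weekends.
history = [
    100, 102, 98, 101, 99, 300, 310,
    101, 99, 100, 103, 97, 305, 295,
    99, 100, 102, 98, 101, 298, 302,
]

# The next point is a weekday, so a weekend-sized value is flagged as abnormal.
print(check_row_count(history, 101))
print(check_row_count(history, 300))
```

A single fixed range wide enough to accept both weekday and weekend values would have accepted 300 on a weekday, which is the kind of miss that a per-phase threshold avoids.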
Challenge 3: Data Security System
For data security, the questions are how to reduce the cost of use and improve ease of use, how to cover the full life cycle of data, how to control permissions, how to mask (desensitize) data, and how to identify sensitive behavior so that data can be traced. Alibaba has accumulated more than 20 different security governance rules internally. These rules ultimately help the platform meet compliance requirements for personal data while keeping up with rapid business growth.
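Permission control and data masking can be combined at read time. The following is a minimal sketch under stated assumptions: the sensitivity tags, the masking rules, and the `mask_row` helper are hypothetical and are not taken from Alibaba's security rules or any product API.

```python
import re


def mask_phone(value):
    """Keep the first 3 and last 4 digits of an 11-digit number; mask everything otherwise."""
    if re.fullmatch(r"\d{11}", value):
        return value[:3] + "****" + value[-4:]
    return "*" * len(value)


def mask_keep_edges(value, keep=1):
    """Keep `keep` characters at each end and replace the middle with asterisks."""
    if len(value) <= 2 * keep:
        return "*" * len(value)
    return value[:keep] + "*" * (len(value) - 2 * keep) + value[-keep:]


# Hypothetical mapping from a column's sensitivity tag to its masking rule.
MASKING_RULES = {
    "phone": mask_phone,
    "name": mask_keep_edges,
}


def mask_row(row, column_tags, allowed_tags=frozenset()):
    """Return a copy of row with tagged columns masked.

    column_tags:  column name -> sensitivity tag.
    allowed_tags: tags the caller is permitted to read in clear text.
    """
    masked = {}
    for column, value in row.items():
        tag = column_tags.get(column)
        if tag in MASKING_RULES and tag not in allowed_tags:
            masked[column] = MASKING_RULES[tag](str(value))
        else:
            masked[column] = value
    return masked


row = {"user_id": 42, "name": "Zhang San", "phone": "13812345678"}
tags = {"name": "name", "phone": "phone"}

print(mask_row(row, tags))
print(mask_row(row, tags, allowed_tags={"phone"}))
```

Tagging columns once and applying rules by tag keeps the cost of use low: a new table only needs its columns tagged, not a new set of masking code.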
Challenge 4: Data Governance System
Once data governance reaches deep water, the questions become how to keep data cost from growing faster than the business, and how to motivate every employee to take part in governance and develop cost awareness. At Alibaba, data governance is the interplay of engine, platform, and people. The engine pursues the ultimate in computing power and cost, continually breaking the linear relationship between fast-growing data computation and cost growth. On the platform, storage and computing health scores have become the core metric of each team's data governance campaign in the group, pushing people to govern and manage their data, while the platform's end-to-end tools are used to build a technical and operational system for data governance. Cost reports of this kind show the cost and value of the platform layer clearly. Over 12 years of data platform construction, Alibaba has built up productized data middle platform capabilities along multiple dimensions, including data assets, quality, security, and governance.
As the base of the middle platform, where will the data platform go next?
In the future, the data platform, as the base of the middle platform, will move from data intelligence to intelligent data. "Lakehouse integration" allows the architecture to be upgraded flexibly, the "intelligent data warehouse" will solve the problem of managing data at ultra-large scale, and "intelligent query" will greatly lower the threshold of data analysis. AI that is cloud native, large-scale, standardized, and widely accessible becomes the ultimate outlet for big data, and the convergence of big data and AI keeps accelerating.
Trend 1: Lake and warehouse as two sides of one architecture
As the next-generation data platform architecture, the integrated lakehouse allows the architecture to be upgraded flexibly in today's complex environment. The data warehouse focuses on enterprise-grade data and processes it in a more refined, more economical, and more efficient way. Enterprises can build their own data middle platform on it, and there is an established methodology and supporting toolset for both engine optimization and data management. However, the barrier to entry is high, it is expensive, and it is not easy to use. The data lake is a technology that grew out of the open source ecosystem, with a low barrier to entry, low cost, and considerable flexibility. It is easy for an enterprise to build its own data lake, but beyond unified storage the enterprise still needs fine-grained management, so that the data is manageable, governable, low-cost, and operable. To break down the separation between data lake and data warehouse and combine the flexibility of the lake with the enterprise-grade capabilities of the warehouse in one architecture, Alibaba proposes an integrated lakehouse architecture: storage and metadata are unified, the data systems are connected, and intelligent data warehouse technology automatically tiers storage and processing for different data and tasks.
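The talk does not describe how the automatic tiering decides where data goes. As a purely illustrative sketch, a rule based on access statistics might look like the following; the `TableStats` fields, the tier names, and the thresholds are all assumptions, not Alibaba's policy.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TableStats:
    """Hypothetical per-table access statistics taken from unified metadata."""
    name: str
    last_accessed: date
    reads_last_30d: int


def choose_tier(stats, today, hot_reads=100, cold_after_days=90):
    """Pick a storage tier from access statistics.

    Frequently read tables go to the warehouse tier, tables idle for a long
    time go to a cheap archive tier in the lake, and everything else stays
    in the regular lake tier. All thresholds are illustrative.
    """
    idle_days = (today - stats.last_accessed).days
    if stats.reads_last_30d >= hot_reads:
        return "warehouse"
    if idle_days >= cold_after_days:
        return "lake_archive"
    return "lake"


today = date(2021, 6, 30)
tables = [
    TableStats("dwd_orders", date(2021, 6, 29), 500),
    TableStats("ods_click_log", date(2021, 6, 1), 5),
    TableStats("tmp_export_2020", date(2021, 1, 1), 0),
]

for table in tables:
    print(table.name, choose_tier(table, today))
```

Because storage and metadata are unified in a lakehouse, a decision like this can be made from one catalog rather than from separate lake and warehouse inventories.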
Trend 2: Data warehouse enters the era of "autonomous driving"
Ultra-large-scale data brings management problems that the traditional "DBA model" can no longer handle. Alibaba has tens of millions of tables, and many core data development engineers are each responsible for tens of thousands of them, which leaves no room for fine-grained governance and modeling. A system like this cannot scale by adding people, so more and more AI techniques will be integrated into big data systems, and the data warehouse will enter the era of "autonomous driving".
Trend 3: What you ask is what you get, intelligent data query based on natural language
Alibaba is trying to build an ultra-large-scale knowledge graph on top of its data. The knowledge graph translates data to the semantic layer, and technologies such as NLP (natural language processing) then bridge that layer to the user. For example, by typing a question such as "Which internet customers are in Beijing?", a user can have the corresponding data generated automatically. Alibaba is working to apply natural-language intelligent query to massive data at scale, so that more people who are not data professionals can complete data analysis on their own.
Trend 4: Data is intelligence, the basic capabilities of AI engineering
Data needs intelligence to accelerate it, and AI is the ultimate outlet for big data. Using AI well is known to be very hard. From initial data onboarding, through data refinement, model training, and model tuning, to model deployment and serving, the whole chain is very long. If 50,000 people can use data directly, perhaps no more than 5,000 of them can actually use AI. Empowering business teams with AI technology alongside data is what is meant by AI engineering.
Finally, to summarize: the content above is only an overview of the four typical stages of building the base of Alibaba's data platform, the four major technical challenges encountered, and the four major technology trends of the data platform. These topics are not the whole of Alibaba's data middle platform. Over 12 years, Alibaba's data platform construction has accumulated a great deal of technology. These platform capabilities keep pushing the data middle platform towards intelligence, and they will continue to evolve, serving Alibaba and being offered to society as a whole.