Introduction: After years of development, the Alibaba Da Tao data system supports complex business scenarios with rich data and products, and has built a substantial lead in the data field. As data volume and the number of developers have grown, however, a gap has emerged: although Alibaba's big data system specification provides standards for unified management, there has been no effective model design and control at the product level. Problems such as low reusability of the common layer have gradually become prominent: storage costs rise, efficiency drops, adherence to specifications weakens, data becomes harder to use, and the O&M burden grows. To solve these problems, we launched a special project on Da Tao model governance, with the goal of reducing costs and improving efficiency while continuing to serve the data business.

Participating teams:

Data Technology and Products Department - Data Team of Da Tao Department

Data Technology and Products Department - Data Security Production Platform

Computing Platform Division - DataWorks Product and R&D Team

**1 Data Status**

To better analyze the current data problems in Da Tao, we first quantified them through detailed data analysis. (The analysis is backed by detailed data, but because that data is security-sensitive, this article only summarizes the problems and does not show specific figures.)

1 Standardization issues

  • Table naming is not standardized and lacks control: as data volume has grown, a large number of Da Tao tables have been named without following the Alibaba big data system specification, which makes them hard to manage.
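
A naming check of this kind can be automated. The sketch below is only an illustration: the article does not give the actual naming rules of the Alibaba big data system specification, so the pattern `<layer>_<part>_<part>...` and the layer prefixes used here are assumptions, as are the function names.

```python
import re

# Hypothetical naming rule (an assumption for illustration, not the actual
# Alibaba specification): a layer prefix followed by lowercase parts
# separated by underscores, e.g. dwd_trade_order_di.
ALLOWED_LAYERS = ("ods", "dwd", "dws", "dim", "ads")
NAME_PATTERN = re.compile(
    r"^(?:" + "|".join(ALLOWED_LAYERS) + r")_[a-z0-9]+(?:_[a-z0-9]+)*$"
)


def check_table_name(name: str) -> bool:
    """Return True if the table name matches the assumed convention."""
    return NAME_PATTERN.fullmatch(name) is not None


def naming_compliance(names):
    """Return (compliance ratio, list of non-compliant names)."""
    bad = [n for n in names if not check_table_name(n)]
    ratio = 1.0 if not names else (len(names) - len(bad)) / len(names)
    return ratio, bad


if __name__ == "__main__":
    ratio, bad = naming_compliance(
        ["dwd_trade_order_di", "ads_shop_gmv_1d", "tmp_test", "OrderTable"]
    )
    print(f"compliance={ratio:.0%}, non-compliant={bad}")
```

Running such a check over the table metadata of a project gives a compliance ratio that can be tracked over time, plus a concrete list of tables to rename or retire.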

2 Common layer reusability issues

  • Low reusability of common-layer tables: common-layer tables have fewer than 2 downstream references;
  • Insufficient construction or insufficient exposure of the common layer: references to CDM tables are decreasing while references to ADS tables are increasing;
  • Common logic in many ADS tables has not been sunk to the common layer: in many cases ADS tables contain duplicated code and highly similar fields;
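
The reusability signal above (downstream references below 2) can be derived from table lineage. The following is a minimal sketch, assuming lineage is available as `(upstream, downstream)` table pairs; the function names and the sample table names are hypothetical.

```python
from collections import defaultdict


def downstream_counts(edges):
    """Count distinct downstream tables per upstream table.

    edges: iterable of (upstream_table, downstream_table) lineage pairs.
    Duplicate pairs and self-references are ignored.
    """
    children = defaultdict(set)
    for up, down in edges:
        if up != down:
            children[up].add(down)
    return {table: len(downs) for table, downs in children.items()}


def low_reuse_tables(common_tables, edges, min_refs=2):
    """Return common-layer tables with fewer than min_refs downstream tables."""
    counts = downstream_counts(edges)
    return sorted(t for t in common_tables if counts.get(t, 0) < min_refs)


if __name__ == "__main__":
    lineage = [
        ("dwd_a", "ads_x"),
        ("dwd_a", "ads_y"),
        ("dwd_b", "ads_x"),
        ("dwd_a", "ads_x"),  # duplicate edge, counted once
    ]
    print(low_reuse_tables(["dwd_a", "dwd_b", "dws_c"], lineage))
```

Counting distinct downstream tables rather than raw edges avoids inflating the score of a table that is read many times by the same job.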

3 Application layer efficiency issues

  • Too many temporary tables, which hampers data management: there are many TDDL temporary tables, PAI temporary tables, machine temporary tables, stress-test temporary tables, and so on;
  • Unreasonable distribution of common-layer tables among teams: they are spread across multiple teams;
  • Common logic in many ADS tables has not been sunk to the common layer;
  • Some ADS tables have deep dependency chains within the application layer: many ADS tables sit more than 10 layers deep in the application layer;
  • Cross-mart dependencies in the application layer are pronounced: ADS tables in different data marts depend on each other, which affects data stability and makes data accuracy hard to guarantee;
  • A large number of common-layer tables could be handed over: common-layer data belonging to other teams is mixed in with Da Tao data;
  • Unbalanced table ownership: the number of tables per owner is unevenly distributed. Some owners have only dozens of tables, while others have thousands.
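
The dependency-depth problem in the list above can be measured directly from lineage. The sketch below assumes a mapping from each ADS table to the ADS tables it reads from, and assumes that lineage is acyclic; the function names, table names, and the default limit of 10 (taken from the figure mentioned above) are illustrative only.

```python
from functools import lru_cache


def ads_depths(ads_parents):
    """Compute the depth of each ADS table inside the application layer.

    ads_parents maps an ADS table to the ADS tables it reads from.
    A table that reads from no other ADS table has depth 1.
    Assumes the lineage graph is acyclic (a DAG).
    """

    @lru_cache(maxsize=None)
    def depth(table):
        parents = ads_parents.get(table, ())
        return 1 + max((depth(p) for p in parents), default=0)

    return {table: depth(table) for table in ads_parents}


def too_deep(depths, limit=10):
    """Return the tables whose application-layer depth exceeds the limit."""
    return sorted(t for t, d in depths.items() if d > limit)


if __name__ == "__main__":
    parents = {
        "ads_a": [],
        "ads_b": ["ads_a"],
        "ads_c": ["ads_b", "ads_a"],
    }
    depths = ads_depths(parents)
    print(depths, too_deep(depths, limit=2))
```

Because depth is defined as the longest chain, a table that reads from both a shallow and a deep ADS table is reported at the deeper level, which is the one that drives latency and stability risk.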

**2 Problem Analysis**

By quantifying the current data problems, we found that they span all aspects of data: evaluation, construction, management, and use.

Evaluation: There is no unified data evaluation system. In the past, data problems were discovered mainly through expert experience, through issues encountered during development and use, and through scattered data analysis, with no unified quantitative evaluation system behind them. How much data is there? How is it distributed across layers? How well do table names follow the specification? How reusable are the tables? How efficient is table processing and consumption? How should the quality of data construction, use, and maintenance be evaluated? Which metrics define good data?

Construction: The analysis showed that during periods when the common layer was built and governed in a unified way, the data performed well in standardization, reusability, link complexity, and usage efficiency; during periods without unified construction and governance, it performed poorly on every dimension. The root cause is that we have the Alibaba big data system specification, but no modeling and development product that covers design, review, development, control, and governance.

Management: Once data has been built, its cost, reusability, efficiency, and health are not managed effectively. Management usually relies on centralized governance, special governance projects, or pushed governance tasks, which are costly and slow to iterate. Table ownership is also unevenly distributed: some owners carry a heavy management and O&M load, and data is hard to maintain after it is handed over, which in turn makes it hard to use.

Use: Data exists to be used. Data analysis and survey questionnaires show three common problems: data is hard to find, users do not know how to use it, and users do not dare to use it. As a result, apart from a small set of core model data, many developers would rather rebuild data than spend the effort to find and understand existing data, which creates a vicious circle.

**3 Solutions**

Based on this analysis, we set the following goals:
1. Model quantification: Build a general Da Tao model evaluation system that evaluates the health of current data from multiple dimensions and provides improvement suggestions for problematic data.

2. Efficiency improvement and sinking of common models: Define a clear standard for sinking data to the common layer, so that it is clear which data belongs in the common layer and that such data is sunk in time.

3. Productization: Through co-construction, develop a modeling and development product that covers design, review, development, control, and governance.

4. Routine governance: Monitor model health and optimize governance on a daily basis.

5. Efficient data discovery: Through co-construction, improve data search efficiency and recommendation accuracy, and showcase core data in data albums.

To achieve these goals, we produced an overall design for model governance:
[Figure: overall design of model governance]

1 DataWorks co-construction

DataWorks is a professional, efficient, secure, and reliable one-stop big data development and governance platform built on big data compute engines such as MaxCompute, EMR, and Hologres. Through in-depth co-construction with the DataWorks team, we combined the experience in modeling, development, and O&M that Da Tao has accumulated over many years with DataWorks' strong product R&D capabilities to upgrade features such as intelligent modeling, the development assistant, and the data map. The result is full-link productization of data design, development, management and control, and use, which addresses long-standing data problems.

**DataWorks Intelligent Data Modeling**

DataWorks intelligent data modeling has now completed major feature iterations across four modules: data warehouse planning, data standards, dimensional modeling, and data indicators. It offers reverse modeling, forward visual modeling, Excel-based modeling, and code-based modeling, and its new features were officially released at the 2021 Yunqi (Apsara) Conference.
The newly released core features of DataWorks intelligent data modeling include the following:
Data warehouse planning:

  • Businesses can customize the elements (such as data domains and data marts) required by the classic layer-and-domain scheme of a data warehouse, at both the common layer and the application layer;
  • Businesses can customize data warehouse specifications, such as the standard definition of table names at each layer;
  • Modeling spaces are supported, and the relationship between a modeling space and data R&D spaces can be configured, which meets Da Tao's need for multiple businesses to share one set of data standards and manage data models as a whole.

Dimensional modeling:

  • It supports reverse modeling of existing physical tables in the data warehouse, which solves the cold-start problem for Da Tao's existing physical tables.
  • It supports forward modeling of dimension tables, detail tables, light summary tables, and application-layer tables in three ways: visual modeling, Excel file import, and code-based modeling. Forward visual modeling draws on the classic modeling theory accumulated by Da Tao modelers and takes advantage of MaxCompute to quickly copy the structure of existing physical tables in MaxCompute and to add redundant dimension fields based on existing fields. Summary tables and application-layer tables can also reference existing indicators to generate model table fields quickly. The Excel import mode productizes the classic model Excel template that Da Tao engineers have refined over several years, serving modelers who are used to modeling in Excel. Together, the forward modeling features greatly improve modeling efficiency.
  • Designed models can go through model review and be published as physical tables to five engines, including MaxCompute and Hologres.
  • Once a model is published successfully, it is connected to DataStudio (data development) and ETL skeleton code is generated automatically. Data developers only need to add the business logic on top of this code, which improves their development efficiency to some extent.

These features serve the goals of standardizing model construction and improving its efficiency.

[Figure: data warehouse planning]

[Figure: dimensional modeling]

**Development Assistant**

During code development, the development assistant provides permission reminders, release control, and automatic creation of temporary tables.


2 Model scores

**Model scoring logic**

[Figure: model scoring logic]

**Model score dashboard**

Internally, we present model scores as a dashboard. Each governance suggestion links directly to the product page where the corresponding governance action can be taken.
To make the scores easy to reuse, model scoring supports quick configuration and onboarding. Given a list of projects, a configuration change is enough to bring in the data of the corresponding BU and generate table-level, owner-level, and BU-level model scores and governance actions.
The governance items on the model dashboard use full-link lineage and tagging capabilities, which makes governance more precisely targeted.
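
The roll-up from table-level to owner-level and BU-level scores can be pictured with a small sketch. The article does not publish the actual scoring logic, so the dimensions, the weights, and every name below are assumptions made purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical dimensions and weights (assumptions, not the real scoring
# logic, which the article does not disclose). Scores range from 0 to 100.
WEIGHTS = {"standardization": 0.3, "reusability": 0.4, "efficiency": 0.3}


@dataclass
class TableMetrics:
    table: str
    owner: str
    bu: str
    scores: dict  # dimension name -> score in [0, 100]


def table_score(metrics: TableMetrics) -> float:
    """Weighted sum of the per-dimension scores of one table."""
    return sum(w * metrics.scores.get(dim, 0.0) for dim, w in WEIGHTS.items())


def rollup(all_metrics, key: str) -> dict:
    """Average table scores grouped by an attribute such as 'owner' or 'bu'."""
    groups = defaultdict(list)
    for m in all_metrics:
        groups[getattr(m, key)].append(table_score(m))
    return {k: sum(v) / len(v) for k, v in groups.items()}


if __name__ == "__main__":
    data = [
        TableMetrics("t1", "alice", "bu_x",
                     {"standardization": 100, "reusability": 100, "efficiency": 100}),
        TableMetrics("t2", "alice", "bu_x",
                     {"standardization": 0, "reusability": 50, "efficiency": 100}),
        TableMetrics("t3", "bob", "bu_x",
                     {"standardization": 100, "reusability": 0, "efficiency": 0}),
    ]
    print(rollup(data, "owner"))
    print(rollup(data, "bu"))
```

Keeping the table-level score as the single source and deriving owner and BU views by grouping means that onboarding a new BU only requires supplying its table metadata, which matches the configuration-driven onboarding described above.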

3 Efficient data discovery

Plan for improving data discovery efficiency:

[Figure: plan for improving data discovery efficiency]

The data map has now launched features including team frequently used tables, "Guess what you will use" recommendations, most-viewed tables, most-read tables, data albums, search optimization, and an upgraded table description. Data albums now support multi-person collaborative maintenance, displaying and editing collection notes, and adding usage instructions. These features are of great value for finding, using, and maintaining data.

[Search & Recommendation] Search result filtering enhancement

  • The filter panel on the left side of the search results surfaces frequently used conditions for users to choose from, which improves filtering efficiency and search click-through rate (CTR).
  • Field search has been restored, and search filters support filtering by environment.

[Content & Organization] Table description upgrade

  • The table description editor has been upgraded to the Yuque editor and supports importing Yuque content, which helps resolve questions about metric definitions (caliber).

[Content & Organization] Data album

  • Data albums provide administrator functions and support multi-person collaborative maintenance.
  • Albums now support displaying and editing collection notes.
  • On the personal album details page, you can search by table description and collection notes.
  • Usage instructions can now be added to an album.

[Content & Organization] Data map connected with DataWorks data

  • A table can be identified as a model table in the map, and its model information is displayed.

1) Search recommendation

2) Data Album

Core tables are showcased in data albums, which makes them easy to find and use.

3) Album Description

Structured knowledge is managed centrally, with support for importing Yuque knowledge, so that data can be managed and maintained more effectively.

4) Data Bai Xiaosheng

Data knowledge is processed algorithmically so that users can look up and learn to use tables through question answering with a robot. To this end, we built an intelligent question-answering robot on top of our internal robot products.

**4 Summary and Reflections**

Through the FY22 Da Tao model governance project, which combined in-house development within Da Tao with co-construction with the DataWorks team and the Data Security Production Platform, we delivered the following key capabilities:

  1. Model evaluation system: We designed and defined the Da Tao model evaluation system, which provides multi-dimensional data health evaluation and table-level governance suggestions.
  2. Intelligent modeling: Together with the DataWorks intelligent modeling team, we launched major features such as data warehouse planning and dimensional modeling, enabling forward and reverse modeling of dimension tables, detail tables, light summary tables, and application-layer tables.
  3. Data map upgrade: Key features such as search recommendation, data albums, and table descriptions were upgraded, which greatly improves the efficiency of finding, using, and managing data.
  4. Collaboration rules and regulations: We defined clear rules such as quasi-common-layer sinking specifications, collaboration rules, handover procedures, and training mechanisms for new staff, which provide clear institutional guarantees.

**5 Follow-up Planning**

Da Tao model governance has achieved solid phased results in product co-construction, model evaluation, and efficiency improvement. Some issues remain unresolved:

  1. Unified architecture and specifications are hard to guarantee: different businesses understand Alibaba's big data system specification differently, so it is hard to keep the group's data architecture and specifications unified;
  2. The common business layer is relatively thin: for historical reasons, each business's common business layer is relatively weak, which puts business efficiency and metric consistency at risk under the new structure;
  3. The ADS layer keeps growing and its complexity is hard to control: Alibaba's big data system specification lacks rules for the application layer, the boundary between ADS and the common layer is unclear, and ADS complexity is hard to control;
  4. Lack of effective control: for data development and O&M, Alibaba's big data system specifications have been accumulated and continuously integrated into the data platform, but some standards cannot yet be enforced through the platform. For data governance, it is currently not possible to identify reliably whether a table is no longer in use, so developers do not dare to take tables offline and lack the capacity to do so;
  5. Data construction and data use are not fully connected: data development and data use are not yet linked end to end, and defined models and developed data are not yet effectively surfaced and managed in the data map.

The next phase will address these unresolved issues in more depth:

  • Da Tao model architecture

We will resolve the problems in the existing architecture and upgrade the methodology across architecture principles, design specifications, development specifications, O&M specifications, governance specifications, and co-construction mechanisms, so that it better fits the current state of data development and supports cost reduction and efficiency improvement at the architectural level.

  • Intelligent modeling

Continue co-construction with the DataWorks team to further improve development efficiency for the common layer and application layer, with guarantees at the product level.

  • Data map

Quick onboarding of official albums: Building an official album currently requires dedicated staff to configure and maintain it. In the future, we can consider lowering the onboarding cost and delegating onboarding and maintenance to each team, which would improve the richness and usability of data albums.

Further connect data development and data use: link intelligently modeled data with the data map so that core models can be quickly filtered and surfaced.

Improve table lookup and usage from multiple angles: through table descriptions, table question answering, data knowledge extraction, and similar capabilities, make it easier to find tables, use tables, and get answers about tables, combining text algorithms, robots, and other technologies and product capabilities to generate data knowledge intelligently.

  • Development assistant

The development assistant can be further upgraded in table recommendation and table management.

  • Upgrade of the Da Tao common-layer evaluation system

Building on the current model scores, add model lineage information and thicken the Da Tao common layer so that it provides better data support for the business.

Automatic table offlining: automate taking models, tables, and services offline, combined with offlining driven by expert experience, to improve the efficiency of data offlining and reduce the cost of manual intervention.
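
One way to picture the candidate-selection step of automatic offlining is the sketch below. The criteria used here (no downstream references and no reads within a configurable idle window) are assumptions for illustration, as are the field and function names; the article does not describe the rules that will actually be used.

```python
from datetime import date, timedelta


def offline_candidates(tables, today, idle_days=90):
    """Return names of tables that look safe to review for offlining.

    tables: iterable of dicts with the hypothetical keys
        'name'             -> table name
        'downstream_count' -> number of downstream tables in lineage
        'last_read'        -> date of the most recent read, or None
    A table is a candidate when nothing depends on it and it has not
    been read within the last idle_days days.
    """
    cutoff = today - timedelta(days=idle_days)
    candidates = []
    for t in tables:
        unused = t["last_read"] is None or t["last_read"] < cutoff
        if t["downstream_count"] == 0 and unused:
            candidates.append(t["name"])
    return candidates


if __name__ == "__main__":
    today = date(2022, 6, 1)
    sample = [
        {"name": "ads_old", "downstream_count": 0, "last_read": date(2021, 12, 1)},
        {"name": "ads_live", "downstream_count": 0, "last_read": date(2022, 5, 30)},
        {"name": "dwd_core", "downstream_count": 5, "last_read": None},
    ]
    print(offline_candidates(sample, today))
```

A list like this is best treated as input to expert review rather than as a trigger for deletion, which fits the combination of automation and expert experience described above.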

DataWorks intelligent data modeling product help document: https://help.aliyun.com/document\_detail/276018.html

The article is reproduced from Ali developer
