Introduction: Dataphin, an enterprise-level intelligent data construction and management product, provides full-link real-time development capabilities and has supported the real-time computing needs of Alibaba Group's Tmall Double 11 since 2019. This article introduces Dataphin's real-time computing capabilities in detail.
Cloud Data Center official website: https://dp.alibaba.com/index
Background
Whenever the bell rings for the Double 11 Global Shopping Festival, tens of millions of users flock to Tmall and Taobao. Behind the smooth shopping experience is the infrastructure Alibaba engineers have built with technology, which absorbs the data peaks Double 11 brings every year. From November 1st to 0:00 on November 12th, Tmall's Double 11 total transaction volume reached 498.2 billion yuan, and the total number of logistics orders reached 2.321 billion. None of this would be possible without real-time computing technology.
As an enterprise-level intelligent data construction and management product, Dataphin provides full-link real-time development capabilities and has supported the real-time computing needs of the Group's Tmall Double 11 since 2019. The following sections introduce Dataphin's real-time computing capabilities.
Traditional data warehouse architecture
In data warehouse construction, the offline warehouse is generally built first, with applications built around the offline data. Then, as the business grows or the experience is optimized, a real-time computing link is added to improve data timeliness.
In this process, similar code inevitably gets written twice, leading to problems such as inconsistent metric definitions between the real-time and offline links and doubled maintenance costs.
The separate stream and batch storage and computing of the traditional data warehouse architecture brings the following problems:
- Efficiency: Inconsistent underlying data models between the stream and batch links force a large amount of stitching logic into the application layer (year-on-year and period-on-period comparisons, secondary processing, etc.), making construction inefficient and error-prone
- Quality: One business logic runs on two engines with two sets of code; SQL logic cannot be reused, so data consistency and quality are hard to guarantee
- Cost:
  - Stream and batch storage systems are isolated (for different write scenarios) and expose different data services, so maintenance costs are high
  - Data synchronization tasks are built manually, with high development cost and doubled storage cost (two copies of the data)
  - Batch and stream processing clusters cannot stagger their peaks, so resource utilization is low
Advantages of Dataphin's unified stream-batch processing
To solve the separation of storage and computing in the traditional data warehouse architecture, Dataphin adopts the idea of unified stream-batch processing:
- Stream and batch storage is transparent and query logic is fully consistent, greatly reducing application-side access costs and uniformly supporting both point queries and OLAP analysis
- The service layer uses unified storage, with no manual synchronization and no duplicate copies
- One set of code drives two computing modes with unified logic and flexible switching, greatly improving development efficiency
- Stream and batch computing resources are pooled, improving resource utilization
On top of Flink's unified stream-batch capabilities, Dataphin provides additional platform capabilities such as data source management, metadata management, asset lineage, asset quality control, pre-compilation, and debugging:
- Development and production isolation: The development and production environments are isolated, so that business code developed in the development environment does not interfere with production
- Metadata management: All system components, including data sources, meta-tables, and UDXs, have access control, and sensitive configuration information is encrypted. Subscription-based access to sensitive data source fields is supported. Meta-tables, functions, resources, etc. are managed as unified, visualized assets, with cross-project, field-level authorization, allowing users to focus on business logic
- Unified stream-batch processing: The stream and batch storage layers are managed together, enabling a unified model layer, unified stream-batch code, independent stream/batch configuration, and independent yet coordinated scheduling of production instances
- Development efficiency:
  - Pre-compilation, providing syntax validation, permission validation, and field-lineage extraction
  - Containerized debugging, supporting uploaded custom data or direct consumption of real production data to observe job execution and inspect each node's output
  - Metadata search and visual exploration of job dependencies and field lineage
- Stability and quality assurance:
  - Traffic thresholds can be set to prevent excessive competition for computing resources and avoid overloading downstream systems
  - Real-time meta-table quality monitoring, with configurable statistical trend monitoring, real-time multi-link comparison, and real-time/offline data verification
Development and production isolation
Dataphin supports projects with development/production isolation and supports configuring separate data sources for the development and production environments. In development mode, a task automatically uses the development data source and the physical tables of the development environment; when published to production, Dataphin automatically switches to the production data source and the physical tables of the production environment. This process is fully automated, with no need to manually modify code or configuration.
Metadata management
Dataphin introduces the concepts of real-time meta-tables and mirror tables, managing all tables used in the real-time development process as unified platform assets, which simplifies development and improves efficiency and experience.
Traditional real-time development tools require users to repeatedly write CREATE TABLE statements and perform cumbersome input/output table mappings. The real-time meta-table centrally manages every table used in real-time development tasks, maintaining each meta-table and its schema in one place. Developers no longer need to write DDL statements repeatedly, nor perform complicated input, output, and dimension table mappings; in a simple pure-code development model, a short SET statement plus a permission application is enough to reference a table for direct queries or writes. A table is created once and referenced many times, greatly improving development efficiency and experience.
The mirror table, as the name suggests, maintains the field-level mapping between an offline table and a real-time table. Once a mirror table is created and published, its fields can be used in a unified stream-batch Flink task: at compile time, Dataphin automatically maps the mirror table to the underlying stream table or batch table, achieving one set of code with two computing modes, with code logic and metric definitions kept strictly consistent.
Unified stream-batch code tasks
In addition to real-time meta-tables and mirror tables, Dataphin supports unified stream-batch tasks, using Flink as the single stream-batch computing engine: a stream task and a batch task can be configured on the same piece of code, and instances for the two modes are generated from that one codebase. For code that must differ between stream and batch, Dataphin provides support in two ways.
Mirror tables are widely used in unified stream-batch tasks; a mirror table is ultimately translated into the corresponding stream table or batch table. To accommodate differences between the stream and batch tables (their data sources may differ, the keys in the WITH parameters may differ, and some settings such as batchSize may differ), table hints can be used to target the stream table or batch table specifically. The syntax is as follows:
```sql
set project.table.${mode}.${key};
-- mode is `stream` for stream tasks, `batch` for batch tasks
```
For example, to set the start and end time of a batch task:
```sql
set project.table.batch.startTime='2020-11-11 00:00:00';
set project.table.batch.endTime='2020-11-12 00:00:00';
```
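As an illustrative sketch of the pattern (the table name, the `batchSize` key, and the query below are hypothetical examples, not Dataphin defaults), a single stream-batch task could override per-mode options before querying a mirror table:

```sql
-- Hypothetical mirror table `dw.trade_order` used by one stream-batch task.
-- Batch mode reads a bounded time window:
set dw.trade_order.batch.startTime = '2020-11-11 00:00:00';
set dw.trade_order.batch.endTime = '2020-11-12 00:00:00';
-- Stream mode tunes an illustrative source option:
set dw.trade_order.stream.batchSize = '1024';

-- The business logic itself is written once and compiled for either mode:
INSERT INTO dw.gmv_summary
SELECT DATE_FORMAT(pay_time, 'yyyy-MM-dd') AS stat_date,
       SUM(pay_amount) AS gmv
FROM dw.trade_order
GROUP BY DATE_FORMAT(pay_time, 'yyyy-MM-dd');
```

At compile time the mirror table resolves to the stream table or batch table for the current mode, and only the matching `set` overrides apply.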
The second way is to configure task parameters separately for the stream and batch modes of a Dataphin task, and let parameter substitution handle the differences.
Real-time quality monitoring
Dataphin's real-time data quality features are aimed primarily at developers: the output tables of real-time tasks are analyzed and verified to ensure the validity and accuracy of the results. Dataphin supports statistical trend monitoring, real-time multi-link comparison, and real-time/offline data verification.
- Statistical trend monitoring: A monitoring method that catches abnormal fluctuations based on trend changes in the data and expert experience; for example, a sharp spike in the real-time GMV trend is likely abnormal
- Real-time multi-link comparison: In real-time computing, data recovery is expensive and quickly recomputing from the beginning is not feasible, so multiple parallel computing links are maintained, with automatic or manual failover between them when an anomaly occurs. This trades resources for stability and is typically used for major events; for example, the Double 11 big screen is protected by multiple links every year
- Real-time/offline verification: A common safeguard for real-time data. Real-time computation runs continuously over long periods and is sensitive to resource contention and source-data disturbances, while offline data benefits from reusable logic and data and is easier to operate reliably. To ensure the accuracy of real-time data, offline results are therefore commonly compared against the real-time results; for example, offline data is used to verify real-time data before Double 11 every year
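The real-time/offline verification above can be sketched as a simple comparison query (all table and column names here are hypothetical, not part of Dataphin):

```sql
-- Compare a day's real-time aggregate against the offline result and
-- compute the relative deviation; rows above a chosen threshold are flagged.
SELECT r.stat_date,
       r.gmv AS realtime_gmv,
       o.gmv AS offline_gmv,
       ABS(r.gmv - o.gmv) / o.gmv AS diff_ratio
FROM realtime_gmv_summary r
JOIN offline_gmv_summary o
  ON r.stat_date = o.stat_date
WHERE ABS(r.gmv - o.gmv) / o.gmv > 0.001;  -- e.g. alert on >0.1% deviation
```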
Dataphin behind the Double 11 big screen
Returning to the Tmall Double 11 scenario from the beginning of the article, and with the Dataphin platform's capabilities in mind, let's break down why Dataphin can support Tmall Double 11's real-time data big screen.
Fast
- Dataphin provides a one-stop service for the entire real-time link, covering development, debugging, testing, and operations, greatly lowering the development threshold;
- It also provides unified metadata management: metadata only needs to be initialized once, and a table created once can be referenced many times, letting developers focus on business logic and greatly improving development efficiency and experience;
- In addition, anyone with data development experience knows that many metric definitions are strikingly similar, some differing only in their input and output tables, as in the typical active/standby link scenario. For this, Dataphin provides development templates: the shared logic is encapsulated in the template, and the differences are expressed as template parameters. A new task only needs to reference the template and configure its parameters, greatly improving development efficiency and reducing the maintenance cost of metric definitions.
With these capabilities supporting the Double 11 big screen, even amid many business campaigns and an explosion of requirements, just two people were able to support hundreds of requests.
Stable
Dataphin provides task monitoring and data quality monitoring to keep tasks stable and surface problems quickly; template-based active/standby multi-link setups can fail over in seconds when an anomaly occurs, quickly stopping the bleeding; real-time task lineage helps locate the root cause quickly; and debugging, testing, and fine-grained resource configuration enable fast verification and repair, truly achieving detection in 1 minute, diagnosis in 5 minutes, and resolution in 10 minutes.
Accurate
With unified stream-batch processing, the code, metric definitions, storage, and data service interfaces are truly unified, improving development efficiency while guaranteeing data consistency.
Future plans
The upcoming Flink VVP (Ververica Platform) adapted edition will support the new VVR engine, and support for the open-source Flink engine is planned, enabling more deployment environments. Dataphin will also continue to improve its real-time development capabilities and experience, helping enterprises lower the barrier to real-time development, explore more scenarios, and capture the business value of real-time data!
The data middle office is the path enterprises must take to achieve digital intelligence. Alibaba sees the data middle office as a combination of methodology, tools, and organization: an intelligent big data system that is "fast", "accurate", "complete", "unified", and "connected".
Alibaba Cloud currently offers a range of data middle-office solutions, including a general data middle-office solution as well as solutions for retail, finance, Internet, government, and other industry scenarios.
The Alibaba Cloud data middle-office product matrix is built on Dataphin, with the Quick series as entry points into business scenarios, including:
- Dataphin, a one-stop intelligent data construction and management platform;
- Quick BI, intelligent decision-making anytime, anywhere;
- Quick Audience, comprehensive insight, omni-channel marketing, and intelligent growth;
- Quick A+, a one-stop data operation platform for cross-terminal, full-scope application experience analysis and insight;
- Quick Stock, an intelligent merchandise operation platform;
- Quick Decision, an intelligent decision platform.
Official site:
Data middle-office official website: https://dp.alibaba.com