Laidian Technology builds a unified-data-service, accelerated real-time data warehouse based on Flink + Hologres
Author: Chen Jianxin, data warehouse development engineer at Laidian Technology, currently focusing on integrating the offline and real-time architectures of Laidian Technology's big data platform.
Shenzhen Laidian Technology Co., Ltd. (hereinafter referred to as "Laidian Technology") is a pioneering enterprise in the shared power bank industry. Its main business covers self-service power bank charging, development of customized shopping mall navigation machines, advertising display equipment, and advertising distribution services. Laidian Technology has one of the most complete product lines in the industry, covering large, medium, and small cabinets as well as desktop models. Its services have landed in more than 90% of cities across the country, with more than 200 million registered users, meeting user needs in all scenarios.
1. Introduction to Big Data Platform
(1) Development history
The development of the Laidian Technology big data platform can be divided into the following three stages:
1. Discrete 0.X Greenplum
Why "discrete"? Because there was no unified big data platform to support data services at that time; each business line fetched data and ran calculations on its own, relying on a low-spec Greenplum offline service to meet day-to-day data needs.
2. Offline 1.0 EMR
Later, the architecture was upgraded to offline 1.0 EMR. EMR here refers to Alibaba Cloud's elastic, distributed big data hybrid cluster service, which includes common components such as Hadoop, Hive, and Spark for offline computing.
Alibaba Cloud EMR solved three pain points for us. First, storage and computing resources can be scaled horizontally. Second, it resolved the development and maintenance problems caused by heterogeneous data across business lines, with the platform handling cleansing and warehousing in a unified way. Third, we could build our own layered data warehouse system and divide subject domains, laying a solid foundation for our indicator system.
3. Real-time, unified 2.0 Flink+Hologres
The "Flink+Hologres" real-time data warehouse that is currently being experienced is also the core of this article. It has brought two qualitative changes to our big data platform, one is real-time computing, and the other is unified data services. Based on these two points, we accelerate the exploration of knowledge data and promote the rapid development of business.
(2) Platform capabilities
In general, the 2.0 version of the big data platform provides the following capabilities:
1) Data integration
The platform now supports real-time or offline integration of business databases or business data logs.
2) Data development
The platform now supports offline computing based on Spark and real-time computing based on Flink.
3) Data service
The data service is mainly composed of two parts: one part is the analysis service and ad hoc analysis capability provided by Impala, and the other part is the interactive analysis capability for business data provided by Hologres.
4) Data application
At the same time, the platform can be directly connected to common BI tools, and business systems can also be quickly integrated and connected.
(3) Achievements
The capabilities provided by the big data platform have brought us a lot of achievements, which can be summarized in the following five points:
1) Horizontal expansion
The core of the big data platform is the distributed architecture, so that we can scale storage or computing resources horizontally at low cost.
2) Resource sharing
The resources of all servers can be pooled. In the previous architecture, each business department maintained its own cluster, which wasted resources, made reliability hard to guarantee, and drove up operation and maintenance costs. Now the platform schedules resources in a unified way.
3) Data sharing
The platform integrates the business data of all departments as well as heterogeneous data sources such as business logs, and cleans and consolidates them in a unified way.
4) Service sharing
With the data shared, the platform exposes services in a unified way; each business line no longer needs to develop them repeatedly and can quickly obtain the data support the platform provides.
5) Security assurance
The platform provides unified security authentication and authorization mechanisms, enabling fine-grained authorization at different levels for different users to ensure data security.
2. Data requirements of enterprise business
With the rapid development of the business, building a unified real-time data warehouse became imminent. Taking into account the 0.x and 1.0 platform architectures and our judgment of current developments and future trends of the business, the requirements for building the 2.0 data platform concentrate on the following aspects:
1) Real-time large screen
Real-time large screens need to replace the old quasi-real-time large screens and adopt more reliable and low-latency technical solutions.
2) Unified data service
High-performance, highly concurrent, and highly available data services have become key to enterprise digital transformation; a unified data portal needs to be built for unified external output.
3) Real-time data warehouse
The importance of data timeliness in business operations has become increasingly prominent, requiring faster and more timely response.
3. Real-time data warehouse and unified data service technical solutions
(1) Overall technical architecture
The technical architecture is mainly divided into four parts, namely data ETL, real-time data warehouse, offline data warehouse and data application.
- Data ETL: business databases and business logs are processed in real time, with Flink handling all real-time computation.
- Real-time data warehouse: the data processed in real time is stored in Hologres for analysis.
- Offline data warehouse: business cold data is stored in the Hive offline data warehouse and synchronized to Hologres for further analysis and processing.
- Data application: Hologres directly serves commonly used BI tools, such as Tableau, Quick BI, and DataV, as well as the business systems.
(2) Real-time data warehouse data model
As shown above, the real-time data warehouse is similar in some respects to the offline data warehouse, except that it has fewer layers and links.
- The first layer is the original data layer. There are two types of data sources: the Binlog of the business databases and the business logs from the servers. Kafka is used as the storage medium.
- The second layer is the data detail layer. The data in the original-layer Kafka topics is extracted through ETL and stored back in Kafka as real-time detail data, so that different downstream consumers can subscribe to it at the same time and the application layer can use it later. Dimension table data is stored in Hologres to support subsequent data association and condition filtering (a Flink SQL sketch of this layer follows the list).
- The third layer is the data application layer. Besides serving this layer directly, Hologres is also connected to Hive, so Hologres provides the unified upper-layer application services.
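To make the layering concrete, below is a minimal Flink SQL sketch of the detail-layer ETL. The topic, table, column names, and connection placeholders (ods_order, dwd_order, dim_city, broker and endpoint addresses) are hypothetical, and connector option names may differ slightly depending on the Flink and connector versions in use.

```sql
-- ODS: raw order events from the business database Binlog, landed in Kafka (hypothetical schema)
CREATE TABLE ods_order (
    order_id    BIGINT,
    user_id     BIGINT,
    city_id     INT,
    pay_amount  DECIMAL(10, 2),
    order_time  TIMESTAMP(3),
    proc_time   AS PROCTIME()   -- processing time for the dimension lookup join
) WITH (
    'connector' = 'kafka',
    'topic' = 'ods_order',
    'properties.bootstrap.servers' = '<kafka-broker>:9092',
    'properties.group.id' = 'dwd_order_etl',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json'
);

-- Dimension table maintained in Hologres for association and condition filtering
CREATE TABLE dim_city (
    city_id    INT,
    city_name  STRING
) WITH (
    'connector' = 'hologres',
    'endpoint'  = '<hologres-endpoint>:80',
    'dbname'    = 'realtime_dw',
    'tablename' = 'dim_city',
    'username'  = '<access-id>',
    'password'  = '<access-key>'
);

-- DWD: cleaned, widened detail data written back to Kafka for multiple downstream subscribers
CREATE TABLE dwd_order (
    order_id    BIGINT,
    user_id     BIGINT,
    city_name   STRING,
    pay_amount  DECIMAL(10, 2),
    order_time  TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'dwd_order',
    'properties.bootstrap.servers' = '<kafka-broker>:9092',
    'format' = 'json'
);

INSERT INTO dwd_order
SELECT o.order_id, o.user_id, c.city_name, o.pay_amount, o.order_time
FROM ods_order AS o
JOIN dim_city FOR SYSTEM_TIME AS OF o.proc_time AS c
    ON o.city_id = c.city_id;
```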
(3) Data flow of overall technical architecture
The data flow diagram below makes the planning of the overall architecture and the end-to-end data flow of the warehouse model more concrete.
As can be seen from the figure, it is mainly divided into three modules: the first is integrated processing, the second is the real-time data warehouse, and the third is data application.
From the inflow and outflow of data, there are two main core points:
- The first core is Flink real-time computing: data can be consumed from Kafka, or MySQL Binlog data can be read directly through Flink CDC, and the results can be written back to the Kafka cluster (see the sketch after this list).
- The second core is the unified data service: the unified data service is now provided by Hologres, which avoids the problems of data silos and hard-to-maintain consistency, and also accelerates the analysis of offline data.
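As a hedged illustration of the first core point, the sketch below reads the MySQL Binlog directly with the open-source MySQL CDC connector and writes the change stream back to Kafka through the upsert-kafka connector. The database, table, and connection parameters are placeholders, and the exact connector names and options depend on the Flink CDC version in use.

```sql
-- Source: read the MySQL Binlog directly with the CDC connector (placeholder connection info)
CREATE TABLE mysql_orders (
    order_id    BIGINT,
    user_id     BIGINT,
    pay_amount  DECIMAL(10, 2),
    order_time  TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = '<mysql-host>',
    'port' = '3306',
    'username' = '<user>',
    'password' = '<password>',
    'database-name' = 'order_db',
    'table-name' = 'orders'
);

-- Sink: write the changelog back to the Kafka cluster as an upsert stream
CREATE TABLE kafka_orders (
    order_id    BIGINT,
    user_id     BIGINT,
    pay_amount  DECIMAL(10, 2),
    order_time  TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',
    'topic' = 'ods_orders',
    'properties.bootstrap.servers' = '<kafka-broker>:9092',
    'key.format' = 'json',
    'value.format' = 'json'
);

INSERT INTO kafka_orders
SELECT order_id, user_id, pay_amount, order_time FROM mysql_orders;
```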
4. Specific practice details
(1) Selection of big data technology
The solution is divided into two parts: real-time computing and analytic services. For real-time computing, we chose fully managed Alibaba Cloud Flink, which has the following advantages:
1) State management and fault tolerance mechanism;
2) Table API and Flink SQL support;
3) High throughput and low latency;
4) Exactly Once semantic support;
5) Flow batch integration;
6) Value-added services such as fully managed, maintenance-free operation.
For analytic services, we chose Alibaba Cloud Hologres interactive analytics, which brings several benefits:
1) Extremely fast analytic response;
2) High concurrent reading and writing;
3) Separation of computing and storage;
4) Simple and easy to use.
(2) Implementation of real-time large-screen business practice
The picture above compares the old and new real-time large-screen solutions.
Take orders as an example. In the old solution, orders were synchronized from the order replica database to another database through DTS. Although this synchronization is real-time, the computation and processing relied mainly on scheduled tasks, with scheduling intervals of 1 minute or 5 minutes, while sales and management need to grasp business dynamics in closer to real time, so it cannot be regarded as real-time in the true sense. In addition, slow and unstable responses were also a big problem.
The new solution uses Flink real-time computing + Hologres architecture.
Development can make full use of Flink's SQL support; compared with our previous MySQL-based development approach, the migration was essentially seamless and landed quickly. Data analysis and services uniformly use Hologres. Taking orders as an example, we track metrics such as today's order revenue and today's ordering users, and as the business diversifies it may be necessary to add a city dimension. With the analysis capabilities of Hologres, the rapid display of revenue, order volume, number of ordering users, and city-dimension indicators is well supported, for example with a query like the sketch below.
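For instance, a large-screen metric query against Hologres might look like the following. The table and column names (dwd_order, pay_amount, city_name, order_time) are hypothetical and correspond to the detail data written by the Flink jobs sketched earlier.

```sql
-- Today's revenue, order volume, and ordering users broken down by city (hypothetical schema)
SELECT
    city_name,
    SUM(pay_amount)         AS today_revenue,
    COUNT(*)                AS today_orders,
    COUNT(DISTINCT user_id) AS today_order_users
FROM dwd_order
WHERE order_time >= CURRENT_DATE
GROUP BY city_name
ORDER BY today_revenue DESC;
```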
(3) Implementation of real-time data warehouse and unified data services
Take a business scenario as an example: a relatively large business log stream with an average daily data volume at the TB level. Let's first analyze the pain points of the old solution:
- Poor data timeliness: Due to the large data volume, the old solution used an hourly offline scheduling strategy for computation. This has poor timeliness and cannot meet the real-time needs of many business products. For example, the hardware system needs to know the current status of devices in real time, such as alarms, errors, and empty slots, and make corresponding decisions in a timely manner.
- Data silos: The old solution used Tableau to connect a large number of business reports, which analyze how many devices reported in the past hour or the past day and which devices reported abnormalities. For different scenarios, the data previously computed offline by Spark was copied to MySQL or Redis. In this way, multiple systems formed data silos, which posed a huge challenge to platform maintenance.
Now, with the 2.0 Flink + Hologres architecture, the business log pipeline has been transformed.
- The TB-level log volume is handled without pressure under Flink's low-latency computation. For example, the previous Flume-to-HDFS-to-Spark link was abandoned and replaced directly by Flink, so we only need to maintain one Flink computing framework.
- The collected device status data is all unstructured. It is cleaned and then written back to Kafka, so that the diverse downstream consumers can subscribe to it at the same time.
- In the scenario above, the hardware system requires highly concurrent, real-time queries of the status of tens of millions of devices (power banks), which demands strong serving capability. Hologres provides highly concurrent read and write capability, and a primary-key table keyed on the device is created so that the status can be updated in real time, meeting the CRM system's real-time queries of devices (power banks) (see the sketch after this list).
- At the same time, Hologres also stores the latest hot detail data and directly provides external services.
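A minimal sketch of such a device-status table in Hologres is shown below, assuming hypothetical table and column names. The `set_table_property` calls follow Hologres's PostgreSQL-compatible DDL; which properties to set depends on the actual workload.

```sql
-- Row-oriented, primary-key table so that point queries and real-time upserts stay fast
BEGIN;
CREATE TABLE device_status (
    device_id     BIGINT NOT NULL,
    status        TEXT,            -- e.g. normal / alarm / error / empty-slot
    updated_time  TIMESTAMPTZ,
    PRIMARY KEY (device_id)
);
CALL set_table_property('device_status', 'orientation', 'row');
CALL set_table_property('device_status', 'distribution_key', 'device_id');
COMMIT;

-- Each status event overwrites the previous one for the same device (upsert by primary key)
INSERT INTO device_status (device_id, status, updated_time)
VALUES (10001, 'alarm', now())
ON CONFLICT (device_id) DO UPDATE
SET status       = EXCLUDED.status,
    updated_time = EXCLUDED.updated_time;

-- The CRM / hardware system then performs high-concurrency point lookups such as:
SELECT status, updated_time FROM device_status WHERE device_id = 10001;
```

In practice the Flink job would perform this upsert through the Hologres connector rather than hand-written INSERT statements; the statement above only illustrates the primary-key update semantics.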
(4) Business support effect
Through the new solution of Flink+Hologres, we support three scenarios:
1) Real-time large screen
At the business level, diversified requirements can be iterated more efficiently while development, operations, and maintenance costs are reduced.
2) Unified data service
An HSAP (Hybrid Serving/Analytical Processing) system achieves the integration of serving and analysis, avoiding data silos as well as consistency and security issues.
3) Real-time data warehouse
Meet the increasingly high requirements of data timeliness in business operations, and respond in seconds.
5. Future plans
As the business iterates, our future plans for the big data platform focus on two points: stream-batch unification and a more complete real-time data warehouse.
- The current big data platform is still largely a mixture of offline and real-time architectures. With the help of Flink's unified stream-batch computing engine, the redundant offline code paths will be retired in the future.
- In addition, only part of the business has been migrated so far, so we will draw on the mature indicator system of the previous offline data warehouse to guide the construction of the real-time data warehouse and fully migrate to the 2.0 Flink + Hologres architecture.
Through this planning, we hope to build a more complete real-time data warehouse with fully managed Flink and Hologres, and we also have further expectations for them:
(1) Expectations for fully managed Flink
The SQL editor in fully managed Flink makes writing Flink SQL jobs efficient and convenient, and it provides many common upstream and downstream SQL connectors to meet development needs. However, there are still some capabilities we hope fully managed Flink will support in subsequent iterations:
- Version control and compatibility checking for SQL jobs;
- SQL job integration with Hive 3.x;
- More convenient DataStream job packaging and faster resource package uploads;
- Automatic tuning for jobs deployed in session cluster mode.
(2) Expectations for Hologres interactive analytics
Hologres not only supports highly concurrent real-time writes and queries, but is also compatible with the PostgreSQL ecosystem, which makes it easy to access as a unified data service. However, there are still some capabilities we hope Hologres will support in later iterations:
- Support hot upgrade operations to reduce the impact on the business;
- Support data table backup, support read-write separation;
- Support accelerated query of Alibaba Cloud EMR-Hive data warehouse;
- Support computing resource management for user groups.