An introduction to Vipshop's containerized Flink practice, Flink SQL platform construction, and application cases on real-time data warehouses and the experimental platform.
Reposted from the dbaplus community WeChat public account.
Author: Wang Kang, Senior Development Engineer of Vipshop Data Platform
GitHub address
https://github.com/apache/flink
Everyone is welcome to like and star the Flink project~
Since 2017, in order to ensure the smooth operation of internal business both in normal times and during big promotions, Vipshop has been building a high-performance, stable, reliable, and easy-to-use real-time computing platform based on Kubernetes. The platform currently supports mainstream frameworks such as Flink, Spark, and Storm.
This article shares the practical application and productization experience of Vipshop's Flink containerization in five parts:
- Development overview
- Flink containerization practice
- Flink SQL platform construction
- Application cases
- Future plans
1. Development overview
1. Cluster size
In terms of cluster scale, we have 2,000+ physical machines and mainly deploy remote active-active Kubernetes clusters, using Kubernetes namespaces, labels, and taints to achieve business isolation and preliminary isolation of computing loads.
Counting Flink tasks, Flink SQL tasks, Storm tasks, and Spark tasks, our online real-time applications add up to more than 1,000. At present, we mainly support Flink SQL, because SQL-ization is a trend, and we need to support putting SQL tasks online through the platform.
2. Platform architecture
We analyze the overall architecture of the real-time computing platform from bottom to top:
- Resource scheduling layer (bottom layer)
Kubernetes is actually run in Deployment mode. Although the platform supports YARN scheduling, the YARN clusters share resources with batch tasks, so mainstream tasks still run on Kubernetes; the YARN scheduling layer is mainly a set of YARN clusters deployed offline. In 2017, we developed our own Flink-on-Kubernetes solution. Because the underlying scheduling is divided into these two layers, resources can be lent between real-time and offline workloads when resources are tight during a big promotion.
- Storage layer
It mainly supports the company's internal Kafka-based real-time data (VMS), binlog-based VDP data, and native Kafka as the message bus. State is stored on HDFS, and data is mainly written to Redis, MySQL, HBase, Kudu, HDFS, ClickHouse, etc.
- Computing engine layer
Mainly Flink, Storm, and Spark; the main product is currently Flink. Each framework supports images of several versions to meet different business needs.
- Real-time platform layer
It mainly provides job configuration, scheduling, version management, container monitoring, job monitoring, alerting, logging, and other functions, and provides multi-tenant resource management (quotas, label management) and Kafka monitoring. Resource allocation also distinguishes big-promotion days from normal days: both the resources and the permission controls on them differ between the two. Before Flink 1.11, the platform's self-built metadata management system managed the schemas for Flink SQL; from 1.11 on, it is integrated with the company's metadata management system through the Hive metastore.
- Application layer
It mainly supports scenarios such as real-time large screens, recommendation, the experimental platform, real-time monitoring, and real-time data cleaning.
2. Flink containerization practice
1. Containerization plan
The above is the architecture diagram of the real-time platform's Flink containerization. Flink containerization is deployed based on the Standalone mode.
Our deployment mode has three roles: Client, JobManager, and TaskManager. Each role is controlled by a Deployment.
Users upload task jar packages, configurations, etc. through the platform, which are stored on HDFS. The configurations and dependencies maintained by the platform are also stored on HDFS. When a pod starts, it performs initialization operations such as pulling these artifacts.
The main process in the Client is an agent developed in Go. When the Client starts, it first checks the cluster status; when the cluster is ready, it pulls the jar package from HDFS and submits the task to the cluster. The Client's main responsibility is fault tolerance: it monitors the task status and performs operations such as savepoints.
A smart-agent deployed on each physical machine collects container metrics and writes them to M3; in addition, Flink metrics are written to Prometheus through the interfaces Flink exposes, combined with Grafana for display. Similarly, vfilebeat deployed on each physical machine collects the mounted logs and writes them to ES, and log retrieval can be done in Dragonfly.
1) Flink platformization
In the course of practice, platformization must be considered in combination with specific scenarios and ease of use.
2) Flink stability
In the process of application deployment and operation, exceptions are inevitable. The platform therefore needs strategies to ensure that tasks remain stable after an abnormal situation occurs.
Health and availability:
These are detected by livenessProbe and readinessProbe; at the same time, a pod restart policy is specified, so Kubernetes itself can pull a pod back up.
When an exception occurs in a Flink task:
- Flink has its own restart strategy and failover mechanism, which is the first layer of protection.
- The Client regularly monitors the Flink status, updates the latest checkpoint address to its own cache, reports it to the platform, and solidifies it into MySQL. When Flink can no longer restart itself, the Client resubmits the task from the latest successful checkpoint. This is the second layer of protection. Once the checkpoint is solidified into MySQL, the Flink HA mechanism is no longer used, and the dependency on the ZooKeeper component is reduced.
- When the first two layers fail to restart the task or the cluster is abnormal, the platform automatically pulls up a new cluster from the latest checkpoint solidified in MySQL and submits the task. This is the third layer of protection.
Data center disaster recovery:
- Users' jar packages and checkpoints are both stored on dual HDFS clusters in different locations.
- Dual clusters run in two remote data centers.
2. Kafka monitoring solution
Kafka monitoring is a very important part of task monitoring. The overall process is as follows:
The platform provides monitoring of Kafka backlog. On the interface, users can configure their own Kafka monitoring, specifying which cluster a topic is in, consumer-group information, and so on. The user's Kafka monitoring configuration is extracted from MySQL, and Kafka is then monitored through JMX. After the metrics are collected, they are written to a downstream Kafka topic, and another Flink task monitors and alerts on them in real time. At the same time, the data is synchronously written into ClickHouse to provide feedback to users (Prometheus could also be used for monitoring, but ClickHouse fits better here), and finally Grafana is used to display it to users.
3. Flink SQL platform construction
With the Flink containerization solution in place, it was time to start building the Flink SQL platform. As everyone knows, developing with the streaming API still carries a certain cost. Flink is certainly faster than Storm, and relatively more stable and easier to use, but for some users, especially Java development students, there is still a certain threshold.
After implementing Flink containerization on Kubernetes, releasing Flink API applications became convenient, but Flink SQL tasks were still not. Therefore, the platform provides a more convenient one-stop development platform with online editing and publishing, SQL management, and so on.
1. Flink SQL solution
The platform's Flink SQL solution is shown in the figure above. The task publishing system and the metadata management system are completely decoupled.
1) Flink SQL task publishing platform
In practice, ease of use had to be considered in doing this platform work. The main operation interface is shown in the following figure:
- Flink SQL version management, syntax validation, topology graph management, etc.;
- UDF management at the general and task level, with support for user-defined UDFs;
- A parameterized configuration interface that makes it easy for users to bring tasks online.
The following figure is an example of user interface configuration:
The following figure is an example of a cluster configuration:
2) Metadata management
Before 1.11, the platform built its own metadata management system, UDM: MySQL stored the schemas of Kafka, Redis, and so on, and Flink was connected to UDM through custom catalogs to achieve metadata management.
After 1.11, as Flink's Hive integration gradually matured, the platform restructured the Flink SQL framework and deployed a SQL-gateway service, which calls a self-maintained SQL-Client jar package in the middle. This connects to the offline metadata and unifies real-time and offline metadata, laying a solid foundation for the subsequent integration of streaming and batch.
The interface for creating Flink tables in the metadata management system is shown in the following figure: the metadata of a Flink table is created and persisted in Hive, and when a Flink SQL job starts, it reads the table schema information of the corresponding table from Hive.
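To make the flow concrete, here is a minimal sketch of this pattern using Flink's Hive catalog support; the catalog name, configuration path, topic, and columns are illustrative assumptions rather than the platform's actual setup:

```sql
-- Register a Hive catalog so table metadata persists in the Hive metastore
-- ('/opt/hive-conf' is an assumed path to the directory holding hive-site.xml).
CREATE CATALOG hive_catalog WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/opt/hive-conf'
);

USE CATALOG hive_catalog;

-- The Kafka table DDL is stored in Hive; any Flink SQL job that starts later
-- reads this schema back from the metastore instead of re-declaring it.
CREATE TABLE orders_src (
  order_id  BIGINT,
  goods_id  BIGINT,
  amount    DECIMAL(10, 2),
  proc_time AS PROCTIME()  -- processing-time attribute, used for lookup joins
) WITH (
  'connector' = 'kafka',
  'topic'     = 'orders',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format'    = 'json'
);
```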
2. Flink SQL related practices
The platform integrates and develops connectors, both those the community supports natively and those it does not yet support, and decouples the image from connector, format, and other dependencies, allowing quick updates and iterations.
1) Flink SQL functionality
Flink SQL functionality is mainly divided into the following three layers:
Connector layer:
- Supports the VDP connector for reading source data;
- Supports sink & dimension table association for Redis data types such as string and hash (a lookup join sketch follows this list);
- Supports the Kudu connector & catalog & dimension table association;
- Supports the protobuf format for parsing real-time cleaned data;
- Supports the VMS connector for reading source data;
- Supports the ClickHouse connector for high-TPS sink writes to distributed and local tables;
- The Hive connector supports a watermark-based partition commit policy & complex data types such as array and decimal.
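Whatever the backing store, dimension table association in Flink SQL uses the lookup join syntax. A minimal sketch, with illustrative table and column names (the Redis/Kudu lookup connectors above are in-house, so the connector behind `dim_goods` is assumed):

```sql
-- Enrich the stream with dimension data at processing time; dim_goods stands
-- in for an in-house Redis/Kudu lookup table.
SELECT
  o.order_id,
  o.amount,
  g.category_name
FROM orders_src AS o
LEFT JOIN dim_goods FOR SYSTEM_TIME AS OF o.proc_time AS g
  ON o.goods_id = g.goods_id;
```

The `FOR SYSTEM_TIME AS OF o.proc_time` clause requires a processing-time attribute on the source, which `orders_src` in the earlier sketch declares via `proc_time AS PROCTIME()`.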
Runtime layer:
- Mainly supports modification of the topology execution plan;
- keyBy cache optimization for dimension table association, to improve query performance;
- Delayed join for dimension table association.
Platform layer:
- Hive UDFs;
- Supports json and HLL-related processing functions;
- Supports Flink runtime parameter settings such as minibatch and aggregation optimization parameters (a sketch follows this list);
- Flink upgraded to Hadoop 3.
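As a sketch of the mini-batch and aggregation knobs mentioned above, these are Flink's standard table configuration options (the values are illustrative and should be tuned per workload; the quoted SET syntax assumes a newer SQL client):

```sql
-- Buffer input records briefly and fire aggregations in small batches,
-- trading a little latency for much higher throughput.
SET 'table.exec.mini-batch.enabled' = 'true';
SET 'table.exec.mini-batch.allow-latency' = '2 s';
SET 'table.exec.mini-batch.size' = '5000';
-- Split aggregation into local + global phases to mitigate hot keys.
SET 'table.optimizer.agg-phase-strategy' = 'TWO_PHASE';
```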
2) Topology execution plan modification
Since the parallelism of the stream graph generated from SQL cannot be modified at this stage, the platform provides a modifiable preview of the topology for adjusting related parameters. The platform presents the parsed Flink SQL execution plan JSON to the user, uses uids to guarantee operator uniqueness, and lets users modify the parallelism, chaining strategy, and so on of each operator, giving users a way to resolve backpressure problems. For example, in the ClickHouse sink scenario of small concurrency with large batches, we support modifying the ClickHouse sink parallelism: with source parallelism = 72 and sink parallelism = 24, ClickHouse sink TPS is increased.
3) keyBy-optimized cache for dimension table association
In dimension table association, to reduce the number of IO requests and the read pressure on the dimension table database, thereby reducing latency and increasing throughput, there are three measures:
The following is a diagram of the keyBy-optimized cache for dimension table association:
Before optimization, the dimension table's LookupJoin operator and the normal operators are chained together. After optimization, the LookupJoin operator and the normal operators are no longer chained, and the join key is used as the key of the hash strategy.
With this optimization, take a dimension table of 30 million rows and 10 TM nodes as an example: previously each node needed to cache all 30 million rows, 300 million in total. After the keyBy optimization, each TM node only needs to cache 30 million / 10 = 3 million rows, and the total cached data is only 30 million, which greatly reduces the amount of cached data.
4) Delayed join for dimension table association
In dimension table association, there are many business scenarios where the mainstream data arrives and is joined before the corresponding dimension table data has been written, so the association misses. Therefore, to ensure data correctness, the unmatched data is cached and a delayed join is performed.
The simplest way is to set a number of retries and a retry interval in the dimension table association function. This method increases the latency of the whole stream, but it solves the problem when the mainstream QPS is not high.
Another way is to add a delayed-join operator: when a record fails to join the dimension table, it is cached first, and delayed joins are retried according to the configured number of retries and retry interval.
4. Application cases
1. Real-time data warehouse
1) Real-time data warehousing
The real-time data warehouse is mainly divided into three processes:
- After the first-level Kafka of traffic data goes through real-time data cleaning, the data is written to the second-level cleaned Kafka, mainly in protobuf format, and then written into a Hive 5-minute table through Flink SQL, for subsequent quasi-real-time ETL and to accelerate the preparation time of the ODS-layer data sources.
- The data of the MySQL business database is parsed by VDP into a binlog CDC message stream and then written into a Hive 5-minute table through Flink SQL. Partitions are committed with a custom commit policy, the partition status is reported to a service interface, and finally offline scheduling is triggered.
- The business systems produce business Kafka message streams through the VMS API, which are parsed by Flink SQL and written into Hive 5-minute tables. String, json, csv, and other message formats are supported.
Using Flink SQL for streaming data warehousing is very convenient, and version 1.12 already supports automatic merging of small files, which solves a very common pain point in the big data layer.
We customized the partition commit policy: when the current partition is ready, the real-time platform's partition commit API is called, and offline scheduling checks through this API whether the partition is ready. A minimal sketch of such a 5-minute Hive sink follows.
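This sketch uses Flink's Hive streaming sink options; the table, columns, topic, and the custom policy class name are illustrative assumptions ('metastore' and 'success-file' are the built-in commit policy kinds):

```sql
-- Define the Hive table with a partition commit policy (Hive dialect).
SET 'table.sql-dialect' = 'hive';
CREATE TABLE ods.traffic_5min (
  log_id STRING,
  event  STRING
) PARTITIONED BY (dt STRING, hm STRING) STORED AS ORC TBLPROPERTIES (
  'partition.time-extractor.timestamp-pattern' = '$dt $hm:00',
  'sink.partition-commit.trigger' = 'partition-time',
  'sink.partition-commit.delay' = '5 min',
  -- A custom policy class (hypothetical name) can report partition readiness
  -- to the platform's API, alongside the built-in policies.
  'sink.partition-commit.policy.kind' = 'metastore,success-file,custom',
  'sink.partition-commit.policy.class' = 'com.example.ReportToPlatformPolicy'
);

-- Stream from the cleaned Kafka topic into the partitioned table
-- (cleaned_traffic_kafka and its event_time column are assumed; the hm
-- bucketing is simplified to per-minute here for brevity).
SET 'table.sql-dialect' = 'default';
INSERT INTO ods.traffic_5min
SELECT
  log_id,
  event,
  DATE_FORMAT(event_time, 'yyyy-MM-dd'),
  DATE_FORMAT(event_time, 'HH:mm')
FROM cleaned_traffic_kafka;
```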
After adopting the Flink SQL unified warehouse entry program, we can obtain the following results:
- First, we not only solved the instability of the previous Flume solution, but also enabled users to do self-service warehousing, which greatly reduces the maintenance cost of warehousing tasks while keeping stability guaranteed.
- Second, we improved the timeliness of the offline warehouse from hourly granularity to 5-minute granularity.
2) Real-time index calculation
- A real-time application consumes the cleaned Kafka, performs association through Redis dimension tables, APIs, etc., then incrementally calculates UV through Flink windows and persists the results to HBase (a windowed-UV sketch follows this list).
- A real-time application consumes the VDP message stream, performs association through Redis dimension tables, APIs, etc., then calculates sales and other related indicators through Flink SQL and upserts them into Kudu, which is convenient for batch queries by range partition; finally, a data service provides the results to the real-time big screen.
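A minimal sketch of incremental UV over a tumbling window in Flink SQL (table and column names are illustrative; the production job persists results to HBase instead of returning them directly):

```sql
-- Count distinct users per one-minute tumbling window on processing time.
SELECT
  TUMBLE_START(proc_time, INTERVAL '1' MINUTE) AS window_start,
  COUNT(DISTINCT user_id) AS uv
FROM cleaned_traffic
GROUP BY TUMBLE(proc_time, INTERVAL '1' MINUTE);
```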
In the past, indicator calculation was usually done with Storm and required custom API development. After adopting the Flink solution, we obtained the following results:
- Switching the calculation logic to Flink SQL makes changes to the task caliber fast, solving the problem of slow modification and long release cycles;
- Flink SQL enables rapid modification and rapid release, reducing maintenance costs.
3) Real-time and offline integrated ETL data integration
The specific process is shown in the figure below:
Flink SQL has continuously strengthened its dimension table join capability in recent versions: it can associate not only dimension table data in databases in real time, but also dimension table data in Hive and Kafka, flexibly meeting different workload and timeliness needs.
Based on Flink's powerful streaming ETL capabilities, we can do data access and data conversion uniformly in the real-time layer, and then feed the detail-layer data back to the offline data warehouse.
We introduced the HyperLogLog (hereafter HLL) implementation used by Presto into a Spark UDAF, enabling HLL objects to interoperate between Spark SQL and the Presto engine. For example, HLL objects generated by Spark SQL through the prepare function can be merge-queried not only in Spark SQL but also in Presto.
The specific process is as follows:
UV approximate calculation example:
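A hedged sketch of such an approximate UV calculation, in Presto syntax (table and column names are illustrative; the in-house Spark `prepare` UDAF is assumed to emit sketches compatible with Presto's HLL type):

```sql
-- Build one HLL sketch per day and page, stored as varbinary.
CREATE TABLE uv_sketch AS
SELECT
  dt,
  page_id,
  CAST(approx_set(user_id) AS varbinary) AS hll
FROM page_views
GROUP BY dt, page_id;

-- Merge sketches across pages for site-wide daily UV (approximate).
SELECT
  dt,
  cardinality(merge(CAST(hll AS HyperLogLog))) AS uv
FROM uv_sketch
GROUP BY dt;
```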
2. Experimental platform (Flink real-time data into OLAP)
Vipshop's experimental platform is an integrated platform that provides A/B-test effect analysis over massive data through configurable multi-dimensional analysis and drill-down analysis. An experiment consists of a stream of traffic (such as user requests) and modifications applied to that stream for comparison. The experimental platform requires low latency and fast response for queries over ultra-large-scale data (tens of billions of rows).
The overall data structure is as follows:
- Offline data is imported into ClickHouse through Waterdrop;
- Real-time data in Kafka is cleaned, parsed, and so on through Flink SQL, associated with product attributes through Redis dimension tables, and written to ClickHouse through distributed tables; it is then queried ad hoc through the data service, which provides external interfaces.
The business data flow is as follows:
Our experimental platform has a very important ES scenario: after we launch an application scenario, if we want to see the effect, including the exposures, clicks, add-to-carts, and favorites generated by the launch, we need to write the details of each record, such as the diverted data, into ClickHouse partitioned by scenario.
Through the Flink SQL Redis connector, we support Redis sink and source dimension table association operations, which makes it easy to read and write Redis and to configure caching for dimension table association, greatly improving the application's TPS. Flink SQL implements the real-time data pipeline and finally sinks the large wide table into ClickHouse, sharded by murmurHash3\_64 on a certain field, which guarantees that the data of the same user is stored on the same shard node group. This turns joins between large ClickHouse tables into joins between local tables, reducing data shuffling and improving join query efficiency. A sketch of this layout follows.
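A minimal sketch of such a ClickHouse layout (cluster, database, table, and column names are illustrative assumptions):

```sql
-- Local table on every shard.
CREATE TABLE exp.events_local ON CLUSTER ck_cluster (
  dt      Date,
  user_id UInt64,
  scene   String,
  clicks  UInt32
) ENGINE = MergeTree()
PARTITION BY dt
ORDER BY (scene, user_id);

-- Distributed table sharded by murmurHash3_64(user_id): rows of the same
-- user land on the same shard, so joins on user_id can stay node-local.
CREATE TABLE exp.events_all ON CLUSTER ck_cluster AS exp.events_local
ENGINE = Distributed('ck_cluster', 'exp', 'events_local', murmurHash3_64(user_id));
```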
5. Future planning
1. Improve the ease of use of Flink SQL
Flink SQL is a little different from what Hive users are used to: both Hive and Spark SQL target batch-processing scenarios.
So currently our Flink SQL debugging still has many inconveniences. For users coming from offline Hive, there is a certain usage threshold, such as having to manually configure Kafka monitoring and do task pressure testing and tuning. Therefore, how to lower the usage threshold, so that users only need to understand SQL or the business, shielding the concepts of Flink SQL from users and simplifying the usage process, is a considerable challenge.
In the future, we will consider intelligent monitoring that informs users of problems in their current tasks without requiring them to learn too much, automating as much as possible and giving users optimization suggestions.
2. Implementation of the Data Lake CDC analysis plan
On the one hand, our data lake work is mainly aimed at the real-time update scenario of our binlog. At present, our VDP binlog message stream is written to the Hive ODS layer through Flink SQL to speed up the preparation time of the ODS-layer data sources, but it generates a lot of duplicate messages that need to be de-duplicated and merged. We will therefore consider a Flink + data lake CDC solution for incremental warehousing.
On the other hand, we hope to replace Kudu with the data lake. Part of our important business here uses Kudu; although Kudu is not widely used, its operation and maintenance is much more complicated than that of ordinary databases and its community is relatively small, while the Kafka message streams after order widening and the aggregation results require very strong real-time upsert capability. So we began to investigate the CDC + data lake solution, using its incremental upsert capability to replace the Kudu incremental-upsert scenario.
Q&A
Q1: Is the VDP connector used for reading the MySQL binlog? Is it a tool like Canal?
A1: VDP is the company's binlog synchronization component, which parses the binlog and sends it to Kafka. It is based on secondary development of Canal. We defined a CDC format that can connect to the company's VDP Kafka data source, similar to the Canal CDC format. It is currently not open source; it is our company's internal binlog synchronization solution. For reference, a sketch of consuming a Canal-style CDC topic with Flink's built-in canal-json format follows.
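A minimal sketch (topic and columns are illustrative; the in-house VDP format is similar to, but not identical with, canal-json):

```sql
-- Flink's built-in canal-json format reads a Canal-style CDC topic
-- as a changelog stream of inserts, updates, and deletes.
CREATE TABLE mysql_orders_cdc (
  order_id BIGINT,
  status   STRING,
  amount   DECIMAL(10, 2)
) WITH (
  'connector' = 'kafka',
  'topic'     = 'vdp_orders_binlog',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format'    = 'canal-json'
);
```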
Q2: UV data is output to HBase and sales data to Kudu. What is the main reason for outputting to different data sources?
A2: Kudu's application scenarios are not as extensive as HBase's. Real-time UV writing has relatively high TPS, and HBase is more suitable for single-row queries: writing to HBase gives high throughput + low latency, and small-range query latency is low. Kudu has some OLAP characteristics: it can store order details, accelerate scans with columnar storage, and be combined with Spark, Presto, etc. for OLAP analysis.
Q3: How do you solve the ClickHouse data update problem, for example updating data indicators?
A3: ClickHouse merges asynchronously, and merging only happens within the same partition of the same shard and node, so it is weakly consistent. Using ClickHouse for indicator-update scenarios is not recommended. If there is a strong need for updates in ClickHouse, you can try the AggregatingMergeTree approach: replace update with insert and do field-level merging, as in the sketch below.
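A minimal sketch of that approach (table and column names are illustrative):

```sql
-- Updates become inserts; ClickHouse merges fields in the background.
CREATE TABLE metrics (
  dt     Date,
  key    String,
  sales  SimpleAggregateFunction(sum, UInt64),
  latest SimpleAggregateFunction(anyLast, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY dt
ORDER BY (dt, key);

-- Query with explicit aggregation, since background merges are asynchronous.
SELECT dt, key, sum(sales) AS sales, anyLast(latest) AS latest
FROM metrics
GROUP BY dt, key;
```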
Q4: How do you ensure data deduplication and consistency when writing the binlog?
A4: We have not yet written binlog data into ClickHouse; that scheme does not seem mature. It is not recommended; you can use a CDC + data lake solution instead.
Q5: If ClickHouse writes are uneven across nodes, how do you monitor and resolve it? How do you inspect data skew?
A5: You can monitor the written data volume and size of each partition of each table on each machine through ClickHouse's system.parts local table, to view the data distribution and thereby locate a specific table, machine, or partition. A sketch of such a query follows.
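A hedged sketch of such a check (the cluster name and database filter are illustrative; `clusterAllReplicas` needs a reasonably recent ClickHouse version, otherwise run the inner query on each node):

```sql
-- Rows and on-disk size per host/table/partition, largest first.
SELECT
  hostName() AS host,
  table,
  partition,
  sum(rows) AS rows,
  formatReadableSize(sum(bytes_on_disk)) AS size
FROM clusterAllReplicas('ck_cluster', system.parts)
WHERE active AND database = 'exp'
GROUP BY host, table, partition
ORDER BY rows DESC;
```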
Q6: How do you perform task monitoring and health checks on the real-time platform? How are tasks automatically recovered after an error? Do you use yarn-application mode, with one YARN application corresponding to multiple Flink jobs?
A6: For Flink 1.12+, the PrometheusReporter can expose Flink metrics, such as operator watermarks and checkpoint-related indicators like size, duration, and failure counts, which are then collected and stored for task monitoring and alerting.
Flink's native restart strategy and failover mechanism serve as the first layer of protection.
The Client regularly monitors the Flink status, updates the latest checkpoint address to its own cache, reports it to the platform, and solidifies it into MySQL. When Flink can no longer restart itself, the Client resubmits the task from the latest successful checkpoint, as the second layer of protection. Once the checkpoint is solidified into MySQL, the Flink HA mechanism is no longer used, and the dependency on the ZooKeeper component is removed.
When the first two layers fail to restart the task or the cluster is abnormal, the platform automatically pulls up a new cluster from the latest checkpoint solidified in MySQL and submits the task, as the third layer of protection.
We support the yarn-per-job mode, but mainly we deploy standalone clusters based on the Flink-on-Kubernetes mode.
Q7: Are the components of your big data platform currently all containerized, or is it a mixed deployment?
A7: At present, our real-time computing frameworks such as Flink, Spark, Storm, and Presto are containerized. For details, see the platform architecture in section 1.2 above.
Q8: Kudu is not running on Kubernetes, right?
A8: Kudu does not run on Kubernetes; there is currently no particularly mature solution for that. Kudu is operated and maintained through Cloudera Manager, so there is no need to move it to Kubernetes.
Q9: The real-time data warehouse puts dimension tables in ClickHouse and then queries ClickHouse. Is this solution feasible?
A9: This is feasible and worth trying. Both fact table and dimension table data can be stored in ClickHouse, hashed by a certain field (such as user\_id) to achieve the effect of a local join.