An introduction to Flink's containerization practice and productization experience at Vipshop.

Since 2017, Vipshop has built a high-performance, stable, reliable, and easy-to-use real-time computing platform based on k8s to support the smooth operation of Vipshop's internal business both day to day and during major promotions. The platform currently supports mainstream frameworks such as Flink, Spark, and Storm. This article mainly shares Flink's containerization practice and productization experience. The content includes:

  1. Development overview
  2. Flink containerization practice
  3. Flink SQL platform construction
  4. Application cases
  5. Future plans

GitHub address
https://github.com/apache/flink
Everyone is welcome to like Flink and give it a star~

1. Development overview

The platform supports real-time computing applications for all departments within the company. The main business includes real-time dashboards (large screens), recommendation, the experimentation platform, real-time monitoring, and real-time data cleaning.

1.1 Cluster size

[Figure: cluster scale]

The platform currently has two clusters in two remote data centers, with more than 2,000 physical machine nodes, and uses k8s namespaces, labels, and taints to achieve business isolation and preliminary compute-load isolation. There are currently about 1,000 online real-time applications, and recently the platform has mainly been supporting the rollout of Flink SQL tasks.

1.2 Platform architecture

[Figure: overall architecture of Vipshop's real-time computing platform]

  • The figure above shows the overall architecture of Vipshop's real-time computing platform.
  • The bottom layer is the resource scheduling layer for compute task nodes; tasks actually run on k8s in Deployment mode. Although the platform also supports YARN scheduling, YARN shares resources with batch tasks, so mainstream tasks still run on k8s.
  • The storage layer supports VMS (the company's internal Kafka real-time data), binlog-based VDP data, and native Kafka as the message bus. State is stored on HDFS, and result data is mainly written to Redis, MySQL, HBase, Kudu, ClickHouse, etc.
  • At the computing engine layer, the platform supports containerized Flink, Spark, and Storm as the mainstream frameworks and provides framework packages and components. Each framework supports several image versions to meet different business needs.
  • The platform layer provides job configuration, scheduling, version management, container monitoring, job monitoring, alerting, logs, and other functions; it provides multi-tenant resource management (quotas, label management) and Kafka monitoring. Before Flink 1.11, the platform's self-built metadata management system managed the Flink SQL schemas; starting from 1.11, it integrates with the company's metadata management system through the Hive Metastore.

  • The top layer is the application layer for each business.

2. Flink containerization practice

2.1 Containerization practice

[Figure: Flink containerized architecture on k8s]

The figure above shows the containerized architecture of the real-time platform's Flink deployment. Flink containerization is deployed based on the standalone mode.

  • The deployment has three roles: client, JobManager, and TaskManager, each controlled by its own Deployment.
  • Users upload task jar packages, configurations, etc. through the platform, and they are stored on HDFS. The configurations and dependencies maintained by the platform are also stored on HDFS; when a pod starts, initialization operations such as pulling them are performed.
  • The main process of the client is an agent developed in Go. When the client starts, it first checks the cluster status; once the cluster is ready, it pulls the jar package from HDFS and submits the task to the Flink cluster. The client's other main functions are monitoring task status and performing operations such as savepoints.
  • Container metrics are collected and written to M3 through the smart-agent deployed on each physical machine, and metrics are also written to Prometheus through the interface exposed by Flink, combined with Grafana for display. Similarly, through the vfilebeat deployed on each physical machine, the collected mounted logs are written to ES, and log retrieval is available in Dragonfly.

■ Flink platformization

In the course of practice, platformization work was done based on specific scenarios and ease-of-use considerations.

  • The platform's task configuration is decoupled from images, Flink configuration, and custom components. At this stage, the platform supports versions 1.7, 1.9, 1.11, and 1.12.
  • The platform supports pipeline compilation or upload of jars, job configuration, alert configuration, lifecycle management, etc., thereby reducing users' development costs.
  • The platform has developed container-level, page-based functions such as flame graphs for tuning and diagnosis, as well as the ability to log in to the container, to support users in job diagnosis.

■ Flink stability

In the process of application deployment and operation, exceptions inevitably occur. The following are the strategies the platform uses to ensure task stability after an abnormal situation occurs.

  • The health and availability of a pod are detected by livenessProbe and readinessProbe, and the pod's restart policy is specified as well.
  • When a Flink task is abnormal:

    1. Flink's native restart strategy and failover mechanism serve as the first layer of guarantee.
    2. In the client, Flink's status is monitored regularly, and the latest checkpoint address is updated into the client's own cache, reported to the platform, and persisted in MySQL. When Flink can no longer restart on its own, the client resubmits the task from the latest successful checkpoint, as the second layer of guarantee. Once checkpoints are persisted in MySQL at this layer, Flink's HA mechanism is no longer used, removing the dependency on the ZooKeeper component.
    3. When the first two layers cannot restart the task, or the cluster is abnormal, the platform automatically pulls up a new cluster from the latest checkpoint persisted in MySQL and submits the task, as the third layer of guarantee.

  • Data center disaster recovery:

    • Both the user's jar packages and checkpoints are stored on dual HDFS clusters in separate data centers
    • Dual clusters in two remote data centers

2.2 Kafka monitoring solution

Kafka monitoring is a relatively important part of our task monitoring. The overall monitoring flow is as follows.

[Figure: Kafka monitoring flow]

The platform provides configuration for monitoring Kafka lag (accumulation), message consumption, and so on. The user's Kafka monitoring configuration is pulled from MySQL, Kafka is monitored through JMX, and the metrics are written to a downstream Kafka; another Flink task then monitors them in real time and writes the data to CK at the same time, so that it can be shown to users.
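A rough sketch of that monitoring Flink task, assuming hypothetical topic and column names; the 'clickhouse' connector below stands in for the platform's internal connector, which is not public:

-- Hypothetical lag records produced by the JMX-based collector into a downstream topic.
CREATE TABLE kafka_lag (
  cluster  STRING,
  topic    STRING,
  group_id STRING,
  lag      BIGINT,
  ts       TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'monitor-kafka-lag',
  'properties.bootstrap.servers' = 'kafka-broker:9092',
  'properties.group.id' = 'lag-monitor',
  'format' = 'json'
);

CREATE TABLE ck_lag_metrics (
  cluster  STRING,
  topic    STRING,
  group_id STRING,
  lag      BIGINT,
  ts       TIMESTAMP(3)
) WITH (
  'connector' = 'clickhouse'  -- assumption: the platform's in-house CK connector
);

-- The monitoring job simply relays lag records into CK for display.
INSERT INTO ck_lag_metrics
SELECT cluster, topic, group_id, lag, ts FROM kafka_lag;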

3. Flink SQL platform construction

After Flink containerization on k8s was implemented, publishing Flink API applications became convenient, but it was still not convenient enough for Flink SQL tasks. Therefore, the platform provides a more convenient one-stop development platform with online editing and publishing, SQL management, and so on.

3.1 Flink SQL solution

[Figure: Flink SQL solution architecture]

The platform's Flink SQL solution is shown in the figure above; the task publishing system is completely decoupled from the metadata management system.

■ Flink SQL task publishing platform

In the course of practice, platformization work was done with ease of use in mind. The main operation interfaces are shown in the figures below:

  • Flink SQL version management, syntax validation, topology graph management, etc.;
  • General and task-level UDF management, with support for user-defined UDFs;
  • A parameterized configuration interface that makes it easy for users to bring tasks online.

[Figures: Flink SQL task publishing interface]

■ Metadata management

Before 1.11, the platform built its own metadata management system, UDM: MySQL stored the Kafka, Redis, and other schemas, and custom catalogs connected Flink and UDM to achieve metadata management. Since 1.11, as Flink's Hive integration has gradually matured, the platform has restructured the Flink SQL framework: it deploys a SQL-gateway service and, in between, calls the SQL-client jar package it maintains itself to connect with the offline metadata, unifying real-time and offline metadata and laying the groundwork for subsequent stream-batch unification. The Flink table creation interface in the metadata management system is shown below. The metadata of a Flink table is created and persisted in Hive, and when a Flink SQL job starts, it reads the schema information of the corresponding table from Hive.

[Figure: Flink table creation interface in the metadata management system]
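As a rough illustration of the Hive Metastore integration (the catalog name and configuration path are hypothetical), registering a Hive catalog in Flink SQL makes the tables persisted in Hive directly visible to jobs:

-- Register a Hive catalog (Flink 1.11+ syntax); names and paths are hypothetical.
CREATE CATALOG vip_hive WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/opt/hive-conf'
);

USE CATALOG vip_hive;

-- Tables created and persisted in Hive can now be read directly,
-- with their schemas loaded from the metastore at job startup.
SHOW TABLES;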

3.2 Flink SQL related practices

The platform integrates the officially supported connectors and develops the ones the community does not provide, and decouples images from connector, format, and other related dependencies, allowing quick updates and iteration.

■ Flink SQL related practices

[Figure: Flink SQL practice overview]

  • At the connector layer, the platform currently supports the officially provided connectors and has built internal connectors such as Redis, Kudu, ClickHouse, VMS, and VDP. The platform has also built an internal pb format to support reading protobuf-cleaned data in real time, and internal catalogs such as Kudu and VDP to support reading the related schemas directly, without creating DDL.
  • The platform layer work is mainly on UDFs, tuning of common operating parameters, and the Hadoop 3 upgrade.
  • The runtime layer mainly supports topology-graph execution-plan modification, keyBy cache optimization for dimension table association, etc.

■ Topology graph execution plan modification

Because the parallelism of the stream graph generated from SQL cannot be modified at this stage, the platform provides a modifiable topology preview for adjusting related parameters. The platform hands the parsed Flink SQL execution-plan JSON to users, uses uids to guarantee operator uniqueness, and lets users modify each operator's parallelism, chain strategy, etc., giving users a way to solve backpressure problems. For example, in the ClickHouse sink's low-concurrency, large-batch scenario, we support modifying the ClickHouse sink parallelism: with source parallelism = 72 and sink parallelism = 24, the ClickHouse sink TPS is increased.

[Figure: topology preview and parameter modification interface]

■ Dimension table association keyBy cache optimization

For dimension table association, to reduce the number of IO requests and the read pressure on the dimension table database, and thereby reduce latency and improve throughput, there are several measures (a configuration sketch follows this list):

  • When the dimension table's data volume is small, the full dimension table data is cached locally and refreshed under TTL control. This greatly reduces the number of IO requests but requires more memory.
  • When the dimension table's data volume is large, async IO and an LRU cache strategy are used, with TTL and size controlling the expiry time and size of the cached data. This increases throughput and reduces the database's read pressure (see the sketch below).
  • When the dimension table's data volume is large and the main stream's QPS is high, the key of the dimension table join can be used as a hash condition to partition the data, i.e. the partition strategy in the calc node is hash, so that each downstream subtask's dimension table data is independent. This both increases the cache hit rate and reduces memory usage.

[Figure: operator chain before optimization]

Before optimization, the dimension table LookupJoin operator and the normal operators are chained together.

[Figure: operator chain after keyBy optimization]

After optimization, the LookupJoin operator and the normal operators are no longer chained together in the dimension table association, and the join key is used as the key of the hash strategy. For example, for a dimension table with 30 million rows and 10 TM nodes, before the optimization each node needed to cache all 30 million rows, 30 million × 10 = 300 million in total. After the keyBy optimization, each TM node only needs to cache 30 million / 10 = 3 million rows, and the total cached data is only 30 million, greatly reducing the cache volume.

■ Dimension table association delayed join

In many business scenarios of dimension table association, the main-stream data arrives and goes through the join before the corresponding dimension table data has been written, so the join may fail to match. Therefore, to ensure data correctness, the unmatched data is cached and a delayed join is performed.

The simplest way is to set a retry count and retry interval in the dimension table association function. This increases the latency of the entire stream, but it solves the problem when the main stream's QPS is not high.

Another way is to add a delayed-join operator: when the join against the dimension table does not match, the record is cached first, and the join is retried according to the configured retry count and retry interval.

4. Application cases

4.1 Real-time data warehouse

■ Real-time data warehousing

[Figure: real-time data warehousing flow]

  • After the first-level Kafka of the traffic data is cleaned in real time, it is written to the second-level cleaned Kafka, mainly in protobuf format, and then written into a Hive 5-minute table through Flink SQL for subsequent quasi-real-time ETL, speeding up the readiness of the ODS-layer data sources.
  • The data of the MySQL business databases is parsed by VDP into a binlog CDC message stream and then written into the Hive 5-minute table through Flink SQL.
  • Business systems produce business Kafka message streams through the VMS API, which are parsed through Flink SQL and written into the Hive 5-minute table; string, json, csv, and other message formats are supported.
  • Using Flink SQL for streaming data warehousing is very convenient, and version 1.12 already supports automatic merging of small files, which solves that pain point.
  • We customize the partition commit policy: when the current partition is ready, the real-time platform's partition-commit API is called, and offline scheduling checks through that API whether the partition is ready (see the sketch after this list).
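A minimal sketch of such a warehousing job, assuming hypothetical table and field names; the TBLPROPERTIES are Flink's Hive streaming-sink options, and a custom commit policy would be plugged in through sink.partition-commit.policy.kind = 'custom' plus a policy class:

-- Hive dialect for the target 5-minute table.
SET table.sql-dialect=hive;
CREATE TABLE ods_traffic_5min (
  mid     STRING,
  page_id STRING
) PARTITIONED BY (dt STRING, hm STRING) STORED AS ORC TBLPROPERTIES (
  'partition.time-extractor.timestamp-pattern' = '$dt $hm:00',
  'sink.partition-commit.trigger' = 'partition-time',
  'sink.partition-commit.delay' = '5 min',
  -- 'custom' plus sink.partition-commit.policy.class would plug in the platform's own policy
  'sink.partition-commit.policy.kind' = 'metastore,success-file'
);

-- Back to the default dialect; kafka_traffic_cleaned is a hypothetical source table
-- assumed to declare a watermark on event_time.
SET table.sql-dialect=default;
INSERT INTO ods_traffic_5min
SELECT
  mid,
  page_id,
  DATE_FORMAT(event_time, 'yyyy-MM-dd'),
  DATE_FORMAT(event_time, 'HH:mm')  -- rounding to 5-minute buckets omitted for brevity
FROM kafka_traffic_cleaned;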

After adopting the unified Flink SQL warehousing solution, we obtained the following benefits: it solves the instability of the previous Flume solution, and users can self-serve warehousing, which greatly reduces the maintenance cost of warehousing tasks. It also improved the timeliness of the offline warehouse, from hourly granularity down to 5-minute-granularity warehousing.

■ Real-time index calculation

[Figure: real-time index calculation data flow]

  • After real-time applications consume the cleaned Kafka, records are associated through Redis dimension tables, APIs, etc.; UV is then computed incrementally through Flink windows and persisted to HBase (see the sketch after this list).
  • After real-time applications consume the VDP message stream and associate it through Redis dimension tables, APIs, etc., indicators such as sales are computed through Flink SQL and upserted into Kudu, which facilitates batch queries by range partition; finally, a data service provides the results to the real-time dashboards.
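A minimal sketch of the incremental UV computation (table, field, and sink names are hypothetical):

-- Incremental UV per 5-minute tumbling window, persisted to an HBase-backed sink table.
-- cleaned_traffic is assumed to declare a watermark on event_time.
INSERT INTO hbase_uv_sink
SELECT
  DATE_FORMAT(TUMBLE_START(event_time, INTERVAL '5' MINUTE), 'yyyy-MM-dd HH:mm') AS window_start,
  COUNT(DISTINCT mid) AS uv
FROM cleaned_traffic
GROUP BY TUMBLE(event_time, INTERVAL '5' MINUTE);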

In the past, index calculation usually used Storm and required custom development through the API; calculation calibers changed rapidly, while the modification and release cycle was slow. After adopting this Flink solution, the calculation logic moved into Flink SQL: switching to Flink SQL allows quick modification and quick launch, and reduces maintenance costs.

■ Real-time and offline integrated ETL data integration

[Figure: real-time and offline integrated ETL architecture]

Flink SQL has continued to strengthen its dimension table join capability in recent versions: it can associate in real time not only dimension table data in databases but also dimension table data in Hive and Kafka, flexibly meeting different workload and timeliness requirements.

Based on Flink's powerful streaming ETL capability, we can do data access and data conversion uniformly in the real-time layer, and then feed the detail-layer data back to the offline data warehouse.

We introduced the HyperLogLog (hereinafter HLL) implementation used by Presto into a Spark UDAF function, enabling HLL objects to interoperate between the Spark SQL and Presto engines. For example, HLL objects generated by Spark SQL through the prepare function can be merge-queried not only in Spark SQL but also in Presto. The specific flow is as follows:

[Figure: HLL object interoperation between Spark SQL and Presto]

Example of approximate UV calculation:

Step 1: Spark SQL generates HLL object

insert overwrite table dws_goods_uv partition (dt='${dt}', hm='${hm}')
select goods_id, estimate_prepare(mid) as pre_hll
from dwd_table_goods
where dt = '${dt}' and hm = '${hm}'
group by goods_id

Step 2: Spark SQL merges the HLL objects of the goods_id dimension into the brand dimension

insert overwrite table dws_brand_uv partition (dt='${dt}', hm='${hm}')
select B.brand_id, estimate_merge(pre_hll) as merge_hll
from dws_goods_uv A
left join dim_table_brand_goods B on A.goods_id = B.goods_id
where A.dt = '${dt}' and A.hm = '${hm}'
group by B.brand_id

Step 3: Spark SQL queries the UV of the brand dimension

select brand_id, estimate_compute(merge_hll) as uv from dws_brand_uv where dt = '${dt}'

Step 4: Presto merge-queries the HLL objects generated by Spark

select brand_id, cardinality(merge(cast(merge_hll as HyperLogLog))) as uv from dws_brand_uv group by brand_id

Therefore, based on this real-time and offline integrated ETL data integration architecture, the benefits we obtain are:

  • Unify basic public data sources;
  • Improve the timeliness of offline data warehouse;
  • Reduce the maintenance cost of components and links.

4.2 Experimental platform (Flink real-time data into OLAP)

Vipshop's experimentation platform is an integrated platform that provides A/B test effect analysis over massive data through configurable multi-dimensional analysis and drill-down analysis. An experiment consists of a stream of traffic (such as user requests) and the modifications applied to that stream for comparison. The experimentation platform has low-latency, low-response-time, and very-large-scale (tens of billions of rows) requirements for massive-data queries. The overall data architecture is as follows:

[Figure: experimentation platform overall data architecture]

Kafka data is cleaned, parsed, expanded, etc. through Flink SQL, product attributes are associated through the Redis dimension table, and the results are written to ClickHouse through a distributed table and then queried ad hoc through the data service. The business data flow is as follows:

[Figure: experimentation platform business data flow]

Through the Flink SQL Redis connector, we support Redis sink and source (dimension table association) operations, which makes it easy to read and write Redis and implement dimension table association; a cache can be configured for the association, which greatly improves the application's TPS. Flink SQL implements the real-time data pipeline and finally sinks the large wide table into CK, sharded by murmurHash3_64 on a certain field granularity so that all the data of the same user lands on the same shard node group. This turns joins between large CK tables into joins between local tables, reducing data shuffling and improving join query efficiency. A DDL sketch follows.
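The sharding idea sketched in ClickHouse DDL (database, cluster, and column names are hypothetical):

-- Local table on each shard (engine details simplified).
CREATE TABLE exp.wide_table_local (
  mid        String,
  exp_id     UInt32,
  event_time DateTime
) ENGINE = MergeTree()
ORDER BY (exp_id, mid);

-- Distributed table: murmurHash3_64(mid) routes all rows of one user to the same shard,
-- so joins between wide tables sharded the same way run as local joins.
CREATE TABLE exp.wide_table AS exp.wide_table_local
ENGINE = Distributed('exp_cluster', 'exp', 'wide_table_local', murmurHash3_64(mid));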

5. Future plans

5.1 Improve the usability of Flink SQL

At present, debugging Flink SQL is inconvenient, and for offline Hive users there is a certain barrier to entry, for example manually configuring Kafka monitoring and performing task pressure testing and tuning. Lowering users' barrier to entry as much as possible is a fairly big challenge. In the future, we will consider adding intelligent monitoring that tells users about problems in their current tasks, automating as much as possible and giving users optimization suggestions.

5.2 Implementation of the Data Lake CDC Analysis Plan

At present, our VDP binlog message stream is written to the Hive ODS layer through Flink SQL to speed up the readiness of the ODS-layer data sources, but it generates a large number of duplicate messages that have to be deduplicated and merged. We will consider a Flink + data lake CDC warehousing solution for incremental warehousing. In addition, the Kafka message streams after order widening, and the aggregation results, require very strong real-time upsert capability. Currently we mainly use Kudu, but the Kudu cluster is relatively independent and niche, and its maintenance cost is high, so we will investigate the incremental upsert capability of data lakes to replace the Kudu incremental upsert scenario.
