At the recent joint Apache SeaTunnel & Kyuubi Meetup, Yang Hua (head of the T3 Mobility big data platform and an Apache Kyuubi committer) and Li Xinkai (senior big data engineer at T3 Mobility) shared the latest practices and applications of Apache Kyuubi (Incubating) at T3 Mobility, covering the design of the Flink SQL Engine based on Kyuubi, the integration of Kyuubi with Apache Linkis, and how both have been put into production at T3 Mobility.

Status of JDBC in Flink

First, let's look at the status of JDBC support in the Apache Flink community. Flink started from a dataflow model and later brought batch processing into that model, forming the concept of "stream-batch unification".

Flink's batch processing has been maturing gradually, but compared with established offline computing engines such as Hive and Spark SQL it is still not fully mature. One sign of this is the lack of support for the JDBC specification. The Flink community has made some efforts here, and I would like to share two open source projects from Ververica, the German company behind Flink's commercial development: one called Flink SQL Gateway and the other called Flink JDBC Driver.

Together, these two projects enable Flink to support JDBC, but both have been inactive for more than a year; the last contributions date back to 2020. Judging from Ververica's current situation, it is unlikely that these two projects will become active again.

The following is the design and implementation of Flink SQL Gateway and Flink JDBC Driver, which I sketched from the source code: roughly, the two components work together and interact with Flink to provide JDBC support. The architecture can be divided into three layers. At the bottom, the JDBC Driver talks over HTTP to the RESTful API exposed by the Flink SQL Gateway process. Flink SQL Gateway can be regarded as a service wrapped around flink-client; it introduces the concept of a Session internally and provides a SessionManager, so it has some ability to share or cache resources. The main entry point is SQLGatewayEndpoint, which internally provides many implementations of Operation.

These operations fall into two categories. One is local operations, which manage metadata in memory, such as catalog data. The other is JobOperations, which interact with the Flink cluster; there are two kinds, Insert and Select, both of which submit jobs to the Flink cluster.
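To make the three-layer flow concrete, here is a rough Python sketch of how a client could drive the gateway's RESTful API directly. The endpoint paths and request/response fields (`/v1/sessions`, `planner`, `execution_type`, `session_id`) are assumptions based on my recollection of the Ververica project's documentation and may not match a given gateway version; only the request-body helper is exercised below.

```python
import json
import urllib.request

# Hypothetical gateway address; adjust to your deployment.
GATEWAY = "http://localhost:8083"

def session_request(planner: str = "blink", execution_type: str = "batch") -> dict:
    """Build the JSON body for creating a gateway session (assumed fields)."""
    return {"planner": planner, "execution_type": execution_type}

def _post(path: str, payload: dict) -> dict:
    """POST a JSON payload to the gateway and decode the JSON response."""
    req = urllib.request.Request(
        GATEWAY + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def create_session() -> str:
    """Open a gateway session and return its id (assumed response field)."""
    return _post("/v1/sessions", session_request())["session_id"]

def submit_statement(session_id: str, sql: str) -> dict:
    """Submit a SQL statement within an existing session."""
    return _post(f"/v1/sessions/{session_id}/statements", {"statement": sql})
```

The Flink JDBC Driver hides exactly this kind of HTTP exchange behind the standard `java.sql` interfaces.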

As we can see, these two projects are designed and implemented specifically for Flink. Although Flink proposed the concept of stream-batch unification, as far as we know few companies use Flink alone for all of their streaming and batch workloads; most still run Spark and Flink side by side. So for general capabilities such as multi-tenancy and JDBC support, we hoped for an abstraction that could serve both Flink and Spark, and more engines in the future. Our conclusion was to build the Flink SQL Engine integration directly on Kyuubi.

Design and Implementation of Flink SQL Engine

The following introduces the design and implementation of the Flink SQL Engine. This is a high-level component interaction diagram I drew, which again can be divided into three layers. At the bottom are JDBC or REST clients, which interact directly with the Kyuubi Server. The Kyuubi Server itself has two layers: the frontend layer, which provides frontends for different protocols, and the backend layer behind it, which in turn interacts with the Engine's frontend.

The Engine in Kyuubi is likewise split into two layers, frontend and backend. In the Flink Engine implementation, we provide a Flink Thrift Frontend Service at the frontend layer and a Flink SQL Backend Service at the backend layer, and the latter interacts with the Flink SQL Manager and the Operation Manager.

Within the overall Kyuubi ecosystem, the Kyuubi Flink Engine sits in the engine column of the architecture diagram, where we can also see the community's upcoming support for Trino (formerly Presto SQL). Overall, Kyuubi takes a very open attitude toward other technologies: it does not require many changes to other components in order to fit them into the system, which is particularly friendly to heterogeneous architectures in enterprise technology selection.

Flink SQL QuickStart DEMO

Next, let's look at the Flink SQL QuickStart demo. The first step is to build a binary release package from Kyuubi's source code, because releases before 1.5.0 did not include the Flink SQL Engine. (Kyuubi 1.5.0 has since been released with beta support for the Flink SQL Engine; see https://kyuubi.apache.org/release/1.5.0-incubating.html for details.)

This demo runs natively on a Mac, so we need to add the Hadoop classpath to the environment and start a local Flink cluster. The next step is to start the Kyuubi Server, and then connect to the Flink Engine using the Beeline that ships with Kyuubi. Kyuubi will automatically launch a Flink Engine, which is a Java process.
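In Kyuubi 1.5.0 the engine type is chosen through configuration; a minimal sketch of `kyuubi-defaults.conf` (the property name is from Kyuubi 1.5, and the setup as a whole is illustrative):

```
# conf/kyuubi-defaults.conf
# Ask Kyuubi to launch a Flink SQL engine instead of the default Spark engine.
kyuubi.engine.type=FLINK_SQL
```

Beeline then connects to the Kyuubi frontend, which listens on port 10009 by default, e.g. `bin/beeline -u 'jdbc:hive2://localhost:10009'`.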

After entering Beeline's interactive command line and selecting a database, you can perform DDL and DML operations. In our example we create a table and then run an Insert statement. When execution finishes, we open Flink's Web UI and can see that an Insert job was submitted and has completed.
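Concretely, the demo amounts to a couple of statements of this shape (the table names and columns are made up for illustration; `datagen` and `blackhole` are Flink's built-in test connectors):

```sql
-- Create a source and a sink table (hypothetical demo tables).
CREATE TABLE t_src (id INT, msg STRING)
  WITH ('connector' = 'datagen', 'number-of-rows' = '10');
CREATE TABLE t_sink (id INT, msg STRING)
  WITH ('connector' = 'blackhole');

-- Submits an Insert job to the Flink cluster; it shows up in the Flink Web UI.
INSERT INTO t_sink SELECT id, msg FROM t_src;
```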

This demo is fairly simple. Recently a community member contributed QuickStart support and documentation for the on-YARN mode; interested readers can follow that document to learn about on-YARN submission.

The implementation of the Flink SQL Engine owes a great deal to the strong support of Flink community contributors. I would like to thank Jiang Xiaofeng from Alibaba's Blink team and Lin Xiaobai from NetEase's game real-time team, who have been active in the community and contributed many PRs.

Here is a rough list of the functions the Flink SQL Engine currently provides:

Conventional DDL and DML operations are basically supported
The Get- and Show-related methods of the JDBC specification are supported
Methods for setting or resetting Flink properties are supported
UDFs are supported
The currently supported deploy mode is Flink Session mode (Standalone / on YARN)

Flink SQL Engine Outlook
Looking ahead, the Flink SQL Engine's deploy-mode support will change significantly: the plan is to support Application mode and Session mode, the most widely used in the industry, while the relatively mature Per-Job mode will be dropped. Kyuubi's Flink Engine should focus mainly on Application mode in the future; Session mode will still be supported, but it is not the first choice.

Other plans include support for Flink SQL Engine on YARN (Application mode), support for Flink SQL Engine on Kubernetes, enhancement of Kyuubi's engine share levels, and usability improvements such as managing JARs within a Session. Some of these will need your feedback after you try the engine, and you are welcome to participate and contribute.

Why Kyuubi

The following introduces some application scenarios of Kyuubi at T3 Mobility.

T3 Mobility is a platform driven by the Internet of Vehicles (IoV) and is primarily built on IoV data. Given the diversity of that data, T3 Mobility has built an enterprise-level data lake platform based on Apache Hudi, and on top of it a series of platforms for BI, task scheduling, machine learning, data quality, and so on, to support the business. As the business grows, there are more and more platforms, unified management of them becomes increasingly complicated, and the experience for business users deteriorates. After a round of research and selection, we chose WeBank's open source DSS (DataSphere Studio) as the one-stop interactive management platform for data applications and customized it for the company's actual scenarios.

The figure below shows the architecture before Kyuubi was introduced alongside DSS. Data is ingested through Kafka and Canal, then landed in object storage in Hudi format on the data lake. Resources are orchestrated by YARN, and the computing engines are mainly Spark, Hive, and Flink. Interaction with the computing engines is managed uniformly through the computing middleware Linkis, which is also connected to DSS. On top of Linkis we built the BI platform, data map, data development, machine learning, and so on, with DSS as the one-stop portal for unified access management.

This architecture ran into problems in practice. One is cross-storage access: data is spread across OBS object storage, Hudi-format storage, and other mature stores such as ClickHouse and MongoDB, so developers have to write all kinds of code for correlation analysis or ETL imports, and Linkis is rather limited in solving this. Another is SQL syntax inconsistency: for example, Hive and native Spark SQL do not support operations such as upsert, update, and delete, and MongoDB and ClickHouse each have their own syntax, so development and conversion costs are high. In addition, Linkis is strongly coupled to specific Hive and Spark versions: upgrading the Spark version means modifying a series of source files and recompiling, which makes upgrades difficult. Meanwhile, the Spark engine in Linkis has incomplete support for features such as cluster deploy mode, AQE, and dynamic resource allocation, and the cost of adapting it is considerable.

Against this background, we introduced Apache Kyuubi. T3 Mobility had already been using Kyuubi for a long time, so we decided to investigate whether Kyuubi and Linkis could be connected. Kyuubi is a unified Thrift JDBC service that already integrates the Spark engine and the Flink engine, and the community has also been working on Trino integration, so it can manage multiple engines and cover BI, ETL, and ad-hoc scenarios. This Apache Incubator project provides a standardized interface for enterprise-level data lake exploration, giving users the ability to work across the whole data lake and process big data as easily as ordinary data; it is a serverless service.

Comparing Kyuubi with Linkis and Hive on Spark, as shown in the following table: Hive and Kyuubi both expose JDBC interfaces, while Linkis provides HTTP REST APIs. In terms of syntax, Linkis is mainly based on the engine's own syntax, such as Spark SQL or HiveQL; Kyuubi is mainly Spark SQL or Flink SQL; and Hive has its own HQL.

For SQL parsing, Hive does it on the server side, while Linkis and Kyuubi both do it on the engine side. For task submission, Hive splits work across multiple RemoteDriver submissions, Linkis relies on distributed thread scheduling on the server side, and Kyuubi has its own resource isolation mechanism, isolating resources through the USER, GROUP, or CONNECTION share levels.
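To illustrate what these share levels mean, here is a toy model (not Kyuubi's actual implementation, which resolves engines through ZooKeeper namespaces): the share level essentially decides the key under which an engine is looked up and reused.

```python
# Toy model of Kyuubi-style engine sharing; keys are purely illustrative.

def engine_key(share_level: str, user: str, group: str, session_id: str) -> str:
    """Return the lookup key for an engine under a given share level.

    CONNECTION: every connection gets its own engine (no sharing).
    USER:       all connections of one user share an engine.
    GROUP:      all users in one group share an engine.
    """
    if share_level == "CONNECTION":
        return f"connection/{user}/{session_id}"
    if share_level == "USER":
        return f"user/{user}"
    if share_level == "GROUP":
        return f"group/{group}"
    raise ValueError(f"unknown share level: {share_level}")

def resolve_engine(pool: dict, share_level: str, user: str,
                   group: str, session_id: str) -> str:
    """Reuse the engine registered under the key, or 'launch' a new one."""
    key = engine_key(share_level, user, group, session_id)
    if key not in pool:
        pool[key] = f"engine-for-{key}"  # stand-in for launching a real engine
    return pool[key]
```

With USER level, two sessions of the same user resolve to the same engine; with CONNECTION level, each session gets its own.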

Linkis is bound to specific Spark and Hive versions, while Kyuubi is more flexible and supports multiple versions; its server side and engine side are separated, which makes upgrades and iteration easier for us. In terms of computing resources, Linkis manages engines itself and locks resources while they are in use, whereas Kyuubi relies on YARN and Kubernetes for resource scheduling and manages resources at the Engine level on top of that, which is much more flexible than Linkis.

The practice of integrating Kyuubi into Linkis

The following describes the integration of Linkis and Kyuubi. Linkis supports adding custom engines and offers several engine types; we chose ComputationExecutor and implemented its methods, because it is the commonly used interactive engine Executor: it can handle interactive execution tasks and provides interactive capabilities such as status query and task kill. The Kyuubi engine is an interactive engine, so implementing this Executor on top of it is the natural fit.

To implement such an engine, you mainly need to introduce the relevant Linkis dependency packages. For details, please refer to Linkis' official document "How to quickly implement a new underlying engine".

We implemented the following modules: KyuubiEngineConnPlugin, the connection entry point for starting the engine; KyuubiEngineConnFactory, the overall logic for managing and starting an engine; KyuubiEngineLaunchBuilder, which encapsulates engine-side management and assembles the startup command; and KyuubiExecutor, the actual execution unit, which interacts directly with the Kyuubi Server to carry out the computing logic. These are the main building blocks; the general code structure is shown in the following figure.

The interaction between Linkis and Kyuubi at startup mainly goes through Linkis' Gateway, which forwards to the engine management module; the engine management module starts the engine connection manager, which in turn launches the Kyuubi engine. The Kyuubi engine then establishes a session with the Kyuubi Server and interacts with it for queries or DDL operations. Returned results are stored in a temporary HDFS directory; the engine returns the query result to the Gateway, and the Gateway returns it to the user's actual client.


After introducing Kyuubi, the overall architecture is as shown below. The main change is in the computing middleware, which is now provided jointly by Kyuubi and Linkis: the SQL modules are all managed by Kyuubi, the other modules by Linkis, and both are connected to task scheduling and the computing engines. In the future, we hope to also run tasks such as Scala or Flink jobs through Kyuubi.

Usage scenarios of Kyuubi in the one-stop platform

In Kyuubi's actual usage scenarios in the one-stop platform, you can see the DSS data development module. We directly added a Kyuubi type, which connects to the Kyuubi service; with it, data development can be done through SQL statements, realizing one-stop development and CI/CD management.

A developed script can be turned into a task. The Kyuubi type is integrated into task orchestration, so Kyuubi can be used directly as a component to orchestrate tasks and associate already-written scripts. These scripts have a publishing function that is connected with DSS: publishing on DSS is equivalent to a SQL module, that is, finished scripts are published to the Kyuubi data source and then to DS, forming a complete CI/CD process.

In the past, DSS provided a series of Web UI monitoring and management pages for Linkis, but Kyuubi had none, so we strengthened Kyuubi's background management functions on this basis and developed a separate Kyuubi Web Server module for unified management and monitoring of user operations, mainly monitoring the number of connections, the number of engines, and the JVM on the Kyuubi Server side.

There is also management of users' connected Sessions. An interface directly calls the server-side Session API to obtain Session status and then saves it to MySQL for persistent storage. The server-side API can also be called manually to close a Session.
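Such a collector might look roughly like the sketch below. The REST path and response fields are hypothetical (T3's web server module is in-house, and newer Kyuubi versions expose their own REST API), and the MySQL write is reduced to a stub; only the pure aggregation step is exercised below.

```python
import json
import urllib.request

# Hypothetical endpoint of the in-house Kyuubi web server module; the real
# path, port, and payload depend on your deployment.
SESSIONS_URL = "http://kyuubi-server:10099/api/v1/sessions"

def fetch_sessions(url: str = SESSIONS_URL) -> list:
    """Poll the server-side Session API and return the decoded session list."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def summarize(sessions: list) -> dict:
    """Aggregate session records into per-user open-session counts."""
    counts: dict = {}
    for s in sessions:
        counts[s["user"]] = counts.get(s["user"], 0) + 1
    return counts

def persist(counts: dict) -> None:
    """Stand-in for writing the snapshot to MySQL."""
    for user, n in counts.items():
        print(f"user={user} open_sessions={n}")
```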

In addition, statements submitted by users can be displayed and managed in a unified way, which makes it easier for administrators to oversee user operations and to trace problems back.

To sum up: by introducing Apache Kyuubi, the T3 Mobility big data platform complements Linkis' functions and integrates code development, business rollout, and the scheduling system into a closed loop for CI/CD management of big data development. This helps business departments bring big-data-related requirements online with a low barrier to entry, relieves the pressure on data development, and brings us a step closer to our goal of a one-stop development platform. We also expect Apache Kyuubi and Linkis, as leaders in computing middleware, to get better and better!

Authors: Yang Hua, Li Xinkai

Video replay and PPT download:

T3 Mobility: Apache Kyuubi Flink SQL Engine design and related practice

Further reading:

Practice of eBay Building Unified Serverless Spark Gateway Based on Apache Kyuubi

In-depth practice of Apache Kyuubi at T3 Mobility

Who is using Apache Kyuubi (Incubating)?

Kyuubi project homepage

Kyuubi code repository

