* Supports 80% of offline jobs, with 10,000+ jobs per day
* 3-6x performance improvement over Hive in most scenarios
* More efficient and stable in multi-tenant, concurrent scenarios
T3 Travel is a smart travel platform driven by the Internet of Vehicles (IoV), with massive and diverse data sources. Because of this diversity, T3 Travel has built an enterprise-level data lake based on Apache Hudi to provide strong business support. For the end users responsible for mining value from data, the platform's technical threshold is another challenge: if the platform's capabilities can be integrated and continuously optimized and iterated, so that users can rely on the most common and versatile technologies such as JDBC and SQL, data productivity will improve further.
T3 Travel chose Apache Kyuubi (hereinafter referred to as Kyuubi), an open source project initiated by NetEase Shufan, to build this capability. At the 2021 China Open Source Annual Conference (COSCon'21), T3 Travel senior big data engineer Li Xinkai explained in detail the reasons for choosing Kyuubi, as well as the value realized through in-depth practice with Kyuubi.
Technical architecture before the introduction of Kyuubi
The entire data lake system of T3 Travel consists of data storage and computation, data query and analysis, and application service layers. Data computation is divided into offline and real-time processing.
Data storage
Data is stored in OBS object storage, formatted as Hudi tables.
Data computation
Offline data processing: Hive on Spark batch processing capabilities are scheduled regularly on Apache DolphinScheduler to undertake all offline data warehouse ETL and data model processing work.
Real-time data processing: a development platform based on the Apache Flink engine has been built to develop and deploy real-time jobs.
Data query and analysis
The OLAP layer mainly serves reports for management and operations personnel; it connects to the reporting platform and must answer queries with low latency while responding quickly to changing needs. Ad hoc queries from data analysts require the OLAP engine to support complex SQL processing and to select data quickly from massive volumes.
Application Service Layer
The data application layer mainly connects to various business systems. Data produced by offline ETL is written into the databases of different businesses to serve downstream consumers.
Pain Points of Existing Architecture
Cross storage
Data is distributed across different stores such as Hudi, ClickHouse, and MongoDB. Correlation analysis across them requires writing code, which raises the threshold and cost of data processing.
SQL is not uniform
Hive does not support operating on Hudi tables with syntax such as upsert, update, and delete. Meanwhile, MongoDB, ClickHouse, and others each have their own syntax, so development and conversion costs are high.
Weak resource management
Hive on Spark and Spark Thrift Server lack a good resource isolation solution and cannot enforce concurrency control based on tenant permissions.
Choosing Apache Kyuubi
Apache Kyuubi is a Thrift JDBC/ODBC service that connects to the Spark engine, supports multi-tenancy and distributed deployment, and can serve enterprise big data scenarios such as ETL and BI reporting. Kyuubi provides a standardized interface for enterprise-level data lake exploration, giving users the ability to mobilize data across the entire data lake ecosystem and to process big data like ordinary data. The project officially entered the Apache Incubator on June 21, 2021. For T3 Travel, Kyuubi serves as Serverless SQL on Lakehouse.
Apache Kyuubi architecture
HiveServer is a widely used big data component. Because the processing efficiency of the traditional MapReduce engine has fallen behind, many users replace the Hive execution engine with Spark. However, to remain compatible with the original MR and Tez engines, Hive keeps its own optimizer, which makes Hive's SQL parsing and optimization lag behind Spark's in most scenarios.
STS (Spark Thrift Server) supports the HiveServer interface and protocol, allowing users to submit SQL jobs directly through the Hive interface. However, STS does not support multi-tenancy, and all Spark SQL queries go through the same Spark driver on a single Spark Thrift node. Concurrency is limited, and any failure brings down all jobs on that node and requires restarting the Spark Thrift Server, so it is a single point of failure.
Comparing Apache Kyuubi with Hive and STS, we found that Kyuubi has many advantages in tenant control, task resource isolation, engine upgrades and integration, and performance. See the figure below for details.
Apache Kyuubi advantages
Apache Kyuubi scenarios at T3 Travel
Ad hoc scenario
Hue integrates Kyuubi in place of Hive to provide services for analysts and big data developers.
We add the following configuration in the hue_safety_valve.ini configuration file:
[notebook]
[[interpreters]]
[[[custom]]]
name=Kyuubi
interface=hiveserver2
[spark]
sql_server_host=Kyuubi Server IP
sql_server_port=Kyuubi Port
Then restart Hue.
ETL scenario
DolphinScheduler (DS) is configured with a Kyuubi data source for offline ETL jobs. Because Kyuubi Server's interface and protocol are fully compatible with HiveServer2, DS only needs Kyuubi configured as a Hive-type entry among its data sources, and SQL tasks can then be submitted directly.
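Because Kyuubi speaks the HiveServer2 Thrift protocol, any HiveServer2-compatible JDBC client (beeline, DS's Hive data source, Hue) can connect simply by pointing at the Kyuubi Server. A minimal sketch of the connection URL; the host name below is a hypothetical placeholder, and 10009 is Kyuubi's default frontend port:

```python
def kyuubi_jdbc_url(host: str, port: int = 10009, database: str = "default") -> str:
    """Build a HiveServer2-style JDBC URL pointing at a Kyuubi Server."""
    return f"jdbc:hive2://{host}:{port}/{database}"

# Hypothetical host; substitute your own Kyuubi Server address.
url = kyuubi_jdbc_url("kyuubi-server.example.com")
print(url)  # jdbc:hive2://kyuubi-server.example.com:10009/default
```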
Kyuubi currently supports 80% of T3 Travel's offline jobs, with more than 10,000 jobs per day.
Federated query scenario
The company uses a variety of data storage systems internally, each addressing its own usage scenarios. In addition to traditional RDBMSs such as MySQL, we also use Apache Kafka for stream and event data, HBase and MongoDB, as well as data lake object storage with Hudi-format data sources.
Traditionally, to associate data from different storage sources, we must extract the data into the same storage medium, such as HDFS, and then perform the join. This kind of data fragmentation greatly complicates correlation analysis. If a unified query engine could query different data sources and join the results directly, it would bring a huge efficiency boost.
Therefore, we use Spark DataSourceV2 to implement cross-storage federated queries with a unified syntax, providing efficient and unified SQL access. The advantages are as follows:
* A single SQL dialect and API
* Unified security control and audit trail
* Unified governance
* The ability to combine data from multiple sources
* Data independence
Based on Spark DataSourceV2, the reader side only needs to define a DefaultSource class and implement the ReadSupport-related interfaces to connect an external data source. Implementing SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportPartitioning, and other related interfaces enables operator push-down. Query predicates can thus be pushed down to data sources such as JDBC, so filtering happens at the source before results are returned to Spark, which reduces data volume and improves query efficiency.
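The push-down idea can be illustrated without Spark: the engine hands filters and required columns to the source, which applies them before returning rows, so less data crosses the boundary. A framework-free Python sketch; the class and method names are illustrative, not Spark's actual ReadSupport / SupportsPushDownFilters interfaces:

```python
class PushdownSource:
    """Toy data source that accepts pushed-down filters and column pruning."""

    def __init__(self, rows):
        self.rows = rows          # pretend this lives in an external store
        self.filters = []
        self.columns = None

    def push_filters(self, *preds):
        self.filters.extend(preds)    # filtering happens at the source

    def prune_columns(self, cols):
        self.columns = cols           # only requested columns are fetched

    def scan(self):
        out = [r for r in self.rows if all(p(r) for p in self.filters)]
        if self.columns:
            out = [{c: r[c] for c in self.columns} for r in out]
        return out

src = PushdownSource([
    {"id": 1, "city": "Nanjing", "fare": 12.0},
    {"id": 2, "city": "Shanghai", "fare": 30.5},
    {"id": 3, "city": "Nanjing", "fare": 8.0},
])
src.push_filters(lambda r: r["city"] == "Nanjing")
src.prune_columns(["id", "fare"])
print(src.scan())  # [{'id': 1, 'fare': 12.0}, {'id': 3, 'fare': 8.0}]
```

Only two pruned rows ever leave the "source", which is exactly the saving that push-down buys over scanning everything into the engine first.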
The existing solution establishes external tables and uses Hive MetaStore to manage the metadata of external data sources, enabling unified permission management over the tables.
For example, a MongoDB table mapping:
CREATE EXTERNAL TABLE mongo_test
USING com.mongodb.spark.sql
OPTIONS (
spark.mongodb.input.uri "mongodb://username:password@IP:PORT/dbname?authSource=admin",
spark.mongodb.input.database "dbname",
spark.mongodb.input.collection "collection_name",
spark.mongodb.input.readPreference.name "secondaryPreferred",
spark.mongodb.input.batchSize "20000"
);
After upgrading to Spark 3.x, which introduces the namespace concept, DataSourceV2 can implement a multiple-catalog mode in the form of plug-ins, which greatly improves the flexibility of federated queries.
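The multiple-catalog mode boils down to resolving the first part of a namespaced identifier such as `mongo.trip_db.orders` against a registry of pluggable catalogs. A toy sketch of that resolution; the catalog and table names are hypothetical, and this is not Spark's actual CatalogPlugin API:

```python
class CatalogRegistry:
    """Toy registry mapping catalog names to pluggable table loaders."""

    def __init__(self):
        self._catalogs = {}

    def register(self, name, loader):
        self._catalogs[name] = loader

    def load_table(self, identifier):
        # identifier: "<catalog>.<namespace>.<table>"
        catalog, namespace, table = identifier.split(".", 2)
        return self._catalogs[catalog](namespace, table)

registry = CatalogRegistry()
registry.register("mongo", lambda ns, t: f"mongo://{ns}/{t}")
registry.register("hudi", lambda ns, t: f"hudi://{ns}/{t}")

print(registry.load_table("mongo.trip_db.orders"))   # mongo://trip_db/orders
print(registry.load_table("hudi.dwd.trip_events"))   # hudi://dwd/trip_events
```

A single SQL statement can then join tables resolved through different catalogs, which is what makes plug-in catalogs so convenient for federated queries.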
Kyuubi performance test
We conducted a test on 500 GB of TPC-DS-generated data, selected some fact and dimension tables, and ran performance stress tests against Hive and Kyuubi respectively. The main focus scenarios were:
* Single-user and multi-user scenarios
* Aggregate function performance comparison
* Join performance comparison
* Single-stage and multi-stage performance comparison
Comparing the stress test results, Kyuubi on the Spark engine outperforms Hive by 3-6x in most scenarios, and is more efficient and stable in multi-tenant, concurrent scenarios.
T3 Travel's improvements and optimizations to Kyuubi
Our improvements and optimizations to Kyuubi mainly include the following aspects:
* Kyuubi Web: an independent web service that monitors and manages Kyuubi Server.
* Kyuubi EventBus: a global event bus.
* Kyuubi Router: a routing module that forwards SQL requests with proprietary syntax to different native JDBC services.
* Kyuubi Spark Engine: modifications to the native Spark engine.
* Kyuubi Lineage: a data lineage analysis service that parses successfully executed SQL, stores the lineage in a graph database, and provides API access.
Kyuubi Web service functions
* Number of currently running SparkContexts and SQL statements
* Status of each Kyuubi Server instance
* Top 20 most time-consuming SQL statements within one day
* Ranking of SQL submitted by users (within one day)
* Display of each user's running SQL and the specific statements
* SQL states: closed, cancelled, waiting, and running; SQL in the waiting or running state can be cancelled
* Management of each tenant engine's queue, resource configuration, and concurrency
* Online viewing and modification of Kyuubi Server and engine configurations
Kyuubi EventBus
A RESTful service is introduced on the server side.
Within the Server application process, the event bus listens for events such as application stop, JDBC session closure, and JDBC operation cancellation. The purpose of introducing the event bus is to let different sub-services within a single application communicate; otherwise, sub-service objects would have to hold instance dependencies on each other, making the service object model very complicated.
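The decoupling described above can be sketched in a few lines: sub-services subscribe to event types, and the bus dispatches posted events, so no service holds a reference to another. This is a generic in-process sketch, not Kyuubi's actual EventBus implementation:

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process event bus: handlers subscribe by event type."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def post(self, event_type, payload=None):
        # Dispatch to every handler registered for this event type.
        for handler in self._handlers[event_type]:
            handler(payload)

bus = EventBus()
closed_sessions = []
# One sub-service records closed JDBC sessions without knowing who posts them.
bus.subscribe("jdbc_session_closed", closed_sessions.append)
bus.post("jdbc_session_closed", "session-42")
print(closed_sessions)  # ['session-42']
```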
Kyuubi Router
A Kyuubi JDBC Router module has been added; JDBC connections are directed to this service first, which forwards them to different services according to established policies. The figure below shows the specific strategy.
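A sketch of the routing idea: inspect the incoming SQL and forward it to a backend by rule. The policies and backend names below are hypothetical illustrations, not T3 Travel's actual routing rules:

```python
def route(sql: str) -> str:
    """Pick a backend service for a SQL statement using simple dialect rules."""
    text = sql.strip().lower()
    if text.startswith(("upsert", "update", "delete")):
        return "kyuubi"        # Hudi mutations go to the Kyuubi Spark engine
    if "clickhouse." in text:
        return "clickhouse"    # dialect-specific SQL to its native JDBC service
    return "kyuubi"            # default: Serverless SQL on the lakehouse

print(route("UPSERT INTO hudi_db.trips VALUES (1)"))  # kyuubi
print(route("SELECT * FROM clickhouse.ads.report"))   # clickhouse
```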
Kyuubi Spark Engine
* Changed the Spark 3.x version of Kyuubi-Spark-Sql-Engine to Spark 2.4.5 to match the cluster version; after subsequent cluster upgrades, we will converge with the community version
* Added a Hudi datasource module, using the Spark datasource plan to query Hudi and improve query efficiency on Hudi tables
* Integrated the Hudi community's update and delete syntax, and added upsert syntax and Hudi table creation statements
Kyuubi Lineage
An SQL lineage analysis function based on ANTLR. Two modes are currently provided. One is scheduled analysis, which parses successfully executed SQL statements within a certain time range and stores the results in a HugeGraph database for systems such as data governance to consume. The other mode provides an API that users can call directly when querying: when the SQL is complicated, they can intuitively clarify their own SQL logic, and then modify and optimize it.
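The core of lineage analysis is pulling target and source tables out of each statement. The real service parses SQL with ANTLR; the regex-based toy below is a simplification for illustration only, handling just `INSERT ... SELECT ... FROM/JOIN`:

```python
import re

def extract_lineage(sql: str) -> dict:
    """Toy lineage extractor: target table and source tables of an INSERT."""
    target = re.search(r"insert\s+(?:into|overwrite\s+table)\s+([\w.]+)",
                       sql, re.I)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    return {"target": target.group(1) if target else None,
            "sources": sorted(set(sources))}

sql = """
INSERT INTO dwd.trip_wide
SELECT o.id, u.name
FROM ods.orders o
JOIN ods.users u ON o.uid = u.id
"""
print(extract_lineage(sql))
# {'target': 'dwd.trip_wide', 'sources': ['ods.orders', 'ods.users']}
```

Persisting such (target, sources) edges per statement is what builds the lineage graph that the service stores in HugeGraph.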
Kyuubi-based solution
Summary
The T3 Travel big data platform, based on Apache Kyuubi 0.8, has unified its data services, greatly simplified the offline data processing pipeline, and met query latency requirements. Later, we will use it to improve data service and query capabilities in more business scenarios. Finally, we would like to thank the Apache Kyuubi community for its support. Our follow-up plan is to upgrade to the new community version to keep pace with the community, and to continue contributing function points developed for T3 Travel's scenarios back to the community for common development. We also hope Apache Kyuubi becomes better and better as the leader of Serverless SQL on Lakehouse!
Author: Li Xinkai, Senior Big Data Engineer, T3 Travel
Kyuubi Homepage:
Kyuubi source code: