Guide:
This article introduces the virtual data warehouse features that NetEase Shufan NDH implements on Impala, including resource grouping, horizontal expansion, hybrid grouping, and time-division multiplexing. These features allow cluster resources to be configured flexibly, balance node load, improve query concurrency, and make full use of node resources.
Continuing from the previous article: a high-performance analytical data warehouse needs more than an excellent execution engine that completes queries as quickly as possible. It must also prevent queries from interfering with each other, for example by competing for compute and IO resources and degrading each other's performance. As mentioned in the previous section, Impala can manage compute resources through resource pools. In practice, however, resource pools alone are not enough: queries from different resource pools can still compete for memory on the same compute node.
1 Basic Concepts
The term "virtual data warehouse" comes from Snowflake's "virtual warehouse" (VW). A virtual data warehouse can scale horizontally and vertically on demand. It is an efficient way to schedule resources and an excellent validation of elastic compute scaling under a storage-compute separation architecture. As shown in the figure below, the Snowflake cluster has two virtual data warehouses, serving BI and ETL users respectively. The BI virtual data warehouse scales in and out horizontally in units to handle the peaks and troughs of report queries, while the ETL virtual data warehouse, which mainly needs raw compute power, scales vertically by changing its size.
The Impala component of NDH has similar capabilities. Before diving in, we introduce two basic concepts grounded in how Impala actually works: the executor group, and the node group introduced to support virtual data warehouses.
Executor group
The following figure is a schematic of Impala executor groups from the CDP documentation. The executor group is the basic unit of Impala's elastic scaling. The user configures an executor group size (XSMALL, SMALL, MEDIUM, or LARGE); if auto scaling is enabled, CDP expands or shrinks the number of Impala executor nodes one group of the specified size at a time.
Executor groups give an Impala cluster the ability to scale horizontally. However, they differ considerably from Snowflake's virtual warehouses. From the documentation, an executor group is transparent to the user: the user cannot use executor groups to divide an Impala cluster into compute units for different purposes, such as the BI and ETL workloads mentioned above. NDH Impala therefore introduces the concept of the node group.
Node group
The impalad nodes of an NDH Impala cluster can be divided into multiple independent groups, which we call node groups. A node group may consist of executors only, or of both coordinator and executor nodes.
The Impala cluster in the figure above contains 3 node groups, each with at least one executor node. In addition, there are 2 coordinator nodes that do not belong to any node group. An independent coordinator can route requests to the executors of any node group, while a coordinator inside a node group can only dispatch requests to the executor nodes of its own group. These two query-routing rules lead to two ways of implementing virtual data warehouses.
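The two routing rules can be sketched as follows. This is a minimal illustrative model, not the actual NDH scheduler: the class, group names, and node names are all made up for the example.

```python
class NodeGroup:
    """A named set of coordinators and executors (illustrative model)."""
    def __init__(self, name, coordinators, executors):
        self.name = name
        self.coordinators = coordinators
        self.executors = executors

def eligible_executors(coordinator, groups, independent_coordinators):
    """Return the executors a coordinator may dispatch fragments to."""
    if coordinator in independent_coordinators:
        # An independent coordinator may use executors from any group.
        return [e for g in groups for e in g.executors]
    for g in groups:
        if coordinator in g.coordinators:
            # A group coordinator only dispatches within its own group.
            return g.executors
    raise ValueError("unknown coordinator")

groups = [
    NodeGroup("grp1", ["c1"], ["e1", "e2"]),
    NodeGroup("grp2", ["c2"], ["e3", "e4"]),
]
print(eligible_executors("c1", groups, ["c0"]))  # only grp1's executors
print(eligible_executors("c0", groups, ["c0"]))  # executors of all groups
```

The key distinction is that the independent coordinator's view spans the whole cluster, while a group coordinator's view is confined to its group.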
2 Implementation methods
NDH Impala supports two virtual data warehouse implementations: a static configuration scheme based on ZooKeeper addresses, and a dynamic configuration scheme based on session parameters.
2.1 Static configuration
In this scheme, the coordinator nodes of different node groups register under different ZooKeeper addresses. A Hive JDBC client connects to a given ZooKeeper address to discover the coordinators of the corresponding business group, then connects to one of them and issues SQL requests. Each node group thus has one or more dedicated coordinator nodes, which deliver the execution plans generated from SQL to the executor nodes within the group.
The cluster shown above has 3 virtual data warehouses: group 1, group 2, and group 3. They share the same statestored and catalogd, and therefore the same warehouse metadata. The impalad resources of different virtual data warehouses are physically isolated: the coordinators of a virtual data warehouse only send queries to the executor nodes of their own group. In a production environment, multiple virtual data warehouses can be configured to receive query requests from different kinds of services, so that the queries of different services are isolated from each other in their use of compute resources. In the figure, group 1 serves ad-hoc queries, group 2 serves BI reports, and group 3 serves self-service data retrieval for BI users. Compared with running multiple clusters, the multi-virtual-data-warehouse approach requires fewer resources and is more flexible to configure.
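The client side of the static scheme can be sketched as a registry lookup: each group's coordinators register under their own ZooKeeper path, and the client picks one at random from the path it was configured with. The paths and names below are hypothetical, and the real discovery happens inside the Hive JDBC driver.

```python
import random

# Illustrative registry: ZK path -> coordinators registered under it.
registry = {
    "/impala/bi":     ["coord-1a", "coord-1b"],  # BI virtual warehouse
    "/impala/adhoc":  ["coord-2a"],              # ad-hoc virtual warehouse
}

def pick_coordinator(zk_path, rng=random):
    """Mimic Hive JDBC: choose a random coordinator under the ZK path."""
    return rng.choice(registry[zk_path])

# A BI client configured with /impala/bi only ever reaches BI coordinators,
# so its queries can only run on the BI group's executors.
print(pick_coordinator("/impala/bi"))
```

Because isolation is enforced by which path a client connects to, no query-time cooperation from the client is needed.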
2.2 Dynamic Routing
This scheme adds a query option named request_group to the session. After a set request_group=xxx statement, the coordinator automatically routes queries to the specified group for execution. The default value of request_group is default, and the corresponding default group_name is also default; in other words, if request_group is not set, queries are sent to the group named default.
In this scheme the coordinator nodes are shared and only the executor nodes are grouped, which is closer to Snowflake's virtual warehouse design. As shown in the figure below, there are 2 shared coordinators and 3 groups. Since none of them is named default, one group (say grp1) can be configured as the default group. Because routing is controlled dynamically by a session parameter, this scheme is more flexible than the ZooKeeper-based one: users can freely switch their queries between virtual data warehouses as needed.
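The request_group routing rule described above can be sketched in a few lines. The group names here are hypothetical; in the real system this resolution happens inside the coordinator's scheduler.

```python
# Illustrative mapping: group name -> executor nodes of that group.
executor_groups = {
    "default": ["e1", "e2"],   # the fallback group named "default"
    "etl":     ["e3", "e4"],
}

def resolve_group(session_options):
    """Pick the executor group for a session's queries."""
    # An unset request_group falls back to the group named "default".
    name = session_options.get("request_group", "default")
    return executor_groups[name]

print(resolve_group({}))                         # default group's executors
print(resolve_group({"request_group": "etl"}))   # after: set request_group=etl
```

A session can retarget its queries at any time simply by setting the option again, which is what makes this scheme more flexible than static ZooKeeper registration.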
Both schemes have been implemented. Since NDH production environments generally access Impala through Hive JDBC via ZooKeeper, the static scheme is more compatible, and virtual data warehouses deployed online today mainly use it. The advanced virtual data warehouse features introduced below therefore focus on the static scheme.
3 Main Features
3.1 Horizontal expansion
If the resources and concurrency of a single node group have reached their bottleneck, simply adding nodes to that group cannot effectively increase the number of concurrent queries. In this case, a node group of the same or similar size can be added to the virtual data warehouse, with the coordinators of the new group registering under the same ZooKeeper address as the original group. Because Hive JDBC picks a coordinator at random from the addresses under a ZooKeeper path, the query load is balanced across the old and new node groups, and the cluster's query concurrency grows nearly linearly.
As shown in the figure above, the Impala cluster has 2 virtual data warehouses whose node groups are group1 and group3, serving BI reports and the business's ABTest scenarios respectively. Assume group1 is the original group with 3 impalad nodes (1 coordinator, 2 executors). The newly added group2 also has 3 impalad nodes and uses the same configuration as group1, achieving horizontal expansion.
3.2 Transparent scaling
NDH Impala can add or remove impalad nodes in a virtual data warehouse's node groups online according to each warehouse's load, achieving dynamic resource scaling between groups. When an impalad process is taken offline through Impala's graceful shutdown mechanism, the node first stops accepting new query requests, waits for the query fragments executing on it to complete, and only then shuts down. Queries already running on the node are therefore never terminated abnormally, and users notice nothing. In production, an NDH Impala cluster configured with multiple virtual data warehouses can analyze historical query patterns, combine them with the system load of the impalad nodes in each group, and dynamically move nodes between warehouses to make fuller use of each node's resources.
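The drain behavior of a graceful shutdown can be modeled as a small state machine: once draining, the node rejects new fragments but keeps running existing ones, and it is only safe to exit once the set of in-flight fragments is empty. This is a sketch of the idea, not Impala's actual implementation, which also enforces a shutdown deadline.

```python
class Executor:
    """Toy model of an impalad executor during graceful shutdown."""
    def __init__(self):
        self.draining = False
        self.running_fragments = set()

    def accept(self, fragment_id):
        if self.draining:
            return False           # scheduler must pick another executor
        self.running_fragments.add(fragment_id)
        return True

    def finish(self, fragment_id):
        self.running_fragments.discard(fragment_id)

    def graceful_shutdown(self):
        self.draining = True
        # Safe to exit only once every in-flight fragment has completed.
        return len(self.running_fragments) == 0

e = Executor()
e.accept("f1")
print(e.graceful_shutdown())   # False: f1 still running, keep waiting
print(e.accept("f2"))          # False: no new work while draining
e.finish("f1")
print(e.graceful_shutdown())   # True: drained, safe to stop the process
```

Because no fragment is ever killed, the shutdown is invisible to users, which is what makes moving nodes between warehouses "transparent".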
Take NetEase Cloud Music as an example. Its BI self-service data retrieval (easyfetch) queries mostly occur during working hours, while its BI reports require a large number of report-preloading operations before users start work (the report SQL is issued in advance and the results are cached to improve the report-viewing experience). We can configure easyfetch and BI reports as two virtual data warehouses in the same NDH Impala cluster, and before working hours move most of the impalad nodes of the easyfetch warehouse into the BI report warehouse to speed up report preloading.
Of course, transparent scaling is not limited to moving nodes between virtual data warehouses. In cloud environments, container or virtual machine resources can be requested through Kubernetes or a similar scheduler and quickly brought online during load peaks, then released back to the cloud vendor once the peak has passed.
4 Advanced functions
Compared with Impala resource queues, the coordinator nodes in a virtual data warehouse's node group never use the compute resources (executors) of other groups. Resource isolation is therefore more thorough, and the query performance of different business modules cannot affect each other. However, the load of the businesses behind different virtual data warehouses differs, which can leave resources underutilized. To improve the resource utilization of idle node groups, the virtual data warehouse feature was further enhanced with hybrid grouping and time-division multiplexing.
4.1 Hybrid grouping
Hybrid grouping allows an executor node to belong to two or more node groups at the same time, as shown in the figure below. The left sub-figure shows the normal mode: assume the NDH Impala cluster is divided into two virtual data warehouses, BI reports and ad-hoc queries. Ad-hoc queries are time-sensitive, concentrated during working hours, and have low concurrency. Through hybrid grouping, the deployment can be transformed into the mode shown in the right sub-figure.
In the figure, n1 and n2 are the coordinator nodes of node group group1 and register under the ZooKeeper path youdata. The Hive JDBC client obtains one of these coordinators from the path and submits a query to it. The coordinator parses and optimizes the query, generates a distributed execution plan, and finally sends it to n3~n7 for execution. Nodes n6 and n7 are also executor nodes of group4, whose coordinators are n8 and n9; these receive queries from the ZooKeeper path Ad-Hoc, generate distributed execution plans, and send them to n6~n8.
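The essence of hybrid grouping is that group membership is a many-to-many relation: an executor can appear in several groups' executor sets. A minimal sketch, using the node names from the figure (the group layout is illustrative):

```python
# Executor membership per node group; n6 and n7 belong to both groups.
group_members = {
    "group1": {"n3", "n4", "n5", "n6", "n7"},  # BI reports
    "group4": {"n6", "n7"},                    # ad-hoc queries
}

def groups_of(node):
    """All node groups a given executor belongs to."""
    return {name for name, members in group_members.items() if node in members}

print(groups_of("n6"))   # hybrid member: serves both warehouses
print(groups_of("n3"))   # ordinary member: serves BI reports only
```

The trade-off is that a hybrid node's resources are shared: when both warehouses are busy at once, queries on n6/n7 contend, which is why the next feature constrains sharing to specific time windows.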
4.2 Time division multiplexing
Time-division multiplexing is another advanced feature that improves resource utilization. By automatically reconfiguring the cluster's group resources during specific time periods, it relieves the query pressure on heavily loaded groups and improves the user experience.
In terms of implementation, the same coordinator can register under multiple ZooKeeper addresses, and each registration can be given an effective time window. As shown in the figure above, every day from 8:00 pm to 8:00 am, the two coordinators n8 and n9 of the Ad-Hoc virtual data warehouse (or one of them) can also register under the ZooKeeper address of the BI report virtual data warehouse to share its query load.
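The time-window check behind such a registration can be sketched as follows, using the 20:00 to 08:00 window from the example. Note the window wraps past midnight, so the comparison logic must handle start > end; the function and its parameters are hypothetical.

```python
from datetime import time

def registration_active(now, start=time(20, 0), end=time(8, 0)):
    """Is the extra ZK registration in effect at wall-clock time `now`?"""
    if start <= end:
        return start <= now < end
    # Window wraps past midnight (e.g. 20:00 -> 08:00 next day).
    return now >= start or now < end

print(registration_active(time(23, 30)))  # inside the overnight window
print(registration_active(time(10, 0)))   # working hours: registration off
```

Outside the window the coordinator simply disappears from the BI report path, so clients naturally stop routing BI queries to it.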
Compared with hybrid grouping, time-division multiplexing is only suitable between node groups of similar size, which ensures there is no obvious difference in query performance across groups.
4.3 Load-Based Node Selection
Executor nodes can end up with unbalanced compute resource usage for a variety of reasons. For example, data skew may force some executor nodes to spend more compute resources scanning and processing data, or the hybrid grouping feature may overload the nodes shared between groups.
NDH Impala makes two optimizations for this problem. The first is to distribute query execution based on executor load: when building the distributed execution plan for a query, the scheduler considers each executor's currently available compute resources and excludes executors with too few. The second is to limit, when multiple queues exist, the total resources that the queries of any one queue can use on a single executor, so that no queue monopolizes an executor's resources.
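Both optimizations can be sketched together. The threshold, the per-queue cap, and the data fields are invented for the example; the real scheduler and admission controller work on richer metrics.

```python
def schedulable_executors(executors, min_free_mem_gb=4):
    """Optimization 1: drop executors with too little free memory
    before building the distributed execution plan."""
    return [e for e in executors if e["free_mem_gb"] >= min_free_mem_gb]

def admit(queue_usage, queue, cost_gb, per_queue_cap_gb=16):
    """Optimization 2: cap how much of one executor a single queue
    may occupy, so no queue monopolizes the node."""
    used = queue_usage.get(queue, 0)
    if used + cost_gb > per_queue_cap_gb:
        return False               # reject: queue would exceed its cap
    queue_usage[queue] = used + cost_gb
    return True

execs = [{"name": "e1", "free_mem_gb": 2}, {"name": "e2", "free_mem_gb": 30}]
print([e["name"] for e in schedulable_executors(execs)])  # e1 excluded

usage = {}
print(admit(usage, "bi", 10))   # first BI query admitted
print(admit(usage, "bi", 10))   # second rejected: 20 GB > 16 GB cap
```

Together these keep skewed data or hybrid-group contention on one node from dragging down every query scheduled onto it.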
5 Summary
This section introduced the origin and implementations of the virtual data warehouse concept, focusing on NDH Impala's exploration, design thinking, and practical use in this area. Virtual data warehouses already have successful application cases in NetEase's internet businesses and in the clusters of NetEase Shufan's commercial customers.
The author believes the virtual data warehouse should be a required capability of the new generation of analytical data warehouses: it isolates complex and diverse business workloads from each other and lets the execution engine deliver its full capabilities. Finally, it should be noted that the virtual data warehouse is a cloud-native feature; an environment with elastic compute resources maximizes its value.
About the author: Rong Ting, database technical expert at NetEase Shufan, with more than 10 years of database experience, currently responsible for the research and development of high-performance data warehouses and cloud-native databases.