With the continuous development of Internet technology, the volume of data processed across all industries grows day by day. Hadoop, a revolutionary technology, made it possible to process massive data sets. Spark, which followed, greatly improved computing performance on Hadoop and addressed Hadoop's performance problems, and it has been highly regarded by the big data industry. But in 2022, is Spark still the best choice for the big data industry?

After years of development, the Hadoop ecosystem has been widely adopted around the world. Many companies have built big data platforms on the Hadoop ecosystem and have attempted more in-depth applications, such as data warehouse migration. In analytical scenarios, Hive and Spark are the main components and play a major role.

SQL support on Hadoop originally came from Apache Hive. Hive's built-in computing engine is MapReduce, which is disk-oriented and therefore limited by disk read/write and network I/O performance. It is inefficient for data queries, and its main application scenario is batch processing. Spark addresses this deficiency by keeping data in memory and computing on it there: intermediate outputs and results can be stored in memory, which saves a great deal of disk I/O. Spark also uses a DAG scheduler, a query optimizer, and a physical execution engine to achieve high performance for both batch and streaming data.
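The contrast between disk-oriented stages and in-memory chaining can be illustrated with a toy sketch. This is not how MapReduce or Spark are implemented; the function names `run_stages_on_disk` and `run_stages_in_memory` are hypothetical, and JSON files in a temporary directory merely stand in for intermediate output written between stages.

```python
import json
import os
import tempfile


def run_stages_on_disk(records, stages, workdir):
    """Disk-oriented style: each stage writes its full output to a file,
    and the next stage reads that file back before doing any work."""
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(list(records), f)
    files_written = 1
    for i, stage in enumerate(stages, start=1):
        with open(path) as f:
            data = json.load(f)  # read the previous stage's output
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump([stage(x) for x in data], f)  # materialize on disk
        files_written += 1
    with open(path) as f:
        return json.load(f), files_written


def run_stages_in_memory(records, stages):
    """In-memory style: stages are chained and intermediate results
    stay in memory between them."""
    data = list(records)
    for stage in stages:
        data = [stage(x) for x in data]
    return data


stages = [lambda x: x * 2, lambda x: x + 1]
with tempfile.TemporaryDirectory() as tmp:
    disk_result, files_written = run_stages_on_disk(range(5), stages, tmp)
memory_result = run_stages_in_memory(range(5), stages)
print(disk_result == memory_result, files_written)
```

Both paths compute the same result; the difference is that the disk-oriented path pays a write and a read for every stage boundary, which is the cost the paragraph above attributes to MapReduce.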

OushuDB is a data warehouse developed by Oushu Technology (偶数科技). It relies on a series of changes in underlying technology, such as cloud-native features, an architecture that separates compute from storage, strong transaction support, complete SQL standard support, and high-performance parallel execution. These in turn enable changes at the upper layer, such as high elasticity, high performance, strong scalability, and strong compatibility, which ultimately help enterprises cope with the trends toward large scale, strong sensitivity, high timeliness, and intelligence.

This time we will compare the performance of OushuDB and Spark 3.0.

Which is stronger at data queries?

To compare the query capabilities of Spark and OushuDB more intuitively, we test both with TPC-H, a business intelligence computing benchmark. TPC-H was developed by the US Transaction Processing Performance Council (TPC). It is a test suite that simulates decision support applications, and it is widely used in academia and industry to evaluate data query processing capabilities.

TPC-H, an international database benchmark standard, includes 22 queries (Q1 to Q22). Our main evaluation metric is the response time of each query, that is, the time from submitting the query to receiving the result. The nodes are tested with a dataset at a scale factor of 100.
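The response-time metric described above can be sketched as a small timing harness. This is only an illustration under stated assumptions: `sqlite3` is a stand-in engine so the example runs anywhere, the helper `time_query`, the tiny `lineitem` table, and the query name `Q_demo` are hypothetical, and none of this is the harness used for the article's test.

```python
import sqlite3
import time


def time_query(conn, sql):
    """Return (rows, elapsed_seconds), measured from submitting the query
    until the complete result has been returned."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()  # fetchall() pulls the whole result
    elapsed = time.perf_counter() - start
    return rows, elapsed


# Stand-in engine and toy data (assumption: not the TPC-H dataset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (l_returnflag TEXT, l_quantity REAL)")
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?)",
    [("A", 1.0), ("A", 2.0), ("R", 5.0)],
)

queries = {
    "Q_demo": (
        "SELECT l_returnflag, SUM(l_quantity) FROM lineitem "
        "GROUP BY l_returnflag ORDER BY l_returnflag"
    ),
}

timings = {}
for name, sql in queries.items():
    rows, elapsed = time_query(conn, sql)
    timings[name] = elapsed
    print(f"{name}: {elapsed:.6f} s, {len(rows)} rows")
```

Fetching the full result inside the timed region matters: stopping the clock when the cursor is created would measure only the submission, not the time until the result is returned.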

Test environment

Server configuration

CPU: 2x 10-core Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 40 threads with Hyper-Threading
RAM: 256GB
Hard disk: 4 x 1000GB SSD
Operating system: CentOS 7.4

Software versions compared

OushuDB 4.0
Spark 3.0

Database parameters

Spark
[Image: Spark parameter settings]

OushuDB
[Image: OushuDB parameter settings]
Note: To test at the same resource level and stay closer to actual production, the core and memory settings are the same for both systems: 16 cores and 1 GB, respectively.
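As a rough illustration of how such a resource setting might be expressed on the Spark side, the sketch below turns a dictionary of Spark properties into `spark-submit --conf` arguments. The mapping of "16 cores and 1 GB" onto `spark.executor.cores` and `spark.executor.memory` is an assumption made for this example; the parameters actually used in the test are those in the settings image, and the helper `spark_submit_conf_args` is hypothetical.

```python
def spark_submit_conf_args(settings):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Assumption: the article's "16 cores and 1 GB" mapped onto executor
# properties. The real parameter names and values may differ.
assumed_settings = {
    "spark.executor.cores": 16,
    "spark.executor.memory": "1g",
}

conf_args = spark_submit_conf_args(assumed_settings)
print(" ".join(conf_args))
```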

Table properties

[Image: table properties]
Note: For data distribution, OushuDB can set and control the "bucket number" at the table level, which directly affects resource usage.
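The general idea of distributing rows into a fixed number of buckets can be sketched as follows. This is a generic hash-bucketing illustration, not OushuDB's actual distribution algorithm: the use of CRC32 as the hash and the helper names `bucket_for` and `distribute` are assumptions made for the example.

```python
import zlib
from collections import Counter


def bucket_for(key, bucket_num):
    """Map a distribution key to a bucket using a stable hash.
    CRC32 is an assumption here; a real engine may hash differently."""
    return zlib.crc32(str(key).encode("utf-8")) % bucket_num


def distribute(keys, bucket_num):
    """Count how many rows land in each bucket."""
    counts = Counter(bucket_for(k, bucket_num) for k in keys)
    return [counts.get(b, 0) for b in range(bucket_num)]


rows_per_bucket = distribute(range(10_000), bucket_num=8)
print(rows_per_bucket)
```

In a scheme like this, the bucket count fixes how many pieces a table is split into, and so how many units of work can be processed in parallel, which is one way a table-level bucket number can influence resource usage.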

How the data is generated

The text data for the TPC-H test is generated in advance with dbgen. OushuDB uses external tables to import the data in parallel and analyzes it. OushuDB then uses a writable external table to write the imported data to a specified HDFS directory, from which Spark imports it.

Spark creates an external table that points to the HDFS files written out by OushuDB, and then imports the data.
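For readers unfamiliar with dbgen output, the text files are pipe-delimited, with a trailing `|` at the end of each row. The sketch below parses such rows into dictionaries. The sample lines are made up for illustration (they are not real dbgen output), and the helper `parse_tbl` is hypothetical; the column names follow the TPC-H `region` table.

```python
import csv
import io


def parse_tbl(text, columns):
    """Parse pipe-delimited rows that end with a trailing '|' into dicts."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if fields[-1] == "":
            fields = fields[:-1]  # drop the empty field after the trailing '|'
        rows.append(dict(zip(columns, fields)))
    return rows


# Made-up sample rows in the dbgen text layout.
sample = "0|AFRICA|sample comment|\n1|AMERICA|another comment|\n"
region_rows = parse_tbl(sample, ["r_regionkey", "r_name", "r_comment"])
print(region_rows)
```

An external table definition over these files has to declare the same delimiter, which is why the layout matters when both engines load from the same generated data.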

Comparison of running results

[Images: comparison of running results]

Summary

Spark's new Adaptive Query Execution (AQE) framework improves Spark performance only in some scenarios. In this TPC-H test, thanks to its new SIMD executor, OushuDB outperforms Spark by up to 55 times, and its overall performance across all 22 queries is more than 8 times that of Spark. For large-scale data queries in practical application scenarios across industries, OushuDB's advantages are quite obvious.
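Figures such as "up to N times" and "overall N times" can be derived from per-query response times in more than one way. The sketch below shows one plausible reading: the per-query speedup is the ratio of the two response times, and the overall speedup is the ratio of the total times. The timings are hypothetical placeholders, not the article's measurements, and the definition of "overall" is an assumption.

```python
def speedups(baseline_seconds, candidate_seconds):
    """Return (per-query speedup, overall speedup) of candidate over baseline.
    Overall is defined here as the ratio of total times (an assumption)."""
    per_query = {
        q: baseline_seconds[q] / candidate_seconds[q] for q in baseline_seconds
    }
    overall = sum(baseline_seconds.values()) / sum(candidate_seconds.values())
    return per_query, overall


# Hypothetical timings in seconds, for illustration only.
spark_seconds = {"Q1": 40.0, "Q2": 12.0, "Q3": 30.0}
oushudb_seconds = {"Q1": 4.0, "Q2": 6.0, "Q3": 10.0}

per_query, overall = speedups(spark_seconds, oushudb_seconds)
best_query = max(per_query, key=per_query.get)
print(best_query, per_query[best_query], overall)
```

Note that the maximum per-query speedup and the overall speedup can differ widely, because long-running queries dominate the total while a single short query can produce the largest ratio.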

As a high-performance cloud database, OushuDB supports access to standard ORC files, is highly scalable, follows the ANSI SQL standard, has an extremely fast executor, and provides interactive query capabilities on PB-scale data. It is 5-10x faster than traditional data warehouses/MPPs and 5-30x faster than Hadoop SQL engines. At the same time, by separating compute from storage, OushuDB solves the high cost, high barrier to entry, and difficult maintenance and expansion of traditional data warehouses, allowing enterprise users to easily build core data warehouses, data marts, real-time data warehouses, and integrated lakehouse data platforms. It is the best choice for today's enterprises building a data lakehouse.


偶数科技