Spark vs. OushuDB: Which One Is Dozens of Times Faster?

With the continuous development of Internet technology, the volume of data processed across all industries grows every day. Hadoop was a revolutionary technology that made it possible to process massive data sets. Spark, which followed it, greatly improved computing performance on Hadoop and addressed Hadoop's performance problems, and it has been widely praised in the big data industry. But in 2022, is Spark still the best choice for the big data industry?

After years of development, the Hadoop ecosystem has been widely adopted around the world. Many companies have built big data platforms on it and have attempted deeper applications in analytical scenarios, such as data warehouse migration. Hive and Spark, as its main components, play a major role.

SQL support on Hadoop originally came from Apache Hive. Hive's built-in computing engine is MapReduce, which is disk-oriented and therefore constrained by disk read/write performance and network I/O performance. Queries are not efficient, and its main application scenario is batch processing. Spark addresses this deficiency by keeping data in memory and computing on it there. Spark allows intermediate outputs and results to be stored in memory, which saves a lot of disk I/O, and it uses a DAG scheduler, a query optimizer, and a physical execution engine to achieve high performance for both batch and streaming data.

OushuDB, a data warehouse developed by Even Technology, relies on a series of changes in underlying technologies: cloud-native features, a compute-storage separation architecture, strong transaction support, complete SQL standard support, and high-performance parallel execution. These in turn bring upper-layer improvements such as high elasticity, high performance, strong scalability, and strong compatibility, which ultimately help enterprises cope with data workloads that are increasingly large-scale, sensitive, time-critical, and intelligent.

This time we will compare the performance of OushuDB and Spark 3.0.

Which has stronger query performance?

To compare the query capabilities of Spark and OushuDB more directly, we test both with TPC-H, a decision support benchmark. TPC-H is developed by the Transaction Processing Performance Council (TPC). It is a test suite that simulates decision support applications, and it is widely used in academia and industry to evaluate query processing capability.

TPC-H includes 22 queries (Q1~Q22). Our main evaluation metric is the response time of each query, that is, the time from submitting the query to receiving the result. The tests use a dataset with a scale factor of 100 (roughly 100 GB of raw data).
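The response-time metric described above can be sketched as a small timing harness. This is a minimal sketch, not the harness used in this test: `run_query` is a hypothetical callable standing in for whatever client submits SQL to OushuDB or Spark and blocks until the full result has been returned.

```python
import time
from typing import Callable, Dict


def time_queries(run_query: Callable[[str], object],
                 queries: Dict[str, str]) -> Dict[str, float]:
    """Return the wall-clock response time in seconds for each named query.

    run_query is a hypothetical function (an assumption of this sketch) that
    submits one SQL string and returns only after the result has arrived.
    """
    timings = {}
    for name, sql in queries.items():
        start = time.perf_counter()                   # query is submitted
        run_query(sql)                                # blocks until the result is back
        timings[name] = time.perf_counter() - start   # elapsed response time
    return timings
```

In practice each query would likely be run several times and the runs averaged or the best taken; the article does not say which policy was used here.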

Test environment

Server configuration

CPU: 2 x 10-core Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20 GHz, 40 threads with Hyper-Threading
RAM: 256 GB
Hard disk: 4 x 1000 GB SSD
Operating system: CentOS 7.4

Software versions compared

OushuDB 4.0
Spark 3.0

Database parameters

Spark

OushuDB

Note: To test at the same resource level and stay close to actual production conditions, both systems use the same core and memory settings: 16 cores and 1 GB, respectively.

Table properties

Note: OushuDB can set and control the "bucket number" used for data distribution at the table level, and this setting directly affects resource usage.

How the data is generated

The TPC-H text data is generated in advance with dbgen. OushuDB imports it in parallel through external tables and then analyzes the tables. OushuDB then uses a writable external table to write the imported data out to a specified HDFS directory so that Spark can import it.

Spark creates an external table that points to the HDFS files written out by OushuDB and imports the data from there.
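For readers unfamiliar with dbgen output: it writes each table as a plain-text `.tbl` file with one row per line, fields separated by `|`, and a trailing `|` at the end of every line. A minimal sketch of reading one such row in Python follows. The sample row in the usage note is an illustrative stand-in rather than actual dbgen output, and this is not the import path used in the test, which goes through external tables.

```python
from typing import List


def parse_tbl_line(line: str) -> List[str]:
    """Split one dbgen-style row into its fields.

    Removes the line ending and exactly one trailing '|' delimiter,
    so an empty last field is still preserved.
    """
    line = line.rstrip("\r\n")
    if line.endswith("|"):
        line = line[:-1]
    return line.split("|")
```

For example, `parse_tbl_line("0|AFRICA|sample comment|\n")` yields the three fields `0`, `AFRICA`, and `sample comment`.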

Comparison of running results

Summary

Spark's new Adaptive Query Execution (AQE) framework improves Spark performance only in some scenarios. In this TPC-H test, thanks to its new SIMD executor, OushuDB outperforms Spark by up to 55 times on individual queries, and its overall performance across all 22 queries is more than 8 times that of Spark. For large-scale data queries in practical application scenarios across industries, the advantages of OushuDB are quite clear.
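The two figures quoted above are different kinds of ratio: the maximum is taken per query, while the overall figure covers all 22 queries together. A minimal sketch of that arithmetic follows, assuming the overall figure is the ratio of summed response times (the article does not state how it was computed). The timings in the example are made-up placeholders for two unnamed engines, not the measured results of this test.

```python
from typing import Dict, Tuple


def speedups(baseline: Dict[str, float],
             candidate: Dict[str, float]) -> Tuple[Dict[str, float], float, float]:
    """Compare two sets of per-query response times (in seconds).

    Returns (per-query speedup, maximum per-query speedup, overall speedup),
    where overall speedup = total baseline time / total candidate time.
    """
    per_query = {q: baseline[q] / candidate[q] for q in baseline}
    overall = sum(baseline.values()) / sum(candidate[q] for q in baseline)
    return per_query, max(per_query.values()), overall


# Placeholder timings for illustration only; NOT the results of this test.
engine_a = {"Q1": 40.0, "Q2": 10.0}
engine_b = {"Q1": 4.0, "Q2": 5.0}
per_query, best, overall = speedups(engine_a, engine_b)
```

Under this definition the overall ratio can never exceed the largest per-query ratio, which is why an overall figure sits well below a peak figure.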

As a high-performance cloud database, OushuDB supports access to standard ORC files, is highly scalable, follows the ANSI SQL standard, has an extremely fast executor, and provides interactive query capability over PB-level data. It is 5-10x faster than traditional data warehouses/MPPs and 5-30x faster than Hadoop SQL engines. At the same time, through its compute-storage separation architecture, OushuDB addresses the high cost, high barrier to entry, and difficult maintenance and expansion of traditional data warehouses, allowing enterprise users to easily build core data warehouses, data marts, real-time data warehouses, and integrated lakehouse data platforms. It is the best choice for today's enterprises building a data lakehouse.

Even Technology (偶数科技)
