HUAWEI CLOUD PB-Level Database GaussDB (for Redis) Revealed, Issue 9: Comparison with HBase

Abstract: Gauss Redis combines the advantages of open source Redis and HBase to provide a database service with lower cost, better performance, and more flexibility.

This article is shared from the HUAWEI CLOUD community post "HUAWEI CLOUD PB-Level Database GaussDB (for Redis) Revealed, Issue 9: Comparison with HBase". Original author: Gauss Redis official blog.

0. Preface

HBase is a distributed, column-oriented open source database built on the Hadoop ecosystem. With NoSQL booming, many companies in China and abroad have adopted it for a variety of workloads in modern Internet systems. This article briefly describes the basic architecture and usage scenarios of HBase, focusing on how its key features perform in those scenarios and on the pain points that remain. It then introduces GaussDB (for Redis) (hereinafter Gauss Redis), Huawei's self-developed, strongly consistent, persistent NoSQL database, and shows how it handles the same scenarios and addresses HBase's pain points.

1. A brief introduction to HBase

Physically, HBase consists mainly of ZooKeeper, HMaster, RegionServer, and HDFS. ZooKeeper provides high availability for HMaster, monitors RegionServers, serves as the entry point to metadata, and maintains cluster configuration. HMaster maintains the Region information of the entire cluster, processes metadata changes, and performs load balancing. A RegionServer handles user read and write requests directly: it performs reads, writes, splits, and other tasks for the Regions assigned to it, and uses a write-ahead log (WAL) for fault tolerance. HDFS is the underlying distributed storage for both metadata and table data, and keeps multiple replicas of the data to ensure high reliability and high availability.

Logically, the RowKey is the primary key of a table, and rows are sorted in lexicographical order of RowKey. When an HRegion grows to a certain size, it is split by RowKey range. A ColumnFamily partitions the table vertically, grouping multiple Columns for management. In HBase, the ColumnFamily is part of the table schema but the Column is not. A Cell holds the actual stored value. All data in HBase is stored as byte arrays.

2. Where HBase shines

2.1 Storing tag data

Tag data is a typical sparse matrix. It describes the various attributes of an entity and is mainly used in fields such as intelligent recommendation, business intelligence, and marketing engines.

Suppose three users have left a large amount of behavioral data in different apps from the same company. The data includes profile information the users entered directly, their specific in-app behavior, and annotations of certain phenomena by domain experts. From this data, a tagging algorithm can derive a table with one row per user and one column per tag.

Because the collection of user behavior is limited, different tags are available for different users, and many cells in the table have to be left empty. This is the so-called sparse matrix. Moreover, as users engage more deeply with the apps, more of their interests and non-interests will be discovered, and the number of columns in the table will grow accordingly.

This is disastrous for MySQL. The table structure must be defined when a MySQL table is created, so dynamically adding and removing attributes is a heavy workload, and storing a large number of NULL values can push storage costs to unacceptable levels. With HBase, a column that has no value occupies no storage space, so limited resources are used efficiently. An HBase table only needs its ColumnFamilies specified at creation time, and adding or removing Columns is extremely easy, which makes it simple to accommodate future attributes.
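The difference between sparse and fixed-schema storage can be roughly illustrated with a small sketch. The user names, column names, and values below are hypothetical and not taken from the original post, and the cell count is only a stand-in for real storage cost, which depends on the engine.

```python
# Hypothetical tag data: each user only has the tags that were actually observed.
# Column names follow a "family:qualifier" pattern purely for illustration.
sparse_rows = {
    "user1": {"profile:age": "25", "interest:sports": "high"},
    "user2": {"interest:music": "medium"},
    "user3": {"profile:city": "Hangzhou", "interest:sports": "low",
              "interest:travel": "high"},
}

def all_columns(rows):
    """Collect every column name that appears in at least one row."""
    return sorted({col for cells in rows.values() for col in cells})

def to_fixed_schema(rows):
    """Expand to a relational-style table: every row carries every column,
    with None standing in for NULL."""
    columns = all_columns(rows)
    return {key: {col: cells.get(col) for col in columns}
            for key, cells in rows.items()}

def stored_cells(rows):
    """Count the cells that are physically present in a table."""
    return sum(len(cells) for cells in rows.values())

fixed = to_fixed_schema(sparse_rows)
sparse_count = stored_cells(sparse_rows)  # only observed values
fixed_count = stored_cells(fixed)         # includes NULL placeholders

# Adding a new tag later touches one row only; no schema change is involved.
sparse_rows["user2"]["interest:gaming"] = "high"
```

In this toy data the sparse layout holds 6 cells while the fixed layout holds 15, and the gap widens as more rarely-filled columns are added.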

2.2 Collecting Internet of Vehicles data

An Internet of Vehicles system uses on-board equipment to collect the data generated during vehicle operation, uploads it over the network in real time, and analyzes and uses it dynamically on the platform.

The data in an Internet of Vehicles system has two characteristics: a large number of vehicle terminals continuously write TB-level or even PB-level data with high concurrency, and real-time analysis requires low-latency queries to keep the results timely.

HBase adopts the LSM (log-structured merge) storage model, which copes well with highly concurrent writes while keeping read latency within an acceptable range. HBase also scales well horizontally: storage capacity can be adjusted dynamically by adding or removing RegionServers to meet cost requirements.
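The write-friendly nature of the LSM model can be sketched in a few lines. This is a toy illustration, not HBase's implementation: the class name `TinyLSM`, the flush threshold, and the sample keys are assumptions, and the write-ahead log and compaction are left out.

```python
import bisect

class TinyLSM:
    """Toy LSM store: writes go to an in-memory table, which is flushed
    to an immutable sorted segment once it reaches a size threshold."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.segments = []  # oldest first; each is a sorted list of (key, value)

    def put(self, key, value):
        # A write only touches memory, which is why writes are cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Persisting is one sequential write of already-sorted data.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Reads check the newest data first: memtable, then segments
        # from newest to oldest, using binary search within each segment.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

store = TinyLSM(memtable_limit=2)
store.put("car1:t1", "60km/h")
store.put("car2:t1", "45km/h")  # reaches the limit and triggers a flush
store.put("car1:t1", "62km/h")  # newer value lives in the memtable
```

A read may have to consult several segments, which is the price paid for cheap writes and the reason real LSM stores merge segments in the background.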

2.3 Retaining transaction records

In mobile payment, protecting sensitive information such as historical transaction records is an important topic. When a data center suffers a natural disaster or an external attack, this information must not be lost. From a business perspective, the RTO (recovery time objective) must be as short as possible and the RPO (recovery point objective) as close to zero as possible.

HBase uses HDFS as its underlying storage system. HDFS stores three replicas by default and places them on different nodes or racks according to defined rules, which gives it strong disaster tolerance. Engineering practice has also produced strategies such as Region replicas, active-standby clusters, mutual backup, and active-active deployment to strengthen disaster recovery and ensure high availability.

3. HBase is not omnipotent

As the three examples above show, HBase's design lets it perform very well in sparse matrix storage, in absorbing high-concurrency and high-traffic writes, and in scenarios that demand high availability and high reliability. That does not mean HBase suits every scenario or has no weaknesses.

3.1 HBase's Achilles' Heel

1. The Juliet pause

No Java system can avoid the topic of Full GC. When a Full GC causes a stop-the-world (STW) pause in a RegionServer, ZooKeeper stops receiving its heartbeat, judges the node to be down, and lets other nodes take over its data. When the Full GC ends, the RegionServer shuts itself down to prevent split brain. This is known as the Juliet pause. Avoiding it as far as possible generally requires experienced Java programmers to tune the GC strategy for the business scenario.
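The timing condition behind the Juliet pause can be sketched as follows. This is a simplified model rather than ZooKeeper's actual session logic: the function name, the heartbeat timeline, and the 30-second timeout are all hypothetical.

```python
def expired_gaps(heartbeat_times_ms, session_timeout_ms):
    """Return the (previous, next) heartbeat pairs whose gap exceeds the
    session timeout, i.e. the windows in which a coordinator would have
    declared the node dead even though it was only paused."""
    gaps = []
    for prev, curr in zip(heartbeat_times_ms, heartbeat_times_ms[1:]):
        if curr - prev > session_timeout_ms:
            gaps.append((prev, curr))
    return gaps

# Hypothetical timeline: a heartbeat every second, then a 45 s STW pause.
heartbeats = [0, 1000, 2000, 47000, 48000]
juliet_pauses = expired_gaps(heartbeats, session_timeout_ms=30000)
```

With these numbers the single gap from 2000 ms to 47000 ms exceeds the timeout, so the node comes back from its pause only to find it has already been replaced. Raising the timeout hides the pause but also delays detection of real failures, which is why the text points to GC tuning instead.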

2. Few data types

The only storage type HBase supports is the byte array. Strings, complex objects, and even images must be converted into byte arrays before they are stored. Such storage can only express loose data relationships. For data structures or relationships such as sets, queues, and maps, developers have to write their own conversion logic, which is inflexible.
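As a rough sketch of what such conversion logic looks like, the following stores a set of strings in a single byte-array cell. JSON is just one possible encoding chosen for illustration, and the helper names are hypothetical.

```python
import json

def encode_set(members):
    """Serialize a set of strings into bytes (sorted for a stable encoding)."""
    return json.dumps(sorted(members)).encode("utf-8")

def decode_set(raw):
    """Deserialize bytes produced by encode_set back into a set."""
    return set(json.loads(raw.decode("utf-8")))

def add_member(raw, member):
    """Read-modify-write: decode the whole cell, change it, re-encode it."""
    members = decode_set(raw)
    members.add(member)
    return encode_set(members)

cell = encode_set({"sports", "music"})
cell = add_member(cell, "travel")
```

Every update rewrites the whole cell, and concurrent writers would need extra coordination. A store with a native set type performs the same change with a single command.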

3. Performance bottlenecks

HBase stores data in Regions divided by the lexicographic order of RowKey, so a poor RowKey design leads to uneven load. If a large number of requests hit one Region and form a hot spot, the I/O of the RegionServer hosting it may be overwhelmed.
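The hot-spot effect can be reproduced with a small model of range partitioning. The split points and key format below are hypothetical. The hash-prefix ("salting") variant is a commonly used RowKey mitigation that the original post does not discuss, and it is shown here only for contrast.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

# Hypothetical split points giving four regions:
# [.., "4"), ["4", "8"), ["8", "c"), ["c", ..]
SPLIT_POINTS = ["4", "8", "c"]

def region_for(row_key):
    """Range partitioning: index of the region whose range contains row_key."""
    return bisect_right(SPLIT_POINTS, row_key)

def salted(row_key):
    """Prefix the key with one hex character derived from its MD5 hash."""
    prefix = hashlib.md5(row_key.encode("utf-8")).hexdigest()[0]
    return prefix + "_" + row_key

# Monotonically increasing keys, for example timestamp-like identifiers.
keys = [f"20210601{n:06d}" for n in range(1000)]

plain_load = Counter(region_for(k) for k in keys)
salted_load = Counter(region_for(salted(k)) for k in keys)
```

All plain keys start with "2" and therefore land in the first region, while the salted keys spread across all four. The cost of salting is that rows are no longer stored in their natural order, so ordered range scans over the original keys become harder.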

After a RegionServer goes offline, ZooKeeper has to detect that the node is down, the Regions it was responsible for have to be reassigned to other nodes, and the Region information in the meta table has to be updated. During this process the data on that RegionServer is unavailable, and requests for it are blocked.

3.2 Redis' Wings of Icarus

3.2.1 Good performance of open source Redis

Open source Redis addresses some of HBase's pain points, because it has the following advantages:

1. Richer data types

Redis 5.0 provides nine data types (String, List, Set, ZSet, Hash, Bit Array, HyperLogLog, Geospatial Index, and Streams) along with operations on each of them. Compared with HBase's single data type, Redis gives developers more ways to express relationships between data.

2. The smoothness of pure in-memory operation

Open source Redis is essentially a key-value in-memory database: the entire dataset is loaded into memory for operation. As a result, its response speed and processing capacity far exceed those of HBase, which requires disk I/O. Many test results show that open source Redis can reach 100,000 reads and writes per second.

3.2.2 Significant weaknesses of open source Redis

Operating purely in memory also gives open source Redis unavoidable weaknesses, mainly in the following two respects:

1. A nightmare at large data volumes

As the amount of data keeps growing, limited memory becomes a constraint. Holding the full dataset then requires more memory, and memory is far more expensive than disk, so costs surge. In addition, the memory of a typical server is measured in gigabytes, which severely limits the competitiveness of open source Redis for very large datasets.

2. What to do after a crash

Another drawback of pure in-memory operation is that all data is lost when the process goes down. The existing solution is to persist data with AOF or RDB so that it can be restored into memory after the process restarts. Neither method is perfect. AOF is a log of the commands that were executed, so recovery is relatively slow. RDB dumps the in-memory data periodically, so writes made after the last dump can be lost. In addition, in the worst case half of the memory has to be reserved for the copy-on-write overhead of the forked persistence process, which lowers memory utilization.
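The trade-off between the two persistence styles can be sketched with a dict-based store. This is a simplified model, not Redis's on-disk formats: the command tuples, the snapshot interval of three commands, and the helper name are hypothetical, and the AOF fsync policy is ignored.

```python
import copy

def apply_command(state, command):
    """Apply an (op, key, value) command to a dict-based store."""
    op, key, value = command
    if op == "SET":
        state[key] = value
    elif op == "DEL":
        state.pop(key, None)

commands = [
    ("SET", "a", "1"),
    ("SET", "b", "2"),
    ("SET", "a", "3"),
    ("DEL", "b", None),
    ("SET", "c", "4"),
]

live, aof, snapshot = {}, [], {}
for i, cmd in enumerate(commands, start=1):
    apply_command(live, cmd)
    aof.append(cmd)                     # AOF style: log every write command
    if i % 3 == 0:                      # RDB style: periodic full snapshot
        snapshot = copy.deepcopy(live)

# Simulated crash: recover from each persistence style.
from_rdb = copy.deepcopy(snapshot)      # quick to load, but stale
from_aof = {}
for cmd in aof:                         # complete, but replays every command
    apply_command(from_aof, cmd)
```

Here the snapshot recovery still contains the deleted key `b` and misses `c`, while the log replay matches the live state at the cost of re-executing all five commands.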

4. Gauss Redis: adults don't have to choose

HBase and open source Redis each have their own strengths. A familiar saying comes to mind: only children have to choose, adults want it all. Gauss Redis has the advantages of both and so better meets the needs of database services.

Compatible with the Redis 5.0 protocol

Gauss Redis retains the rich data types of open source Redis, providing more options for describing data and data relationships. For example, in the sparse matrix scenario the Hash type can be used, with no need to define a ColumnFamily as in an HBase table, so data can be organized more flexibly.
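As a minimal sketch of the Hash approach, the helper below builds the arguments of a single `HSET` command that writes all known tags of one user. The `user:<id>` key naming, the tag names, and the helper itself are assumptions for illustration. The resulting list could be passed to any Redis client that accepts raw command arguments.

```python
def hset_command(user_id, tags):
    """Build the argument list for one HSET call that writes every tag a
    user actually has. Tags the user lacks are simply not sent."""
    if not tags:
        raise ValueError("HSET needs at least one field/value pair")
    args = ["HSET", f"user:{user_id}"]
    for field, value in tags.items():
        args.extend([field, str(value)])
    return args

cmd = hset_command(1001, {"age": 25, "interest:sports": "high"})
```

A new tag is added later with another `HSET` on the same key, without declaring anything in advance.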

Performance on par with open source Redis

As shown in [Huawei Cloud GaussDB (for Redis) and open source Redis cluster performance comparison], the performance of Gauss Redis is almost the same as that of open source Redis. In high-traffic, high-concurrency scenarios, it provides better read and write performance than HBase.

Higher disaster recovery reliability

Gauss Redis is built on a storage layer provided by DFV, Huawei's self-developed, distributed, strongly consistent data lake. The 3AZ feature has been launched at some sites. AZs are physically isolated from one another in ventilation, cooling, water, and power, so the failure of one AZ does not affect the others. Compared with HBase, this better ensures the reliability of critical data.

Elastic scaling within seconds

Gauss Redis uses an architecture that separates storage from compute. Data sinks to the storage pool, so scaling compute nodes out or in only modifies the mapping and does not relocate data. This enables smooth scaling within seconds and avoids the data unavailability that HBase experiences when Regions go online and offline.
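The idea that scaling only changes a mapping can be sketched as follows. This is a conceptual model, not Gauss Redis's actual implementation: the slot count, node names, and round-robin assignment are all assumptions.

```python
def assign_slots(num_slots, nodes):
    """Map each slot to a compute node round-robin; returns {slot: node}."""
    return {slot: nodes[slot % len(nodes)] for slot in range(num_slots)}

# Shared storage pool: data is keyed by slot and is never moved in this model.
storage_pool = {slot: f"data-for-slot-{slot}" for slot in range(8)}

def read(slot, mapping):
    """Any node can serve a slot because the data lives in the shared pool."""
    return mapping[slot], storage_pool[slot]

before = assign_slots(8, ["node-a", "node-b"])
after = assign_slots(8, ["node-a", "node-b", "node-c"])  # scale out

# Slots whose owner changed; only their mapping entries are rewritten.
moved_slots = [s for s in before if before[s] != after[s]]
```

Slot 2, for instance, is served by `node-a` before the change and `node-c` after it, yet both reads return the same stored value. In a design where each node owns its data locally, the same change would require copying the contents of every moved slot.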

Low-cost mass persistent storage

After logical and physical compression, the full dataset is persisted in the shared DFV storage pool, so no data is lost on a crash, and the overall cost per GB is less than one-tenth that of open source Redis. In practice, DFV capacity can be expanded at any time according to business needs, so the storage limits of open source Redis do not apply.

Automated monitoring, operation and maintenance, and other advantages

Gauss Redis comes with a comprehensive monitoring system that visualizes key performance indicators such as request latency and supports automatic removal of faulty nodes, smooth migration, automatic alarms, and automatic recovery. In addition, Gauss Redis uses a hash strategy to balance data, which avoids hot spots better than HBase, and it has no Full GC problem.

5. Conclusion

Gauss Redis is compatible with the Redis 5.0 protocol, combines the advantages of open source Redis and HBase, and builds on Huawei's self-developed DFV storage to avoid the weaknesses of both in typical scenarios, providing a database service with lower cost, better performance, and more flexibility.

6. Appendix

Author of this article: Huawei Cloud Gauss Redis team.

Send resumes for positions in Hangzhou, Xi’an, and Shenzhen to: yuwenlong4@huawei.com

For more technical articles, please follow the official blog of Gauss Redis:
https://bbs.huaweicloud.com/community/usersnew/id_1614151726110813

The official homepage of Gauss Redis:
https://www.huaweicloud.com/product/gaussdbforredis.html
