
1. HBase basics

HBase is a distributed, scalable NoSQL database that supports massive data storage, built on the Hadoop file system. HBase is an open-source Java implementation of Google's BigTable: a database system on top of HDFS that provides highly reliable, high-performance, column-oriented, scalable, real-time read/write NoSQL storage.

It sits between NoSQL and RDBMS: data can only be retrieved by the primary key (rowKey) or by a rowKey range, and only single-row transactions are supported (complex operations such as multi-table joins can be implemented with the help of Hive). It is mainly used to store loosely structured and semi-structured data. HBase's query functionality is very simple: it does not support complex operations such as joins, and it supports only simple, row-level transactions.

HBase does not restrict the types of data stored, allows a dynamic and flexible data model, does not use the SQL language, and does not emphasize relationships between data items. HBase is designed to run on a server cluster and scales horizontally accordingly. HBase can be used to build large-scale structured storage clusters on inexpensive PC servers.

HBase's data model is similar to Google's Bigtable design and provides fast, random access to massive amounts of structured data. It takes advantage of the fault tolerance provided by the Hadoop Distributed File System (HDFS) and provides random, real-time read/write access to data on top of it. Data can be stored in HDFS directly or through HBase, and HBase can be used to read and randomly access data that lives in HDFS. HBase uses Hadoop's MapReduce to process massive amounts of data. For coordination services, Google Bigtable relies on Chubby; HBase's counterpart is Zookeeper.

Tables in HBase generally have these characteristics:

  • Large: a table can have billions of rows and millions of columns.
  • Column-oriented: storage and permission control are organized by column (family), and column (family) data is retrieved independently.
  • Sparse: null columns take up no storage space, so tables can be designed to be very sparse.

2. Comparison of HDFS, Hive and HBase

1. Comparison of HDFS and HBase

HDFS

  • Provides a distributed file system for storage
  • Optimized for storing large files; not designed for random reads and writes of files on HDFS
  • Data is used directly as files
  • The data model is not flexible
  • Used through the file system and processing frameworks
  • Optimized for write-once, read-many access

HBase

  • Provides tabular, column-oriented data storage
  • Optimized for random reads and writes of tabular data
  • Data is manipulated as key-value pairs
  • Provides a flexible data model
  • Uses table storage, supports MapReduce, relies on HDFS
  • Optimized for repeated random reads and writes

2. Comparison of Hive and HBase

Hive

(1) Data warehouse

The essence of Hive is to maintain a mapping (in its metastore, typically MySQL) between files already stored in HDFS and table definitions, so that HQL can be used to manage and query them.

(2) For data analysis and cleaning

Hive is suitable for offline data analysis and cleaning, with high latency.

(3) Based on HDFS, MapReduce

The data stored by Hive still lives on DataNodes, and the HQL statements you write are ultimately converted into MapReduce code for execution.

HBase

(1) Database

A non-relational database with column-family-oriented storage.

(2) Used to store structured and unstructured data

It is suitable for storing non-relational data in single tables and is not suitable for relational queries such as JOINs.

(3) Based on HDFS

Data is persisted in the form of HFiles, which are stored on DataNodes and managed by RegionServers as regions.

(4) Low latency, suitable for online business access

Faced with large amounts of enterprise data, HBase can store massive amounts of data in a single table while providing efficient random access.

3. HBase's capabilities in commercial projects

Per day:
(1) Message volume: more than 6 billion messages sent and received
(2) Nearly 100 billion data read and write operations
(3) Around 1.5 million operations per second at peak
(4) Reads account for about 55% of operations, writes for about 45%
(5) More than 2 PB of data, about 6 PB including replicas
(6) Data grows by approximately 300 gigabytes per month.

4. HBase system architecture

(figure: HBase system architecture)

Architecture roles:

(1)Region Server

Region Server manages regions; its implementation class is HRegionServer. Its main responsibilities are:

  • Storing HBase's actual data
  • Handling the regions assigned to it
  • Flushing the cache (MemStore) to HDFS
  • Maintaining the HLog (WAL)
  • Performing compactions
  • Handling region splits

(2)Master

Master is the manager of all Region Servers; its implementation class is HMaster. Its main responsibilities are:

  • Monitoring the RegionServers
  • Handling RegionServer failover
  • Handling metadata changes
  • Handling the assignment and removal of regions
  • Load balancing regions during idle time
  • Publishing its own location to clients through Zookeeper

(3)Zookeeper

HBase uses Zookeeper for Master high availability, RegionServer monitoring, storing the entry point to metadata (the location of hbase:meta), and maintaining cluster configuration.


(4)HDFS

HDFS provides the underlying data storage service for HBase and, at the same time, high-availability support.

(5)Client

Contains the interfaces for accessing HBase and maintains a cache to speed up access to HBase.

HRegionServer components

1. Write-Ahead Log (WAL)

HBase's modification log. Since data must be sorted in the MemStore (in memory) before being flushed to an HFile, data kept only in memory would have a high probability of being lost on a crash. To solve this problem, data is first written to a file called the Write-Ahead logfile and only then written to memory, so that after a system failure the data can be reconstructed from this log file.
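
To make the flow concrete, here is a minimal Java client sketch (HBase 1.x-style API; the table and column names are only illustrative, and imports of the standard client classes and error handling are omitted) showing that every mutation goes through the WAL before the MemStore, and that WAL durability can be tuned per Put:

Configuration conf = HBaseConfiguration.create();
try (Connection conn = ConnectionFactory.createConnection(conf);
     Table table = conn.getTable(TableName.valueOf("student"))) {
    Put put = new Put(Bytes.toBytes("1001"));
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Nick"));
    // Ask the RegionServer to sync the WAL append to disk before acknowledging the write
    put.setDurability(Durability.SYNC_WAL);
    table.put(put);
}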

2. HFile

This is the actual physical file that saves the original data on the disk, and is the actual storage file.

3. StoreFile

A Store corresponds to one column family of an HBase table, and StoreFiles are the physical files that hold its actual data; a StoreFile is stored on HDFS in the HFile format. Each Store has one or more StoreFiles (HFiles), and data is ordered within each StoreFile.

4. MemStore

The write cache. Because data in an HFile must be ordered, data is first stored in the MemStore and sorted there; when the flush condition is reached, it is flushed to an HFile, and each flush produces a new HFile.

5. Region

A region is a shard of an HBase table: the table is split into different regions by rowKey range, and the regions are stored on RegionServers. A single RegionServer can host multiple different regions.

5. HBase shell operations

1. Basic operation

1. Enter the HBase client command line
bin/hbase shell
2. View help commands
hbase(main):001:0> help
3. View which tables are in the current database
hbase(main):002:0> list

2. Table operation

1. Create table
hbase(main):002:0> create 'student','info'
2. Insert data into the table
hbase(main):003:0> put 'student','1001','info:sex','male'
hbase(main):004:0> put 'student','1001','info:age','18'
hbase(main):005:0> put 'student','1002','info:name','Janna'
hbase(main):006:0> put 'student','1002','info:sex','female'
hbase(main):007:0> put 'student','1002','info:age','20'
3. Scan to view table data
hbase(main):008:0> scan 'student'
hbase(main):009:0> scan 'student',{STARTROW => '1001', STOPROW => '1001'}
hbase(main):010:0> scan 'student',{STARTROW => '1001'}
4. View table structure
hbase(main):011:0> describe 'student'
5. Update the data of the specified field
hbase(main):012:0> put 'student','1001','info:name','Nick'
hbase(main):013:0> put 'student','1001','info:age','100'
6. View the data of "specified row" or "specified column family: column"
hbase(main):014:0> get 'student','1001'
hbase(main):015:0> get 'student','1001','info:name'
7. Statistics table data rows
hbase(main):021:0> count 'student'
8. delete data
Delete all data of a rowkey:
hbase(main):016:0> deleteall 'student','1001'
Delete a column of data of a rowkey:
hbase(main):017:0> delete 'student','1002','info:sex'
9. Clear table data
hbase(main):018:0> truncate 'student'
Tip: The order of operations for emptying the table is disable first, then truncate.
10. Delete table
First, you need to make the table in the disabled state:
hbase(main):019:0> disable 'student'
Then we can drop this table:
hbase(main):020:0> drop 'student'
Tip: If you drop the table directly, an error will be reported: ERROR: Table student is enabled. Disable it first.
11. Change table information
Store the data in the info column family in 3 versions:
hbase(main):022:0> alter 'student',{NAME=>'info',VERSIONS=>3}
hbase(main):022:0> get 'student','1001',{COLUMN=>'info:name',VERSIONS=>3}

6. HBase usage scenarios

First of all, HBase is based on HDFS for storage.

HDFS

  1. Write once, read multiple times.
  2. Ensure data consistency.
  3. Mainly deployed on many cheap machines; reliability is improved through multiple replicas, which provide fault tolerance and recovery mechanisms.

HBase

  1. Scenarios with large bursts of writes that a traditional database cannot support well, or can only support at high cost.
  2. Data that needs to be kept for a long time and whose volume keeps growing.
  3. HBase is not suitable for data models that need joins, multi-level indexes, or complex table relationships.
  4. Large data volumes with a need for fast random access, e.g. Taobao transaction history: the data volume is undoubtedly huge, and requests from ordinary users must be answered immediately.
  5. Simple business scenarios that do not need many of the features of relational databases (cross-column/cross-table operations, transactions, joins, etc.).

7. HBase table data model

HBase logical table structure

HBase physical storage structure

(1)Name Space

Namespace: similar to the database concept in a relational database; each namespace contains multiple tables. HBase has two built-in namespaces, hbase and default: hbase holds HBase's internal system tables, and default is the namespace used when the user does not specify one.
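
As an illustration, a namespace and a table inside it can be created from the Java client roughly like this (a hedged sketch using the same legacy admin API as the pre-partitioning example later in this article; the namespace and table names are made up, and imports and error handling are omitted):

try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
     Admin admin = conn.getAdmin()) {
    // Create a namespace; tables created without a prefix go into the "default" namespace
    admin.createNamespace(NamespaceDescriptor.create("bigdata").build());
    // Create table "student" inside the "bigdata" namespace with one column family "info"
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("bigdata:student"));
    desc.addFamily(new HColumnDescriptor("info"));
    admin.createTable(desc);
}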

(2)Region

Loosely analogous to the table concept in a relational database (a newly created table initially consists of a single region). The difference is that HBase only requires column families to be declared when defining a table, not specific columns; this means that when writing data to HBase, fields can be specified dynamically, on demand. Compared with relational databases, HBase can therefore easily cope with field changes.

(3)rowKey

Each row of data in an HBase table consists of one RowKey and multiple Columns. Data is stored in the lexicographic order of the RowKey, and data can only be retrieved by RowKey when querying, so RowKey design is very important. As in other NoSQL databases, the rowKey is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:

  • Access through a single rowKey
  • Range by rowKey
  • Full table scan

A row key can be any string (the maximum length is 64 KB; in practice it is usually 10-100 bytes); inside HBase the rowKey is stored as a byte array. Because HBase stores a table's rows in rowKey order (lexicographic, i.e. byte order), the rowKey should be designed to take full advantage of this sorted storage property: rows that are frequently read together should be stored together (locality).
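
The three access patterns map directly onto the Java client API. The sketch below (HBase 1.x-style calls against the 'student' table from the shell examples; imports and error handling omitted) is only meant to show the shape of each call:

try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
     Table table = conn.getTable(TableName.valueOf("student"))) {
    // 1. Point lookup by a single rowKey
    Result one = table.get(new Get(Bytes.toBytes("1001")));
    // 2. Range scan over [startRow, stopRow)
    Scan range = new Scan();
    range.setStartRow(Bytes.toBytes("1001"));
    range.setStopRow(Bytes.toBytes("1003"));
    try (ResultScanner scanner = table.getScanner(range)) {
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
        }
    }
    // 3. Full table scan: new Scan() with no start/stop row -- expensive on large tables
}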

(4) Column Family

Each column in an HBase table belongs to a column family. The column family is part of the table's schema (while the column is not) and must be defined before the table is used. Column names are prefixed with their column family; for example, courses:history and courses:math both belong to the courses family. Access control and disk and memory usage statistics are all performed at the column family level. The more column families there are, the more files are involved in IO and lookups when fetching a row of data, so do not create more column families than necessary.

(5)Column

Each column in HBase is qualified by Column Family and Column Qualifier, such as info:name, info:age. When building a table, you only need to specify the column family, and the column qualifiers do not need to be defined in advance.

(6)Time Stamp

It is used to identify different versions of data. When each piece of data is written, if a timestamp is not specified, the system will automatically add this field to it, and its value is the time when it was written to HBase. A storage unit determined by row and columns in HBase is called a cell. Each cell stores multiple versions of the same data. Versions are indexed by timestamp. The type of the timestamp is a 64-bit integer. The timestamp can be assigned by hbase (automatically when data is written). At this time, the timestamp is the current system time accurate to milliseconds. The timestamp can also be explicitly assigned by the client. If the application wants to avoid data version conflicts, it must generate a unique timestamp by itself. In each cell, the data of different versions are sorted in reverse chronological order, that is, the latest data is ranked first.

In order to avoid the burden of management (including storage and indexing) caused by too many versions of data, hbase provides two data version recovery methods:

  • Save the last n versions of the data
  • Save the version in the most recent period of time (set the life cycle TTL of the data).

The user can make settings for each column family.
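
For example, version retention and TTL are set per column family through the (legacy) HColumnDescriptor API; the values below are purely illustrative:

HColumnDescriptor info = new HColumnDescriptor("info");
info.setMaxVersions(3);     // keep at most the 3 latest versions of each cell
info.setTimeToLive(86400);  // TTL in seconds: cells older than one day become eligible for cleanup
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("student"));
desc.addFamily(info);       // create (or alter) the table with this column family definition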

(7)Cell

The cell uniquely determined by {rowkey, column family: column qualifier, timestamp}. Data in a cell has no type and is stored as raw bytes.

8. HBase read and write processes

Read process


(1) The client first accesses Zookeeper to find out which Region Server hosts the hbase:meta table.
(2) It accesses that Region Server and reads hbase:meta, using the namespace:table/rowkey of the read request to determine which Region on which Region Server holds the target data. The region information of the table and the location of the meta table are cached in the client's meta cache to speed up subsequent accesses.
(3) The client communicates with the target Region Server.
(4) The target data is looked up in the Block Cache (read cache), the MemStore and the StoreFiles (HFiles), and all results found are merged. "All the data" here means the different versions (timestamps) or different types (Put/Delete) of the same cell.
(5) Data blocks read from files (a Block is the unit of HFile storage, 64 KB by default) are cached in the Block Cache.
(6) The merged final result is returned to the client.

Write process


(1) The client first accesses Zookeeper to find out which Region Server hosts the hbase:meta table.
(2) It accesses that Region Server and reads hbase:meta, using the namespace:table/rowkey of the write request to determine which Region on which Region Server the data should go to. The region information of the table and the location of the meta table are cached in the client's meta cache to speed up subsequent accesses.
(3) The client communicates with the target Region Server;
(4) The data is appended sequentially to the WAL;
(5) The data is written to the corresponding MemStore, where it is kept sorted;
(6) An ack is sent to the client;
(7) When the MemStore flush condition is reached, the data is flushed to an HFile.

9. HRegionServer downtime handling

0. Overview

Because machine configurations are often not very good, and because of network, disk and other failures, the probability of a machine going down is relatively high. As the actual worker node of the HBase cluster, a RegionServer will inevitably go down at some point.

Downtime itself is not terrible, because no data is lost. The crash of a RegionServer in an HBase cluster (in fact, the failure of the RegionServer process) does not lose data that has already been written. Like MySQL and other databases, HBase guarantees this with the WAL mechanism: it writes the HLog first, then the write cache (MemStore), and flushes to disk when the cache is full. Even if an unexpected crash leaves a lot of cached data not yet flushed to disk, the data can still be recovered from the HLog.

However, no data loss does not mean the outage has no impact on the business side. As is well known, a RegionServer crash is first detected by Zookeeper, and it takes Zookeeper some time to notice it. During this period, all reads and writes are still routed to that server as usual, and those reads and writes inevitably fail.

1. Processing flow

(1) When a RegionServer goes down, the ephemeral node it registered under /hbase/rs in Zookeeper goes offline, and Zookeeper notifies the Master as soon as possible to perform failover.

(2) The Master first moves all regions on this RegionServer to other RegionServers, then splits the HLog and distributes it to other RegionServers for replay.

(3) The routing is updated again, and the business side's reads and writes return to normal.

2. Practice

(1) Check the RegionServer log

(2) Check system monitoring

10. Pre-partitioning

0. Definition

Each region maintains a StartRow and an EndRow; if newly written data falls within the rowKey range maintained by a region, the data is handed to that region. Pre-partitioning means planning in advance the regions that new data will land in, in order to improve HBase performance and avoid hotspot problems. The planned number of partitions is usually based on the expected data volume and machine scale over the next six months to a year.

1. Advantages of pre-partitioning

  • Improves data read and write efficiency
  • Balances load, preventing data skew / hotspot problems
  • Facilitates disaster-recovery scheduling of regions across the cluster
  • Optimizes the number of map tasks

2. Several ways of pre-partitioning

(1) Manually set the pre-partition

create 'staff1','info','partition1',SPLITS => ['1000','2000','3000','4000']

(2) Generate hexadecimal sequence pre-partition

create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

(3) Pre-partition according to rules defined in a file

Create splits.txt file content as follows:
aaaa
bbbb
cccc
dddd

Then execute:
create 'staff3','partition3',SPLITS_FILE => 'splits.txt'

(4) Use JavaAPI to create pre-partition

// Custom algorithm: generate a series of split keys and store them in a 2-D byte array
// (fixed values here; in practice they are often produced by a hash function)
byte[][] splitKeys = new byte[][] {
    Bytes.toBytes("1000"), Bytes.toBytes("2000"), Bytes.toBytes("3000")
};
// Create an HBaseAdmin instance (legacy API)
HBaseAdmin hAdmin = new HBaseAdmin(HBaseConfiguration.create());
// Create an HTableDescriptor instance and add a column family
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf(tableName));
tableDesc.addFamily(new HColumnDescriptor("info"));
// Create the pre-partitioned HBase table from the descriptor and the split keys
hAdmin.createTable(tableDesc, splitKeys);

11. Load balancing of HRegion

HBase uses the RowKey to split a table horizontally into multiple HRegions. Each HRegion records its StartKey and EndKey, so the client can quickly determine, via the HMaster, which HRegion a given RowKey belongs to. HRegions are assigned by the HMaster to HRegionServers; when an HRegion grows too large it is split into two new HRegions, which initially stay on the same HRegionServer as the parent HRegion. For load-balancing reasons, the HMaster may reassign one or even both of them to other HRegionServers, which means some HRegionServers serve data whose files still reside on other nodes, until the next Major Compaction moves that data onto the local node. This is HRegion load balancing.

12. RowKey design

HBase stores data ordered along three dimensions: the rowkey (row key), the column key (column family and qualifier), and the TimeStamp. Together, these three dimensions can quickly locate data in HBase.

The rowkey in HBase can uniquely identify a row of records. There are several ways to query HBase:

  • Through the get method, specify the rowkey to obtain the only record;
  • By scan mode, set the startRow and stopRow parameters to match the range;
  • Full table scan, that is, directly scan all rows in the entire table.

HBase rowKey design principles

1. Length principle

A rowkey is a binary byte stream and can be any string, with a maximum length of 64 KB; in practical applications it is usually 10-100 bytes. It is stored as a byte[] and is generally designed with a fixed length. The shorter the better, ideally no more than 16 bytes, for the following reasons:

  • The persisted HFile stores data as KeyValues. If the rowkey is too long, say more than 100 bytes, then for 10 million rows the rowkeys alone occupy 100 × 10,000,000 = 1 billion bytes, nearly 1 GB of data, which greatly hurts HFile storage efficiency.
  • The MemStore caches part of the data in memory; an overly long rowkey reduces effective memory utilization, so the system can cache less data, which lowers retrieval efficiency.

2. Hashing principle

If rowkeys increase monotonically with a timestamp, do not put the time at the front of the key. It is recommended to use the high-order part of the rowKey as a hash field, generated by the program, and to put the time field in the low-order part; this improves the even distribution of data across the RegionServers and thus the chance of balanced load.
If there is no hash field and the first field is the time, all new data is concentrated on one RegionServer, and during retrieval the load also concentrates on individual RegionServers, causing hotspot problems and reducing query efficiency.

3. The uniqueness principle

The rowkey must be designed to guarantee uniqueness. Rowkeys are stored sorted in lexicographic order, so the design should make full use of this sorting characteristic: store frequently read data together, and put data that is likely to be accessed soon in the same place.

Other design suggestions:

  • Minimize the size of row keys and column families. In HBase, the value is always transmitted with its key. When a specific value is transferred between systems, its rowkey, column name, and timestamp will also be transferred together. If your rowkey and column names are large, they will take up a lot of storage space at this time.
  • The column family should be as short as possible, preferably one character.
  • Longer attribute names are more readable, but shorter attribute names are better stored in HBase.

13. Causes and solutions of hot spots/data skew in HBase

1. What is a hotspot

Rows in HBase are sorted by the lexicographic order of the rowkey. This design optimizes scans: related rows, and rows that will be read together, are stored in adjacent locations and are easy to scan. However, poor rowkey design is the source of hotspots. A hotspot occurs when a large number of clients directly access one or a few nodes of the cluster (reads, writes or other operations). The heavy traffic can push the single machine hosting the hot region beyond its capacity, degrading performance or even making the region unavailable, and it also affects the other regions on the same RegionServer, because that host can no longer serve their requests. A good data access model should therefore be designed so that the cluster is utilized fully and evenly. To avoid write hotspots, the rowkey should be designed so that data is written to multiple regions of the cluster rather than concentrated in a single one.

2. Common ways to avoid hot spots and their advantages and disadvantages

(1) Salting

The salting mentioned here is not the salting used in cryptography; it means adding a random prefix in front of the rowkey so that it sorts differently from what the rowkey would otherwise start with. The number of distinct prefixes should match the number of regions you want the data spread across. After salting, rowkeys are scattered across the regions according to the randomly generated prefix, avoiding hotspots.

(2) Hash

With hashing, the same row always gets the same prefix. Hashing can also spread the load across the whole cluster, while keeping reads predictable: because the hash is deterministic, the client can reconstruct the complete rowkey and use a get operation to fetch a specific row precisely.
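
A hedged sketch of the hash-prefix idea follows (the bucket count, key format and the choice of MD5 are assumptions, not a prescribed scheme): both the writer and the reader derive the same prefix from the business key, so data spreads across buckets/regions while a point get() still works.

import java.security.MessageDigest;

// Returns a rowkey like "07_user123": a stable 2-digit bucket prefix plus the original key.
public static String hashedRowKey(String userId, int buckets) throws Exception {
    byte[] md5 = MessageDigest.getInstance("MD5").digest(userId.getBytes("UTF-8"));
    int bucket = (md5[0] & 0xFF) % buckets;   // the same userId always maps to the same bucket
    return String.format("%02d_%s", bucket, userId);
}

// The writer and the reader both call hashedRowKey("user123", 16), unlike pure random
// salting, where the prefix cannot be recomputed and point reads become impossible.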

(3) Reverse

The third way to prevent hotspots is to reverse a rowkey of fixed length or numeric format, so that the frequently changing part of the rowkey (the least significant part) comes first. This effectively randomizes rowkeys, but at the cost of their ordering. A typical example is using a mobile phone number as the rowkey: storing the reversed number avoids the hotspot problem caused by numbers sharing a fixed prefix.

(4) Timestamp inversion

A common data-processing requirement is to quickly fetch the latest version of a piece of data, and using a reversed timestamp as part of the rowkey is very useful here. Append Long.MAX_VALUE - timestamp to the end of the key, as in [key][reverse_timestamp]. The latest value of [key] can then be obtained by scanning for [key] and taking the first record, because rowkeys in HBase are sorted and the first match is the most recently written data.
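
A small sketch of this pattern (the key layout and names are assumptions; Bytes is the HBase utility class used elsewhere in this article):

long ts = System.currentTimeMillis();
// rowkey = [business key] + [Long.MAX_VALUE - timestamp], so the newest record sorts first
byte[] rowkey = Bytes.add(Bytes.toBytes("order_1001_"),
                          Bytes.toBytes(Long.MAX_VALUE - ts));
// Reading the latest version: scan with setRowPrefixFilter(Bytes.toBytes("order_1001_"))
// and take the first Result returned.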

14. HBase coprocessors

0. Background

The most frequently criticized shortcomings of HBase as a column-family database include the difficulty of building "secondary indexes" and of performing operations such as sums, counts and sorting. For simple additions or aggregations, placing the computation directly on the server side reduces communication overhead and yields a good performance improvement. HBase therefore introduced coprocessors starting with version 0.92, enabling new features: easier creation of secondary indexes, complex filters (predicate pushdown), and access control.

1. Two coprocessors: observer and endpoint

1.1 Observer

Observer is similar to a trigger in a traditional database: this type of coprocessor is invoked by the server when certain events occur. Observer coprocessors are hooks scattered throughout the HBase server code that are called when the corresponding events happen. For example, the prePut hook is called by the Region Server before a put operation is executed, and the postPut hook is called after it.

Taking HBase 0.92 as an example, it provides three Observer interfaces:

  • RegionObserver: Provides client data manipulation event hooks: Get, Put, Delete, Scan, etc.
  • WALObserver: Provides WAL related operation hooks.
  • MasterObserver: Provides DDL-type operation hooks. Such as creating, deleting, and modifying data tables.

The following steps, using RegionObserver as an example, illustrate how an Observer coprocessor works:

(1) The client sends a put request

(2) The request is dispatched to the appropriate RegionServer and region

(3) coprocessorHost intercepts the request, and then calls prePut() on each RegionObserver registered on the table

(4) If it is not intercepted by prePut(), the request continues to be sent to the region, and then processed

(5) The result generated by the region is intercepted by CoprocessorHost again, and postPut() is called

(6) If the response is not intercepted by postPut(), the final result is returned to the client
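
A minimal Observer sketch, assuming the HBase 1.x coprocessor API (in 2.x the base classes and hook signatures changed); the class name and log message are hypothetical and only show where custom logic would hook into the flow above:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class AuditObserver extends BaseRegionObserver {
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> c,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        // Called before the put is applied; inspect or modify it here.
        // c.bypass() would skip the normal put entirely.
        System.out.println("prePut on row " + Bytes.toStringBinary(put.getRow()));
    }
}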

1.2 Endpoint

Endpoint coprocessors are similar to stored procedures in traditional databases. Clients can call these Endpoint coprocessors to execute a piece of server-side code and return the results of the server-side code to the client for further processing. The most common usage is to perform aggregation operations. If there is no coprocessor, when users need to find the maximum data in a table, that is, the max aggregation operation, they must perform a full table scan, traverse the scan results in the client code, and perform the maximum value operation. Such a method cannot take advantage of the concurrency capabilities of the underlying cluster, and centralized execution of all calculations on the client side is bound to be inefficient. With Coprocessor, users can deploy the code for finding the maximum value to the HBase Server side, and HBase will use multiple nodes of the underlying cluster to concurrently execute the operation for finding the maximum value. That is, execute the code to find the maximum value in each Region, calculate the maximum value of each Region on the Region Server side, and only return the max value to the client. The client further processes the maximum value of multiple Regions to find the maximum value. In this way, the overall execution efficiency will be improved a lot.

(figure: how an Endpoint coprocessor works)

1.3 Comparison

  • Observer is similar to a trigger in an RDBMS and works mainly on the server side; Endpoint is similar to a stored procedure in an RDBMS and also works mainly on the server side.
  • Observer allows the cluster to behave differently during normal client operations; Endpoint extends the cluster's capabilities and exposes new computing commands to client applications.
  • Observer can implement features such as access control, priority setting, monitoring, DDL control and secondary indexes; Endpoint can implement functions such as min, max, avg, sum, distinct and group by.

2. Coprocessor loading mode

There are two ways to load a coprocessor: static loading and dynamic loading.

A statically loaded coprocessor is called a System Coprocessor.
A dynamically loaded coprocessor is called a Table Coprocessor.

15. Secondary index

Why HBase needs secondary indexes

HBase's native query capability is relatively weak. Complex statistical requirements such as

select name,salary,count(1),max(salary) from user group by name,salary order by salary

are basically impossible, or at least difficult, to implement directly, so when using HBase we generally rely on a secondary index scheme. HBase's primary index is the rowKey, and data can only be retrieved by rowkey; retrieval and queries on non-rowkey fields are usually done through distributed computing frameworks such as MapReduce/Spark, with relatively high hardware resource consumption and latency.

To make HBase queries more efficient and suitable for more scenarios (for example, second-level responses when retrieving by non-rowkey fields, or fuzzy queries and multi-field combined queries), it is necessary to build secondary indexes on top of HBase to meet more complex and diverse business needs.

HBase secondary index scheme:

1. Based on the Coprocessor solution

The general idea of Coprocessor-based secondary indexes is to build an "index" mapping relationship and store it in another HBase table or some other DB. The better-known open-source Coprocessor-based solutions in the industry are:

  • Huawei's hindex: Based on version 0.94, it was relatively popular when it first came out that year, but the version is older, and the GitHub project address has not been updated in recent years.
  • Apache Phoenix : The function revolves around SQL on HBase, which supports and is compatible with multiple HBase versions. The secondary index is only one of the functions. The creation and management of secondary indexes are directly supported by SQL syntax, which is very easy to use. The current community activity and version update iterations of the project are relatively good.

Apache Phoenix is the better choice among the current open-source solutions. It is essentially SQL on HBase: HBase CRUD operations can be performed through SQL, and the JDBC protocol is supported. In the Hadoop ecosystem, Phoenix sits directly on top of HBase.

Phoenix secondary index features:

  • Covered Indexes: attach the data columns of interest to the index table so that a query can be answered from the index table alone; the index must therefore contain all the columns required by the query (both the SELECT columns and the WHERE columns).
  • Functional indexes: Indexes are not limited to columns and support arbitrary expressions to create indexes.
  • Global indexes: suitable for read-heavy, write-light scenarios. A global index table is maintained, so every update and write also updates the index, which affects write performance; when reading, Phoenix SQL uses the indexed fields for fast queries.
  • Local indexes: suitable for write-heavy, read-light scenarios. Index data is stored locally together with the table data when writing; when reading, since the region holding the index cannot be determined in advance, every region has to be checked to find the index data, which adds some (network) overhead.

Pros and cons of the Coprocessor-based solution:

Advantages: with the Coprocessor-based approach, from a development and design point of view, many details of managing the secondary index are encapsulated in the concrete Coprocessor implementation classes. These details are invisible to external readers and writers, which simplifies things for data accessors.

Disadvantages: the Coprocessor approach is more intrusive. Code that maintains the secondary-index relation table has to run inside the RegionServer, which has a certain impact on RegionServer performance.
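
To make the idea concrete, here is a rough, hedged sketch of the "index table maintained inside a coprocessor" approach described above, again assuming the HBase 1.x API; the table name user_name_index, the indexed column info:name and the index layout are all invented for illustration:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class NameIndexObserver extends BaseRegionObserver {
    private static final TableName INDEX_TABLE = TableName.valueOf("user_name_index");

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> c,
                        Put put, WALEdit edit, Durability durability) throws IOException {
        // After a write to the data table, mirror the indexed column into the index table:
        // index rowkey = column value, index cell value = rowkey of the data row.
        List<Cell> cells = put.get(Bytes.toBytes("info"), Bytes.toBytes("name"));
        if (cells.isEmpty()) {
            return;
        }
        Table index = c.getEnvironment().getTable(INDEX_TABLE);
        try {
            Put indexPut = new Put(CellUtil.cloneValue(cells.get(0)));
            indexPut.addColumn(Bytes.toBytes("f"), Bytes.toBytes("row"), put.getRow());
            index.put(indexPut);
        } finally {
            index.close();
        }
    }
}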

2. Non-Coprocessor solution

Choosing not to develop based on Coprocessor, and to build and maintain the index relationship externally is another way.

It is common to use Elasticsearch (hereinafter referred to as ES) or Apache Solr based on Apache Lucene to build powerful indexing capabilities and search capabilities, such as supporting fuzzy queries, full-text retrieval, combined queries, and sorting.

16. Tuning

1. General optimization

(1) The metadata backup of NameNode uses SSD.

(2) Regularly back up the metadata on the NameNode, hourly or daily. If the data is extremely important, it can be backed up every 5-10 minutes. The backup can copy the metadata directory through a scheduled task.

(3) Specify multiple metadata directories for the NameNode, using dfs.name.dir or dfs.namenode.name.dir: one on a local disk and one on a network disk. This provides redundancy and robustness for the metadata and helps avoid failures.

(4) Set dfs.namenode.name.dir.restore to true to allow attempts to restore the previously failed dfs.namenode.name.dir directory. This is attempted when creating a checkpoint. If multiple disks are set, it is recommended to allow it.

(5) The NameNode node must be configured as a RAID1 (mirrored disk) structure.

(6) Keep enough space in the NameNode log directory. These logs will help you find problems.

(7) Because Hadoop is an IO-intensive framework, try to improve storage speed and throughput (similar to bit width).

2. Linux optimization

(1) Turning on the read-ahead cache of the file system (set readahead) can improve the read speed

$ sudo blockdev --setra 32768 /dev/sda

(2) Close the process sleep pool

$ sudo sysctl -w vm.swappiness=0

3. HDFS optimization (hdfs-site.xml)

(1) Ensure that RPC calls will have more threads

Property: dfs.namenode.handler.count
Explanation: This attribute is the default number of threads of the NameNode service. The default value is 10, which can be adjusted to 50~100 according to the available memory of the machine.

Property: dfs.datanode.handler.count
Explanation: The default value of this attribute is 10, i.e. the number of DataNode handler threads. If the HDFS client program issues many read and write requests, it can be raised to 15~20; the larger the value, the more memory is consumed, so do not set it too high. For ordinary workloads, 5~10 is enough.

(2) Adjustment of the number of copies

Property: dfs.replication
Explanation: If the amount of data is huge and not very important, it can be adjusted to 2~3, if the data is very important, it can be adjusted to 3~5.

(3) Adjustment of file block size

Property: dfs.blocksize
Explanation: The block size definition. This attribute should be set according to the size of a large number of single files stored. If a large number of single files are less than 100M, it is recommended to set the block size to 64M. For the case of greater than 100M or up to GB, it is recommended to set it to 256M, the general setting range fluctuates between 64M~256M.

4. MapReduce optimization (mapred-site.xml)

(1) Adjustment of the number of job service threads

mapreduce.jobtracker.handler.count
This attribute is the number of Job task threads, the default value is 10, and it can be adjusted to 50~100 according to the available memory of the machine.

(2)HTTP

Attribute: mapreduce.tasktracker.http.threads
Explanation: Define the number of HTTP server worker threads. The default value is 40. For large clusters, it can be adjusted to 80~100.

(3) File sorting and merging optimization

Attribute: mapreduce.task.io.sort.factor
Explanation: The number of data streams that are merged at the same time when sorting files. This also defines the number of open files at the same time. The default value is 10. If you increase this parameter, you can significantly reduce disk IO, that is, reduce the number of file reads.

(4) Set task concurrency

Property: mapreduce.map.speculative
Explanation: This attribute sets whether map tasks may be executed speculatively (redundant parallel attempts). If there are many small tasks, setting it to true can noticeably speed up task execution. For tasks with very high latency, however, it is recommended to set it to false; the idea is similar to how a Thunder (Xunlei) download fetches the same data from multiple sources.

(5) Compression of MR output data

Attributes: mapreduce.map.output.compress, mapreduce.output.fileoutputformat.compress
Explanation: For large clusters, it is recommended to set the output of Map-Reduce to compressed data, but for small clusters, it is not necessary.

(6) Optimize the number of Mapper and Reducer
Attributes: mapreduce.tasktracker.map.tasks.maximum, mapreduce.tasktracker.reduce.tasks.maximum
Explanation: These two attributes set how many Map and Reduce tasks a single TaskTracker can run at the same time. When setting them, consider the number of CPU cores, disks and the memory capacity. Suppose an 8-core CPU and a very CPU-intensive workload: the number of maps could be set to 4. If the workload is not particularly CPU-heavy, the number of maps could be set to 40 and the number of reduces to 20. After changing these values, observe whether tasks wait for a long time; if so, reduce the numbers to speed up execution. Values that are too large cause a lot of context switching and frequent data exchange between memory and disk; there is no standard configuration value, so choose based on your business, hardware and experience. Also, do not run too many MapReduce jobs at the same time, since that consumes too much memory and makes tasks very slow. Set a maximum number of concurrent MR tasks based on the number of CPU cores and the memory capacity, so that the data of a fixed number of tasks fits entirely in memory; this avoids frequent swapping between memory and disk, reducing disk IO and improving performance.

Approximate estimation formula:
map = 2 + (2/3) × cpu_core
reduce = 2 + (1/3) × cpu_core

5. HBase optimization

(1) Optimize the maximum number of open files allowed by the DataNode

Property: dfs.datanode.max.transfer.threads
File: hdfs-site.xml
Explanation: HBase typically operates on a large number of files at the same time; set this to 4096 or higher depending on cluster size and the amount of data activity. Default value: 4096.

(2) Optimize the latency of high-latency data operations

Property: dfs.image.transfer.timeout
File: hdfs-site.xml
Explanation: If a data operation has very high latency and the socket needs to wait longer, it is recommended to increase this value (the default is 60000 milliseconds) to ensure the socket is not timed out.

(3) Optimize data writing efficiency

Attributes: mapreduce.map.output.compress, mapreduce.map.output.compress.codec
File: mapred-site.xml
Explanation: Enabling these two settings can greatly improve file write efficiency and reduce write time. Set the first attribute to true, and set the second attribute to:
org.apache.hadoop.io.compress.GzipCodec

(4) Optimize DataNode storage

Property: dfs.datanode.failed.volumes.tolerated
File: hdfs-site.xml
Explanation: The default value is 0, which means that when one disk in a DataNode fails, the whole DataNode is considered shut down. If it is set to 1, then when one disk fails, its data is copied to other healthy DataNodes and the current DataNode continues to work.

(5) Set the number of RPC monitoring

Property: hbase.regionserver.handler.count
File: hbase-site.xml
Explanation: The default value is 30, which is used to specify the number of RPC monitoring, which can be adjusted according to the number of requests from the client. When there are many read and write requests, increase this value.

(6) Optimize HStore file size

Attribute: hbase.hregion.max.filesize
File: hbase-site.xml
Explanation: The default value is 10737418240 (10 GB). If you need to run HBase MR tasks, you can reduce this value, because one region corresponds to one map task; if a single region is too large, the map task runs too long. This value means that once an HFile reaches this size, the region is split in two.

(7) Optimize the hbase client cache

Property: hbase.client.write.buffer
File: hbase-site.xml
Explanation: Used to specify the HBase client cache. Increasing this value can reduce the number of RPC calls, but it will consume more memory, and vice versa. Generally, we need to set a certain cache size to achieve the purpose of reducing the number of RPCs.
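
The client-side write buffer that this property sizes is what a BufferedMutator uses; below is a hedged sketch (illustrative table, column and buffer-size values; imports and error handling omitted) of batching writes through it:

BufferedMutatorParams params = new BufferedMutatorParams(TableName.valueOf("student"))
        .writeBufferSize(4 * 1024 * 1024);   // 4 MB local buffer -- illustrative value
try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
     BufferedMutator mutator = conn.getBufferedMutator(params)) {
    for (int i = 0; i < 10000; i++) {
        Put put = new Put(Bytes.toBytes(String.format("%05d", i)));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("18"));
        mutator.mutate(put);                 // buffered locally, sent in batches (fewer RPCs)
    }
    mutator.flush();                         // push any remaining buffered mutations
}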

(8) Specify the number of rows obtained by scanning HBase by scan.next

Property: hbase.client.scanner.caching
File: hbase-site.xml
Explanation: Used to specify the default number of rows obtained by the scan.next method. The larger the value, the more memory will be consumed.

6. Memory optimization

Running HBase requires a lot of memory, since tables can be cached in memory; generally about 70% of the available memory is allocated to HBase's Java heap. However, a very large heap is not recommended, because long GC pauses would leave the RegionServer unavailable for extended periods; a heap of 16~48 GB is usually fine. If the framework occupies so much memory that the system runs short, the framework itself will also be dragged down by system services.

7. JVM optimization (hbase-env.sh)

(1) Parallel GC

Parameters: -XX:+UseParallelGC
Explanation: Turn on parallel GC.

(2) The number of threads simultaneously processing garbage collection

Parameters: -XX:ParallelGCThreads=cpu_core - 1
Explanation: This attribute sets the number of threads that process garbage collection at the same time.

(3) Disable manual GC

Parameters: -XX:+DisableExplicitGC
Explanation: Prevent developers from manually calling GC.

8. Zookeeper optimization

(1) Optimize the Zookeeper session timeout

Parameters: zookeeper.session.timeout
File: hbase-site.xml
Explanation: This value directly determines how long it takes the master to discover that a server is down. The default is 30 seconds. If the value is too small, then when HBase is writing large amounts of data and a long GC occurs, the RegionServer may fail to send its heartbeat to ZK in time and be considered dead, making it temporarily unavailable. In general, a cluster of about 20 nodes needs 5 Zookeeper nodes.

