
1. Background of the solution

At present, part of the business data is stored in HBase, and its volume is large, reaching billions of rows. The big data team needs to incrementally synchronize this business data to the data warehouse for offline analysis. So far, the main synchronization method has been HBase's Hive mapping table. This approach has the following pain points:

  • A full table scan of the HBase table is required, which puts pressure on the HBase cluster and makes synchronization slow.
  • After the business side changes the fields of the HBase table, the Hive mapping table has to be rebuilt, which complicates permission maintenance.
  • Changes to HBase table fields by the business side cannot be monitored effectively, so newly added fields are not noticed in time, which makes data warehouse maintenance harder.
  • The timestamp is not updated when the business side updates data, so incremental extraction based on the timestamp field misses data.
  • Because field updates and additions are not noticed in time, fields end up incomplete and data has to be backfilled.

Based on the above background, a general solution is proposed for the scenario of incrementally synchronizing HBase data to the data warehouse, addressing the pain points above.

2. Overview of the solution

2.1 Data warehousing construction process

2.2 Experimental comparison of HBase data warehousing schemes

The feasibility of the three implementation schemes above is analyzed in turn.

2.2.1 Scheme 1

Scheme 1 uses HBase's Hive mapping table. This scheme is simple to implement, but it does not fit the way a data warehouse should be built, mainly because:

  • Although HBase is a NoSQL database in the Hadoop ecosystem, here it serves as the business side's database. Reading it directly through a Hive mapping table is analogous to reading a view directly from the business side's MySQL: it may put pressure on the business database and even affect normal business operation, violating the principle that the data warehouse should affect business operation as little as possible.
  • The Hive mapping table approach increases the coupling with the business side, which violates the decoupling principle of data warehouse construction.

Therefore, this kind of scheme should not be adopted in this practical application scenario.

2.2.2 Scheme 2

Scheme 2 extracts incremental data according to a timestamp field in the business table. Since HBase is a NoSQL database keyed by rowKey, this has the following problems:

  • The entire table has to be scanned, and the day's increment is then filtered out by the timestamp field (updateTime), as sketched in the code after this list. When the data volume reaches tens of millions or even hundreds of millions of rows, execution is very inefficient and the job runs for a long time.
  • Unlike MySQL, HBase does not automatically update such a timestamp column when data is updated. If the business side fails to update the timestamp in time, incremental extraction will miss data.
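For illustration, here is a minimal sketch of what such a timestamp-filtered full scan could look like with the standard HBase Java client; the table name, column family, updateTime column and boundary value are hypothetical placeholders, not taken from the original article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class Scheme2TimestampScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("biz:order"))) { // hypothetical table

            // Keep only rows whose updateTime column is >= the start of the day.
            // The filter cannot skip data: every row is still read on the region
            // servers, which is why this scan is slow at scale.
            SingleColumnValueFilter filter = new SingleColumnValueFilter(
                    Bytes.toBytes("f"),                    // hypothetical column family
                    Bytes.toBytes("updateTime"),           // hypothetical timestamp column
                    CompareOperator.GREATER_OR_EQUAL,
                    Bytes.toBytes("2024-01-01 00:00:00")); // example boundary value
            filter.setFilterIfMissing(true); // rows without updateTime are dropped (and thus missed)

            Scan scan = new Scan();
            scan.setFilter(filter);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // process the "incremental" row
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```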

Therefore, there are certain risks in this scheme.

2.2.3 Scheme 3

Scheme 3 relies on HBase's timeRange feature (when HBase writes data, the cell timestamp is recorded using the server time): the incremental rowKeys are first filtered out by timeRange, and the corresponding full rows are then queried from HBase. This scheme solves the problems of scheme 1 and scheme 2 at the same time. It can also effectively monitor newly added fields in the business side's HBase tables, avoiding data loss caused by the business side failing to notify changes in time, and minimizing the frequency of data backfilling.

To sum up, scheme 3 is adopted as the solution for warehousing massive HBase data.

2.3 Scheme selection and implementation principle

Since the cell timestamp used by TimeRange is always updated when HBase data is written, specifying a TimeRange on the scan removes the need for a full table scan: the rowKeys of the changed rows can be obtained directly from the TimeRange, and the incremental data can then be fetched by rowKey, making incremental acquisition fast and efficient.
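As a minimal sketch of the rowKey collection step (table name and time window are hypothetical), the scan carries only the TimeRange plus key-only filters, so no cell values need to be transferred:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;

public class IncrementalRowKeyScan {

    /** Collect the rowKeys of all rows that received writes within [minStamp, maxStamp). */
    public static List<byte[]> scanRowKeys(Connection conn, String tableName,
                                           long minStamp, long maxStamp) throws IOException {
        List<byte[]> rowKeys = new ArrayList<>();
        Scan scan = new Scan();
        scan.setTimeRange(minStamp, maxStamp);      // only cells written in this server-time window
        scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL,
                new FirstKeyOnlyFilter(),           // one cell per row is enough
                new KeyOnlyFilter()));              // drop cell values, keep keys only
        try (Table table = conn.getTable(TableName.valueOf(tableName));
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
                rowKeys.add(result.getRow());
            }
        }
        return rowKeys;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            // e.g. yesterday 00:00 to today 00:00, in epoch milliseconds
            List<byte[]> keys = scanRowKeys(conn, "biz:order", 1704038400000L, 1704124800000L);
            System.out.println("incremental rowKeys: " + keys.size());
        }
    }
}
```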

Why follow the scan with a get? The data returned by the timeRange scan only contains the columns updated within that time range; the other fields of the rowKey cannot be obtained that way. For example, if a rowKey has two fields, name and age, and only age is updated within the specified time range, the scan returns only the age field and not name, so a get is needed to fetch the full row. At the same time, the columns of the incremental data are collected and compared with the Hive table's metadata, so that field changes can be alerted in time and the need for a full re-initialization caused by missing fields is reduced. The schematic diagram of the implementation is as follows:
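Beyond the schematic, the get-and-compare step might look roughly like the following sketch; the rowKey list, table name and the set of known Hive columns are hypothetical inputs, not details from the original article.

```java
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class IncrementalRowFetcher {

    /**
     * Fetch the full rows for the incremental rowKeys and return the set of
     * column qualifiers actually seen, for comparison with the Hive schema.
     */
    public static Set<String> fetchAndCollectColumns(Connection conn, String tableName,
                                                     List<byte[]> rowKeys) throws IOException {
        Set<String> seenColumns = new HashSet<>();
        List<Get> gets = new ArrayList<>();
        for (byte[] rowKey : rowKeys) {
            gets.add(new Get(rowKey));           // full row, all columns
        }
        try (Table table = conn.getTable(TableName.valueOf(tableName))) {
            Result[] rows = table.get(gets);     // batched gets
            for (Result row : rows) {
                for (Cell cell : row.rawCells()) {
                    seenColumns.add(Bytes.toString(CellUtil.cloneQualifier(cell)));
                    // each cell would also be written out here for loading into the warehouse
                }
            }
        }
        return seenColumns;
    }

    /** Warn about columns that exist in HBase but are missing from the Hive table metadata. */
    public static void alertNewColumns(Set<String> seenColumns, Set<String> hiveColumns) {
        Set<String> newColumns = new HashSet<>(seenColumns);
        newColumns.removeAll(hiveColumns);
        if (!newColumns.isEmpty()) {
            System.err.println("New HBase columns not present in Hive table: " + newColumns);
        }
    }
}
```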

3. Effect comparison

The running time comparison is as follows (unit: seconds):

4. Summary and Outlook

The data in the data warehouse comes from various business systems, and synchronizing that data efficiently and accurately is the foundation of data warehouse construction. This solution addresses the major pain points in the synchronization process, better guarantees the quality of the data entering the warehouse, and lays a solid foundation for subsequent data warehouse construction.

In addition, after many experimental comparisons and feasibility analyses of the various schemes, the data synchronization scheme has been integrated into the one-stop big data development platform, and the platform has been extended to support timeRange-based incremental synchronization as a platform-level, configurable capability, solving the pain points of warehousing massive HBase data.

Besides the above solution, one can also try using secondary indexes together with Phoenix and synchronizing to the data warehouse by querying the Phoenix table; its performance will be tested later.

Author: vivo Internet Big Data Team - Tang Xicheng
