split函数 - Distributed | How to quickly import data into dble through the split function of dble - MySQL分布式中间件DBLE

Author: Ma Yingle
A member of the R&D team of Aikesen, responsible for the testing of mysql middleware. I am a testing technology enthusiast, welcome to try the new features of dble~
Source of this article: original contribution
*The original content is produced by the open source community of Aikesheng, and the original content shall not be used without authorization. For reprinting, please contact the editor and indicate the source.

Introduction to the split function

When the old business needs to be transformed into a distributed business based on dble, it will face the problem of splitting and importing existing historical data. There are various import and export methods supported by dble. For details, please refer to document 3.11.1 . This time we introduce The split function can be understood as an import process accelerator, so how does it speed up?

You can consider such a scenario: a sql file (hereinafter referred to as "dump file") is dumped from a piece of raw data through the mysqldump tool. Under normal circumstances, the dump file will not be too small, and directly hold the dump file through dble It may take a while to complete the import from the business side of the server, and this process will definitely be slower than importing data directly into MySQL. In case of any errors during data import, it will be difficult to troubleshoot.

When I get the dump file, can I only import data by directly connecting to the dble business side to realize the split and import of historical data? Is there a possible way to first split the dump file into dump sub-files according to the configuration, and then we can take these split dump sub-files and directly import data to the corresponding back-end MySQL Woolen cloth?

So the split function of dble came into being. Here it is to do the work of dump file splitting. The general process of its work is to first process the dump files according to the shards according to the configuration of the sub-database and sub-tables. Of course, no matter what you configure in sharding.xml is shardingTable, globalTable, singleTable, or all of them, it will be reflected in the generated dump sub-files. We can directly import these dump sub-files into the corresponding shards. In the back-end MySQL, when the import operation of the back-end data is completed, it is only necessary to synchronize the metadata information of the dble, thus completing the split and import of the historical data.

After split processing, the generated dump subfile example:

However, when using mysqldump to export historical data sql files, you need to export in the following format, otherwise errors may occur, because some mysqldump parameters are not supported by dble.

 ./mysqldump -h127.0.0.1 -utest -P3306 -p111111 --default-character-set=utf8mb4 --master-data=2 --single-transaction --set-gtid-purged=off --hex-blob --databases schema1 --result-file=export.sql

For other notes, see document 3.11.2

Basic introduction

Syntax and Examples

 mysql > split src dest [-sschema] [-r500] [-w512] [-l10000] [--ignore] [-t2]

src：表示原始dump文件名
dest：表示生成的dump文件存放的目录
-s：表示默认逻辑数据库名，当dump文件中不包含schema的相关语句时，会默认导出到该schema。如：当dump文件中包含schema时，dump文件中的优先级高于-s指定的；若文件中的schema不在配置中，则使用-s指定的schema，若-s指定的schema也不在配置中，则返回报错
-r：表示设置读文件队列大小，默认500
-w：表示设置写文件队列大小，默认512，且必须为2的次幂
-l：表示split后一条insert中最多包含的values，只针对分片表,默认4000
--ignore：insert时，忽略已存在的数据
-t：表示多线程处理文件中insert语句时线程池的大小

Example of use:

 mysql > split /path-to-mysqldump-file/mysqldump.sql /tmp/dump-dir-name;
mysql > split /path-to-mysqldump-file/mysqldump.sql /tmp/dump-dir-name -sdatabase1;
mysql > split /path-to-mysqldump-file/mysqldump.sql /tmp/dump-dir-name -sdatabase1 -r600;
mysql > split /path-to-mysqldump-file/mysqldump.sql /tmp/dump-dir-name -sdatabase1 -r600 -w1024;
mysql > split /path-to-mysqldump-file/mysqldump.sql /tmp/dump-dir-name -sdatabase1 -r600 -w1024 -l10000;
mysql > split /path-to-mysqldump-file/mysqldump.sql /tmp/dump-dir-name -sdatabase1 -r600 -w1024 -l10000 --ignore;
mysql > split /path-to-mysqldump-file/mysqldump.sql /tmp/dump-dir-name -sdatabase1 -r600 -w1024 -l10000 --ignore -t4;

An example of split execution:

According to this example, you can view the dump subfile generated by split in the specified directory /tmp/split40k :

For more detailed usage instructions, please refer to the introduction section of the split command in the documentation.

split performance evaluation

After talking so much, how fast is it? Will the disassembled data be lacking?

We're going to do a test to see. In the same test environment, three groups of tests were prepared, as follows:

Control group 1: For the same dump file, without using dble, directly connect to MySQL to import the time-consuming of MySQL as a whole, and obtain the total number of rows in each table, which is used as a benchmark for whether there are problems with the data imported by other test groups.
Control group 2: In the same dump file, under the same environment, the time-consuming to directly connect to dble to import data, as well as the total number of rows in each table, the number of rows in each table on each shard and the checksum value
Experimental group: In the same dump file, under the same environment, the time-consuming of split processing + import, and the total number of rows in each table, the number of rows and checksum values in each table on each shard

Note: Because the data has been split, the dble business side does not support the syntax of checksum table , so it is difficult to compare the overall checksum value of each table in the original MySQL from the level of table checksum value. The total number of rows in each table in the three groups of tests, as well as the row number and checksum value of each table corresponding to each shard in the control group 2 and experimental group are compared.

test environment

Test using dble: single node dble, version 5.7.11-dble-3.22.01.0-e5d394e5994a004cd11b28dc5c171904769adad3-20220427091228
Data file source:

Use benchmarksql to create 1000 warehouse data as the data source (the table structure used in this test does not add foreign key relationships), the dump file obtained by mysqldump is about 75G

This experiment uses 10 shards to test, and since the data import time of each shard is proportional to the size of the data, the splitting algorithm for modulo is adopted, so that the data can be evenly distributed on each shard .

split command: The experimental group uses the split command /opt/splitTest/benchmarksql_with_data.sql /tmp/splittest -sbenchmarksql;
dble configuration:

In this experiment, 4 MySQL instances are used as the back-end MySQL, and 10 shards are uniformly used as the configuration of the control group 2 and the experimental group. The relevant configurations are as follows:

sharding.xml

 <?xml version="1.0"?>
<!DOCTYPE dble:sharding SYSTEM "sharding.dtd">
<dble:sharding xmlns:dble="http://dble.cloud/" version="4.0">

<schema name="benchmarksql">
        <globalTable name="bmsql_config" shardingNode="an$1-10" checkClass="CHECKSUM" cron="0 0 0 * * ?"></globalTable>
        <globalTable name="bmsql_item" shardingNode="an$1-10" checkClass="CHECKSUM" cron="0 0 0 * * ?"></globalTable>
        <shardingTable name="bmsql_warehouse" shardingNode="an$1-10" function="benchmarksql-mod" shardingColumn="w_id"></shardingTable>
        <shardingTable name="bmsql_district" shardingNode="an$1-10" function="benchmarksql-mod" shardingColumn="d_w_id"></shardingTable>
        <shardingTable name="bmsql_customer" shardingNode="an$1-10" function="benchmarksql-mod" shardingColumn="c_w_id"></shardingTable>
        <shardingTable name="bmsql_history" shardingNode="an$1-10" function="benchmarksql-mod" shardingColumn="h_w_id"></shardingTable>
        <shardingTable name="bmsql_new_order" shardingNode="an$1-10" function="benchmarksql-mod" shardingColumn="no_w_id"></shardingTable>
        <shardingTable name="bmsql_oorder" shardingNode="an$1-10" function="benchmarksql-mod" shardingColumn="o_w_id"></shardingTable>
        <shardingTable name="bmsql_order_line" shardingNode="an$1-10" function="benchmarksql-mod" shardingColumn="ol_w_id"></shardingTable>
        <shardingTable name="bmsql_stock" shardingNode="an$1-10" function="benchmarksql-mod" shardingColumn="s_w_id"></shardingTable>
</schema>

    <shardingNode name="an1" dbGroup="ha_group1" database="dh_dn_1"/>
    <shardingNode name="an2" dbGroup="ha_group2" database="dh_dn_2"/>
    <shardingNode name="an3" dbGroup="ha_group3" database="dh_dn_3"/>
    <shardingNode name="an4" dbGroup="ha_group4" database="dh_dn_4"/>
    <shardingNode name="an5" dbGroup="ha_group1" database="dh_dn_5"/>
    <shardingNode name="an6" dbGroup="ha_group2" database="dh_dn_6"/>
    <shardingNode name="an7" dbGroup="ha_group3" database="dh_dn_7"/>
    <shardingNode name="an8" dbGroup="ha_group4" database="dh_dn_8"/>
    <shardingNode name="an9" dbGroup="ha_group1" database="dh_dn_9"/>
    <shardingNode name="an10" dbGroup="ha_group2" database="dh_dn_10"/>

 <function name="benchmarksql-mod" class="Hash">
        <property name="partitionCount">10</property>
        <property name="partitionLength">1</property>
    </function>
</dble:sharding>

Experimental procedure

In the same test environment, the control group 1, 2 and the experimental group were tested respectively. The experimental group did not transfer the dump subfile to its corresponding backend to perform local import, but remotely connected to the respective backend MySQL service on the machine where the dump subfile was located, and simultaneously imported it and started timing, because it was a concurrent import , so the import time depends on the longest time-consuming backend MySQL node.

When the dump subfiles of each shard are imported, you can execute reload @@metadata; on the dble management side to reload all metadata information.

Then you can:

Time-consuming to obtain the imported data of each of the 3 sets of tests
Check whether the total number of rows in each of the 10 tables is exactly the same in the 3 groups of tests. Among them, the control group 2 and the experimental group (that is, the import performed by direct connection to dble and the import performed by split) can be directly queried through the dble business side select count(*) from tb_name;
For the control group 2 and the experimental group, you can also check the checksum value of each table corresponding to each checksum table tb_name; and the number of rows select count(*) from tb_name; through the back-end MySQL.

Test record

control group

control group 1

The same mysqldump file (75G), directly connected to MySQL for overall import without using dble, time-consuming statistics: 13181s

control group 2

The same mysqldump file (75G), without split, (the mysqldump file is on the host where dble is located) directly connected to dble import time statistics: 50883s

test group

Time-consuming statistics of the same mysqldump file (75G), after split processing + (connecting to the back-end MySQL remotely on the dble local machine) concurrently imported to the back-end MySQL: 912s+1839s=2751s

Data comparison:

In the 3 sets of tests, the total number of rows in the 10 tables related to benchmarksql is exactly the same. The checksum value of each table corresponding to each shard in the backend of the control group 2 and the experimental group (that is, the import performed by direct connection to dble and the import performed by split) and the number of rows are the same.

test results:

In this test:

Import rate comparison: For the same mysqldump file (75G), the split import rate is 5 times the direct overall MySQL import rate, and 18 times the overall direct import rate through dble. The import speed of split reaches 98G/h.
Import correctness comparison: The way of importing data through split is consistent with the final result of importing data through the direct connection to the dble service end.

summary

In theory, when the performance of the machine executing the split command is good enough and the MySQL server is sufficient, the import speed can be further improved. For example, we can try the following strategies:

Increase the number of shards appropriately
The choice of splitting algorithm, the planning data is more evenly distributed on each backend MySQL server
The dump subfile can also be transferred to the backend MySQL machine first, and then imported to reduce the consumption on the network

As powerful as split, there are still some limitations in use, such as:

Explicitly configured childTables are not supported (but tables with foreign key relationships are supported in sharding.xml with smart ER relationships)
view is not supported
For a table that uses a global sequence, the values in the original global sequence of the table will be erased and replaced with a global sequence, which requires attention.

For more details, please refer to the introduction of the split command

other

Test dble split function execution + import time-consuming shell script reference, interested parents can click
Check it out here .

Distributed | How to quickly import data into dble through the split function of dble