1. Introduction
Recently, while developing the company's business, I ran into a scenario where a large amount of data has to be inserted into multiple tables in the database, which led to this article: things to watch out for when inserting data in batches with Mybatis, plus a simple encapsulation of the batched insert logic using functional programming.
For most Java back-end developers doing CRUD work, using Mybatis to operate the database is pretty much unavoidable. In a SpringBoot project, once Mybatis is introduced, CRUD is implemented through injected mapper interfaces.
For example, if you want to insert a single row, you can write this in mapper.xml:
<insert id="insertOneItem">
insert into massive_data_insert_test
(value1,value2)
values
(#{value1},#{value2})
</insert>
Then call mapper.insertOneItem(insertItem); in the service layer.
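A minimal sketch of that call; the map keys mirror the #{value1}/#{value2} placeholders, and the mapper/method names are simply the ones used above:

// Build one row and hand it to the mapper; the keys match the XML placeholders.
HashMap<String, String> insertItem = new HashMap<>();
insertItem.put("value1", "v1"); // bound to #{value1}
insertItem.put("value2", "v2"); // bound to #{value2}
mapper.insertOneItem(insertItem);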
If you want to insert multiple rows and you are a Java developer, you might write this:
for(int i = 0; i < insertList.size(); i++){
mapper.insertOneItem(insertList.get(i));
}
The mapper's single-row insert is simply placed in a loop. This is easy to understand, and with a small amount of data there is no problem.
But once there is too much data, problems appear.
Each mapper method invocation goes through connecting to the database, preprocessing (PreparedStatement) and executing the SQL. Put that in a for loop and the whole process is repeated on every iteration, and we know the number of database connections is limited (establishing a connection is very resource-intensive). With too many iterations, not only does performance drop, the database may be blocked or the program may even crash. Of course, we can configure a database connection pool (such as HikariCP or Druid) to reuse connections, but this is still a waste of resources.
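For reference, a minimal HikariCP setup sketch; in a SpringBoot project these settings normally live in the application configuration instead of code, and the URL and credentials below are placeholders:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import javax.sql.DataSource;

public class PoolConfig {
    public static DataSource buildPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://localhost:5432/supply"); // placeholder URL
        config.setUsername("user");     // placeholder
        config.setPassword("password"); // placeholder
        config.setMaximumPoolSize(10);  // cap on pooled connections to reuse
        return new HikariDataSource(config);
    }
}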
In short: if something can be done in one go, don't do it in many round trips.
Most databases' insert statements support inserting multiple rows at once:
insert into table
(value1, value2)
values ('v1','v2'),('v3','v4'),('v1','v5')
Mybatis also provides a way to batch-manipulate data: the <foreach> tag in a dynamic SQL statement will splice together an SQL statement like the one above for us.
We write the following mapper.xml
<insert id="testMassiveInsert">
insert into massive_data_insert_test
(value1,value2)
values
<foreach collection="list" item="item" separator=",">
(#{item.value1},#{item.value2})
</foreach>
</insert>
In this way, we only need to call the mapper method once and the for loop disappears, and only one SQL statement is actually executed. The dynamic SQL splices the contents of the incoming list into the insert statement via prepared-statement placeholders; with three items in the list, for example, the statement sent to the database is roughly insert into massive_data_insert_test (value1,value2) values (?,?),(?,?),(?,?).
This approach is obviously better than the for-loop implementation.
2. An error is reported when the amount of data inserted in batches reaches tens of thousands
However, when we use the SQL-splicing approach above for batch insertion, problems also appear once the amount of data gets too large!
Let's first run an experiment and see what happens when 40,000 rows are inserted in one batch.
First, create a new table as the insert target: massive_data_insert_test.
CREATE TABLE "supply"."massive_data_insert_test" (
"value1" varchar(255) COLLATE "pg_catalog"."default",
"value2" varchar(255) COLLATE "pg_catalog"."default"
);
Then connect to the database from a SpringBoot project and write the insert statement (how to set up the mapper will not be repeated here). Below are the mapper interface and the mapper.xml file containing the SQL statement.
TestMapper.java
@Repository
public interface TestMapper {
void testMassiveInsert(List<HashMap> list);
}
TestMapper.xml
<insert id="testMassiveInsert">
insert into massive_data_insert_test
(value1,value2)
values
<foreach collection="list" item="item" separator=",">
(#{item.value1},#{item.value2})
</foreach>
</insert>
Test code:
@Service
@Slf4j
public class TestService {
// maximum number of rows per batch insert
private static final int maxInsertItemNumPerTime = 500;
private TestMapper mapper;
@Autowired
public TestService(TestMapper mapper) {
this.mapper = mapper;
}
public Result testMassiveInsert() {
long startTime = System.currentTimeMillis(); // record start time
List<HashMap> list = new LinkedList<>();
// assemble data: a linked list of 500 * 80 = 40000 entries
for (int i = 0; i < maxInsertItemNumPerTime * 80; i++) {
HashMap map = new HashMap();
map.put("value1", "value1" + i);
map.put("value2", "value2" + i);
list.add(map);
}
// insert the whole batch directly
try {
mapper.testMassiveInsert(list);
} catch (RuntimeException e) {
log.info("直接批量插入" + list.size() + "失败", e.getMessage());
throw new RuntimeException(e);
}
long endTime = System.currentTimeMillis(); // record end time
return Result.ok().message("Elapsed time: " + (endTime - startTime) + "ms");
}
}
Running the direct bulk insert above immediately reports an I/O error. Why is that?
As mentioned earlier, the dynamic SQL statement splices the contents of the incoming parameters into the insert statement, so the cause of this error is that there is simply too much content to splice: the resulting SQL statement is too long, which triggers the I/O error. In short, when the amount of data is too large, the spliced statement becomes too long and the program fails.
The length of the SQL is related not only to the number of rows but also to the number of parameters in each inserted row; in fact it grows roughly linearly with the product of the two. So when each row carries many parameters, the number of rows inserted per batch needs to be controlled even more carefully.
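To make the scale concrete, the number of bound parameters is roughly rows × columns. A small sizing sketch follows; the 32767 figure is the bind-parameter ceiling of the PostgreSQL extended protocol, other drivers have their own limits, so treat it as an assumption to verify for your environment:

// Illustrative helper: keep rows * columnsPerRow under the driver's parameter limit.
static int maxRowsPerBatch(int columnsPerRow, int driverParamLimit) {
    return Math.max(1, driverParamLimit / columnsPerRow);
}
// e.g. maxRowsPerBatch(2, 32767) == 16383, so 40000 two-column rows in one statement is far too many.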
So how do we solve it? The simplest fix is to insert in batches, a bit like the for loop at the beginning of the article, except that each iteration now performs a batch insert instead of a single-row insert.
Since the data is no longer inserted in one go, we wrap the whole thing in a transaction so that if any batch fails, everything can be rolled back.
@Transactional
public Result testMassiveInsert() {
long startTime = System.currentTimeMillis(); // record start time
List<HashMap> list = new LinkedList<>();
// assemble data: a linked list of 500 * 80 = 40000 entries
for (int i = 0; i < maxInsertItemNumPerTime * 80; i++) {
HashMap map = new HashMap();
map.put("value1", "value1" + i);
map.put("value2", "value2" + i);
list.add(map);
}
// insert in batches, one sub-list at a time
try {
if (list.size() > maxInsertItemNumPerTime) {
List<List<HashMap>> all = new ArrayList<>();
int i = 0;
while (i < list.size()) {
List<HashMap> subList = list.subList(i, Math.min(i + maxInsertItemNumPerTime, list.size()));
i = i + maxInsertItemNumPerTime;
all.add(subList);
}
all.parallelStream().forEach(o -> mapper.testMassiveInsert(o));
}
} catch (RuntimeException e) {
log.info("分批次批量插入" + list.size() + "失败", e.getMessage());
throw new RuntimeException(e);
}
long endTime = System.currentTimeMillis(); // record end time
return Result.ok().message("Elapsed time: " + (endTime - startTime) + "ms");
}
We control the size of each inserted batch through the maxInsertItemNumPerTime constant.
3. Simple test
Below is my simple test: the total number of rows inserted is always 40,000, but maxInsertItemNumPerTime is set to different values and the running time of the program is compared. The test was not repeated enough times to average out noise, and there may also be network jitter, so treat it only as a rough comparison. The values tried were:
- 2000
- 1000
- 500
- 250
In the end I chose 500 as the batch size. As noted above, even with connection reuse from the pool, too many round trips to the database still degrade performance, so maxInsertItemNumPerTime should not be made as small as possible.
At the same time, as maxInsertItemNumPerTime grows, the SQL preprocessing (splicing) done in each iteration gets heavier, and its cost does not grow merely linearly but often closer to exponentially (I looked up some material that seems to confirm this guess); otherwise 2000 would not have been so much slower than 500.
In real business you should run a similar quick test to pick a more suitable value; a rough test is better than no test at all.
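If you want to repeat the comparison in your own environment, a rough sketch is below; insertInBatches is a hypothetical variant of the batched insert above that takes the batch size as a parameter, and single runs are noisy, so repeat them if you can:

for (int batchSize : new int[]{2000, 1000, 500, 250}) {
    long start = System.currentTimeMillis();
    insertInBatches(list, batchSize); // hypothetical parameterized variant of the code above
    log.info("batchSize={} took {} ms", batchSize, System.currentTimeMillis() - start);
}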
4. Some extensions
In fact, the official Mybatis documentation provides another way to support bulk inserts. However, since the company's projects all operate the database through scanned Mapper interfaces, and this kind of massive-insert scenario is fairly rare, I did not specially adopt the approach below. The documentation describes multiple row inserts as follows:
A multiple row insert is a single insert statement that inserts multiple rows into a table. This can be a convenient way to insert a few rows into a table, but it has some limitations:
- Since it is a single SQL statement, you could generate quite a lot of prepared statement parameters. For example, suppose you wanted to insert 1000 records into a table, and each record had 5 fields. With a multiple row insert you would generate a SQL statement with 5000 parameters. There are limits to the number of parameters allowed in a JDBC prepared statement - and this kind of insert could easily exceed those limits. If you want to insert many records, you should probably use a JDBC batch insert instead (see below)
- The performance of a giant insert statement may be less than you expect. If you have many records to insert, it will almost always be more efficient to use a JDBC batch insert (see below). With a batch insert, the JDBC driver can do some optimization that is not possible with a single large statement
- Retrieving generated values with multiple row inserts can be a challenge. MyBatis currently has some limitations related to retrieving generated keys in multiple row inserts that require special considerations (see below)
Nevertheless, there are use cases for a multiple row insert - especially when you just want to insert a few records in a table and don’t need to retrieve generated keys. In those situations, a multiple row insert will be an easy solution.
try (SqlSession session = sqlSessionFactory.openSession()) {
GeneratedAlwaysAnnotatedMapper mapper = session.getMapper(GeneratedAlwaysAnnotatedMapper.class);
List<GeneratedAlwaysRecord> records = getRecordsToInsert(); // not shown
MultiRowInsertStatementProvider<GeneratedAlwaysRecord> multiRowInsert = insertMultiple(records)
.into(generatedAlways)
.map(id).toProperty("id")
.map(firstName).toProperty("firstName")
.map(lastName).toProperty("lastName")
.build()
.render(RenderingStrategies.MYBATIS3);
int rows = mapper.insertMultiple(multiRowInsert);
}
Above is the sample code from the documentation.
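For completeness, the "JDBC batch insert" the documentation recommends for large volumes can also be reached from plain Mybatis by opening a session with ExecutorType.BATCH and reusing a single-row statement. This is only a sketch, not the approach used in this article, and it assumes a single-row method like insertOneItem from the beginning of the article is declared on the mapper:

try (SqlSession session = sqlSessionFactory.openSession(ExecutorType.BATCH)) {
    TestMapper mapper = session.getMapper(TestMapper.class);
    for (HashMap<String, String> row : list) {
        mapper.insertOneItem(row); // statements are queued, not executed one by one
    }
    session.flushStatements();     // sends the queued statements as a JDBC batch
    session.commit();
}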
5. Further optimize the code
Since what I actually have to do in the company's development is import data into multiple tables, if I follow the batched approach above, every one of my inserts has to contain this logic:
// insert in batches, one sub-list at a time
try {
if (list.size() > maxInsertItemNumPerTime) {
List<List<HashMap>> all = new ArrayList<>();
int i = 0;
while (i < list.size()) {
List<HashMap> subList = list.subList(i, Math.min(i + maxInsertItemNumPerTime, list.size()));
i = i + maxInsertItemNumPerTime;
all.add(subList);
}
all.parallelStream().forEach(o -> mapper.testMassiveInsert(o));
}
} catch (RuntimeException e) {
log.info("分批次批量插入" + list.size() + "失败", e.getMessage());
throw new RuntimeException(e);
}
Obviously this is the bad smell of duplicated code, and it does not look good either. So next, let's make a simple encapsulation: wrap this code into one method that anyone who needs to insert can call.
First, what parameters does this code need to take?
- maxInsertItemNumPerTime does not need to be passed in, because it is obviously a constant.
- list needs to be passed in, and its element type is not always HashMap: different tables correspond to different entity classes, so a generic type T is used.
- The mapper insert method (testMassiveInsert above) also needs to be passed in, because different tables are obviously inserted through different mapper methods. Passing a method as a parameter calls for Lambda expressions and functional programming. If you know the functional interfaces, you will naturally think that a function with only input and no output should be a Consumer (its counterpart is Supplier, and one with both input and output is a Function).
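As a quick, purely illustrative reference for these three interfaces from java.util.function (Java 8):

Consumer<List<String>> printer = rows -> System.out.println(rows.size() + " rows"); // input, no output
Supplier<List<String>> producer = () -> Arrays.asList("a", "b");                    // output, no input
Function<List<String>, Integer> counter = List::size;                               // input and output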
So the final abstracted code should look like this:
public <T> void batchSplitInsert(List<T> list, Consumer<List<T>> insertFunc) {
List<List<T>> all = new ArrayList<>();
if (list.size() > maxInsertItemNumPerTime) {
int i = 0;
while (i < list.size()) {
List<T> subList;
if (i + maxInsertItemNumPerTime > list.size()) {
subList = list.subList(i, list.size());
} else {
subList = list.subList(i, i + maxInsertItemNumPerTime);
}
i = i + maxInsertItemNumPerTime;
all.add(subList);
}
all.parallelStream().forEach(insertFunc);
} else {
insertFunc.accept(list);
}
}
In this way, when I do inserts into different tables:
// list of data to be inserted
List<TableDTO> needToInsert = ……;
// perform the insert
Consumer<List<TableDTO>> consumer = o -> mapper.insertTable(o);
batchSplitInsert(needToInsert, consumer);
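The same helper can then be reused for every other table in the import; only the mapper method changes (the names below are hypothetical):

// A hypothetical second table: only the Consumer changes.
List<OtherTableDTO> otherRows = getOtherRowsToInsert(); // not shown
batchSplitInsert(otherRows, rows -> otherTableMapper.insertOtherTable(rows));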
Now everything looks much more elegant! If you are new to the job or still learning Java and don't know functional programming yet, you might as well start with a simple Lambda expression; it really is a super handy feature of Java 8.
That's it for this article. If you have any questions, leave a comment and let's discuss.
If this article helped you, please give it a thumbs up, thank you! And if anyone has a better implementation, please be sure to let me know!