1. Introduction
Recently, while developing the company's business, I ran into a scenario where a large amount of data has to be inserted into multiple tables in the database, which led to this article: things to watch out for when inserting data in batches with Mybatis, plus a simple encapsulation of the batched insert logic using functional programming.
For most Java back-end developers doing CRUD work, using Mybatis to operate the database is pretty much unavoidable. In a SpringBoot project, once Mybatis is introduced, CRUD is implemented through injected mapper interfaces.
For example, if you want to insert a single row, you can write this in mapper.xml:
<insert id="insertOneItem">
insert into massive_data_insert_test
(value1,value2)
values
(#{value1},#{value2})
</insert>
Then call mapper.insertOneItem(insertItem); in the service layer.
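A minimal sketch of that call; the map keys mirror the #{value1}/#{value2} placeholders, and the mapper/method names are simply the ones used above:

// Build one row and hand it to the mapper; the keys match the XML placeholders.
HashMap<String, String> insertItem = new HashMap<>();
insertItem.put("value1", "v1"); // bound to #{value1}
insertItem.put("value2", "v2"); // bound to #{value2}
mapper.insertOneItem(insertItem);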
If you want to insert multiple rows and you are a Java developer, you might write this:
for(int i = 0; i < insertList.size(); i++){
mapper.insertOneItem(insertList.get(i));
}
The mapper's single-row insert is simply placed in a loop. This is easy to understand, and with a small amount of data there is no problem.
But once there is too much data, problems appear.
Each mapper method invocation goes through connecting to the database, preprocessing (PreparedStatement) and executing the SQL. Put that in a for loop and the whole process is repeated on every iteration, and we know the number of database connections is limited (establishing a connection is very resource-intensive). With too many iterations, not only does performance drop, the database may be blocked or the program may even crash. Of course, we can configure a database connection pool (such as HikariCP or Druid) to reuse connections, but this is still a waste of resources.
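For reference, a minimal HikariCP setup sketch; in a SpringBoot project these settings normally live in the application configuration instead of code, and the URL and credentials below are placeholders:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import javax.sql.DataSource;

public class PoolConfig {
    public static DataSource buildPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://localhost:5432/supply"); // placeholder URL
        config.setUsername("user");     // placeholder
        config.setPassword("password"); // placeholder
        config.setMaximumPoolSize(10);  // cap on pooled connections to reuse
        return new HikariDataSource(config);
    }
}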
In short: if something can be done in one go, don't do it in many round trips.
Most databases' insert statements support inserting multiple rows at once:
insert into table
(value1, value2)
values ('v1','v2'),('v3','v4'),('v1','v5')
Mybatis also provides a way to batch-manipulate data: the <foreach> tag in a dynamic SQL statement will splice together an SQL statement like the one above for us.
We write the following mapper.xml
<insert id="testMassiveInsert">
insert into massive_data_insert_test
(value1,value2)
values
<foreach collection="list" item="item" separator=",">
(#{item.value1},#{item.value2})
</foreach>
</insert>
In this way, we only need to call the mapper method once and the for loop disappears, and only one SQL statement is actually executed. The dynamic SQL splices the contents of the incoming list into the insert statement via prepared-statement placeholders; with three items in the list, for example, the statement sent to the database is roughly insert into massive_data_insert_test (value1,value2) values (?,?),(?,?),(?,?).
This approach is obviously better than the for-loop implementation.
2. An error is reported when the amount of data inserted in batches reaches tens of thousands
However, when we use the SQL-splicing approach above for batch insertion, problems also appear once the amount of data gets too large!
Let's first run an experiment and see what happens when 40,000 rows are inserted in one batch.
First, create a new table as the insert target: massive_data_insert_test.
CREATE TABLE "supply"."massive_data_insert_test" (
"value1" varchar(255) COLLATE "pg_catalog"."default",
"value2" varchar(255) COLLATE "pg_catalog"."default"
);
Then connect to the database from a SpringBoot project and write the insert statement (how to set up the mapper will not be repeated here). Below are the mapper interface and the mapper.xml file containing the SQL statement.
TestMapper.java
@Repository
public interface TestMapper {
void testMassiveInsert(List<HashMap> list);
}
TestMapper.xml
<insert id="testMassiveInsert">
insert into massive_data_insert_test
(value1,value2)
values
<foreach collection="list" item="item" separator=",">
(#{item.value1},#{item.value2})
</foreach>
</insert>
Test code:
@Service
@Slf4j
public class TestService {
// maximum number of rows per batch insert
private static final int maxInsertItemNumPerTime = 500;
private TestMapper mapper;
@Autowired
public TestService(TestMapper mapper) {
this.mapper = mapper;
}
public Result testMassiveInsert() {
long startTime = System.currentTimeMillis(); // record start time
List<HashMap> list = new LinkedList<>();
// assemble data: a linked list of 500 * 80 = 40000 entries
for (int i = 0; i < maxInsertItemNumPerTime * 80; i++) {
HashMap map = new HashMap();
map.put("value1", "value1" + i);
map.put("value2", "value2" + i);
list.add(map);
}
// insert the whole batch directly
try {
mapper.testMassiveInsert(list);
} catch (RuntimeException e) {
log.info("直接批量插入" + list.size() + "失败", e.getMessage());
throw new RuntimeException(e);
}
long endTime = System.currentTimeMillis(); // record end time
return Result.ok().message("Elapsed time: " + (endTime - startTime) + "ms");
}
}
Running the direct bulk insert above immediately reports an I/O error. Why is that?
As mentioned earlier, the dynamic SQL statement splices the contents of the incoming parameters into the insert statement, so the cause of this error is that there is simply too much content to splice: the resulting SQL statement is too long, which triggers the I/O error. In short, when the amount of data is too large, the spliced statement becomes too long and the program fails.
The length of the SQL is related not only to the number of rows but also to the number of parameters in each inserted row; in fact it grows roughly linearly with the product of the two. So when each row carries many parameters, the number of rows inserted per batch needs to be controlled even more carefully.
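To make the scale concrete, the number of bound parameters is roughly rows × columns. A small sizing sketch follows; the 32767 figure is the bind-parameter ceiling of the PostgreSQL extended protocol, other drivers have their own limits, so treat it as an assumption to verify for your environment:

// Illustrative helper: keep rows * columnsPerRow under the driver's parameter limit.
static int maxRowsPerBatch(int columnsPerRow, int driverParamLimit) {
    return Math.max(1, driverParamLimit / columnsPerRow);
}
// e.g. maxRowsPerBatch(2, 32767) == 16383, so 40000 two-column rows in one statement is far too many.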
So how do we solve it? The simplest fix is to insert in batches, a bit like the for loop at the beginning of the article, except that each iteration now performs a batch insert instead of a single-row insert.
Since the data is no longer inserted in one go, we wrap the whole thing in a transaction so that if any batch fails, everything can be rolled back.
@Transactional
public Result testMassiveInsert() {
long startTime = System.currentTimeMillis(); // record start time
List<HashMap> list = new LinkedList<>();
// assemble data: a linked list of 500 * 80 = 40000 entries
for (int i = 0; i < maxInsertItemNumPerTime * 80; i++) {
HashMap map = new HashMap();
map.put("value1", "value1" + i);
map.put("value2", "value2" + i);
list.add(map);
}
// insert in batches, one sub-list at a time
try {
if (list.size() > maxInsertItemNumPerTime) {
List<List<HashMap>> all = new ArrayList<>();
int i = 0;
while (i < list.size()) {
List<HashMap> subList = list.subList(i, Math.min(i + maxInsertItemNumPerTime, list.size()));
i = i + maxInsertItemNumPerTime;
all.add(subList);
}
all.parallelStream().forEach(o -> mapper.testMassiveInsert(o));
}
} catch (RuntimeException e) {
log.info("分批次批量插入" + list.size() + "失败", e.getMessage());
throw new RuntimeException(e);
}
long endTime = System.currentTimeMillis(); // record end time
return Result.ok().message("Elapsed time: " + (endTime - startTime) + "ms");
}
We control the size of each inserted batch through the maxInsertItemNumPerTime constant.
3. Simple test
Below is my simple test: the total number of rows inserted is always 40,000, but maxInsertItemNumPerTime is set to different values and the running time of the program is compared. The test was not repeated enough times to average out noise, and there may also be network jitter, so treat it only as a rough comparison. The values tried were:
- 2000
- 1000
- 500
- 250
In the end I chose 500 as the batch size. As noted above, even with connection reuse from the pool, too many round trips to the database still degrade performance, so maxInsertItemNumPerTime should not be made as small as possible.
At the same time, as maxInsertItemNumPerTime grows, the SQL preprocessing (splicing) done in each iteration gets heavier, and its cost does not grow merely linearly but often closer to exponentially (I looked up some material that seems to confirm this guess); otherwise 2000 would not have been so much slower than 500.
In real business you should run a similar quick test to pick a more suitable value; a rough test is better than no test at all.
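If you want to repeat the comparison in your own environment, a rough sketch is below; insertInBatches is a hypothetical variant of the batched insert above that takes the batch size as a parameter, and single runs are noisy, so repeat them if you can:

for (int batchSize : new int[]{2000, 1000, 500, 250}) {
    long start = System.currentTimeMillis();
    insertInBatches(list, batchSize); // hypothetical parameterized variant of the code above
    log.info("batchSize={} took {} ms", batchSize, System.currentTimeMillis() - start);
}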
4. Some extensions
In fact, the official Mybatis documentation provides another way to support bulk inserts. However, since the company's projects all operate the database through scanned Mapper interfaces, and this kind of massive-insert scenario is fairly rare, I did not specially adopt the approach below. The documentation describes multiple row inserts as follows:
A multiple row insert is a single insert statement that inserts multiple rows into a table. This can be a convenient way to insert a few rows into a table, but it has some limitations:
- Since it is a single SQL statement, you could generate quite a lot of prepared statement parameters. For example, suppose you wanted to insert 1000 records into a table, and each record had 5 fields. With a multiple row insert you would generate a SQL statement with 5000 parameters. There are limits to the number of parameters allowed in a JDBC prepared statement - and this kind of insert could easily exceed those limits. If you want to insert many records, you should probably use a JDBC batch insert instead (see below)
- The performance of a giant insert statement may be less than you expect. If you have many records to insert, it will almost always be more efficient to use a JDBC batch insert (see below). With a batch insert, the JDBC driver can do some optimization that is not possible with a single large statement
- Retrieving generated values with multiple row inserts can be a challenge. MyBatis currently has some limitations related to retrieving generated keys in multiple row inserts that require special considerations (see below)
Nevertheless, there are use cases for a multiple row insert - especially when you just want to insert a few records in a table and don’t need to retrieve generated keys. In those situations, a multiple row insert will be an easy solution.
try (SqlSession session = sqlSessionFactory.openSession()) {
GeneratedAlwaysAnnotatedMapper mapper = session.getMapper(GeneratedAlwaysAnnotatedMapper.class);
List<GeneratedAlwaysRecord> records = getRecordsToInsert(); // not shown
MultiRowInsertStatementProvider<GeneratedAlwaysRecord> multiRowInsert = insertMultiple(records)
.into(generatedAlways)
.map(id).toProperty("id")
.map(firstName).toProperty("firstName")
.map(lastName).toProperty("lastName")
.build()
.render(RenderingStrategies.MYBATIS3);
int rows = mapper.insertMultiple(multiRowInsert);
}
Above is the sample code from the documentation.
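For completeness, the "JDBC batch insert" the documentation recommends for large volumes can also be reached from plain Mybatis by opening a session with ExecutorType.BATCH and reusing a single-row statement. This is only a sketch, not the approach used in this article, and it assumes a single-row method like insertOneItem from the beginning of the article is declared on the mapper:

try (SqlSession session = sqlSessionFactory.openSession(ExecutorType.BATCH)) {
    TestMapper mapper = session.getMapper(TestMapper.class);
    for (HashMap<String, String> row : list) {
        mapper.insertOneItem(row); // statements are queued, not executed one by one
    }
    session.flushStatements();     // sends the queued statements as a JDBC batch
    session.commit();
}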
5. Further optimize the code
Since what I actually have to do in the company's development is import data into multiple tables, if I follow the batched approach above, every one of my inserts has to contain this logic:
// insert in batches, one sub-list at a time
try {
if (list.size() > maxInsertItemNumPerTime) {
List<List<HashMap>> all = new ArrayList<>();
int i = 0;
while (i < list.size()) {
List<HashMap> subList = list.subList(i, Math.min(i + maxInsertItemNumPerTime, list.size()));
i = i + maxInsertItemNumPerTime;
all.add(subList);
}
all.parallelStream().forEach(o -> mapper.testMassiveInsert(o));
}
} catch (RuntimeException e) {
log.info("分批次批量插入" + list.size() + "失败", e.getMessage());
throw new RuntimeException(e);
}
Obviously this is the bad smell of duplicated code, and it does not look good either. So next, let's make a simple encapsulation: wrap this code into one method that anyone who needs to insert can call.
First, what parameters does this code need to take?
- maxInsertItemNumPerTime does not need to be passed in, because it is obviously a constant.
- list needs to be passed in, and its element type is not always HashMap: different tables correspond to different entity classes, so a generic type T is used.
- The mapper insert method (testMassiveInsert above) also needs to be passed in, because different tables are obviously inserted through different mapper methods. Passing a method as a parameter calls for Lambda expressions and functional programming. If you know the functional interfaces, you will naturally think that a function with only input and no output should be a Consumer (its counterpart is Supplier, and one with both input and output is a Function).
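As a quick, purely illustrative reference for these three interfaces from java.util.function (Java 8):

Consumer<List<String>> printer = rows -> System.out.println(rows.size() + " rows"); // input, no output
Supplier<List<String>> producer = () -> Arrays.asList("a", "b");                    // output, no input
Function<List<String>, Integer> counter = List::size;                               // input and output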
So the final abstracted code should look like this:
public <T> void batchSplitInsert(List<T> list, Consumer<List<T>> insertFunc) {
List<List<T>> all = new ArrayList<>();
if (list.size() > maxInsertItemNumPerTime) {
int i = 0;
while (i < list.size()) {
List<T> subList;
if (i + maxInsertItemNumPerTime > list.size()) {
subList = list.subList(i, list.size());
} else {
subList = list.subList(i, i + maxInsertItemNumPerTime);
}
i = i + maxInsertItemNumPerTime;
all.add(subList);
}
all.parallelStream().forEach(insertFunc);
} else {
insertFunc.accept(list);
}
}
In this way, when I do inserts into different tables:
// list of data to be inserted
List<TableDTO> needToInsert = ……;
// perform the insert
Consumer<List<TableDTO>> consumer = o -> mapper.insertTable(o);
batchSplitInsert(needToInsert, consumer);
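The same helper can then be reused for every other table in the import; only the mapper method changes (the names below are hypothetical):

// A hypothetical second table: only the Consumer changes.
List<OtherTableDTO> otherRows = getOtherRowsToInsert(); // not shown
batchSplitInsert(otherRows, rows -> otherTableMapper.insertOtherTable(rows));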
Now everything looks much more elegant! If you are new to the job or still learning Java and don't know functional programming yet, you might as well start with a simple Lambda expression; it really is a super handy feature of Java 8.
That's it for this article. If you have any questions, leave a comment and let's discuss.
If this article helped you, please give it a thumbs up, thank you! And if anyone has a better implementation, please be sure to let me know!