Foreword
A few days ago, a business requirement meant our project had to integrate user data from a sister department. That department does not provide an incremental user-data interface, so every synchronization has to pull their full user data set, which is on the order of tens of thousands of records. Because the data is a full snapshot, we have to compare it against our own data (note: the username is unique). If a synchronized user does not yet exist on our side, we insert it; if it already exists, we update it. This article discusses how to do that comparison when the data volume is relatively large.
Comparison logic
Because the username is unique, we can use it as the key for comparing and matching records.
Comparison implementation
1. Option 1: two-level nested loop comparison
That is: loop over the full data from the interface and, for each record, loop over the full data from our database.
Example
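@Override
public void compareAndSave(List<User> users, List<MockUser> mockUsers) {
    List<User> addUsers = new ArrayList<>();
    List<User> updateUsers = new ArrayList<>();
    for (MockUser mockUser : mockUsers) {
        boolean matched = false;
        for (User user : users) {
            if (mockUser.getUsername().equals(user.getUsername())) {
                // Existing user: copy the incoming fields but keep our primary key
                int id = user.getId();
                BeanUtils.copyProperties(mockUser, user);
                user.setId(id);
                updateUsers.add(user);
                matched = true;
                break; // usernames are unique, so stop at the first match
            }
        }
        // Add only after the whole inner loop found no match; an else branch
        // inside the inner loop would add one entry per non-matching pair
        if (!matched) {
            User newUser = new User();
            BeanUtils.copyProperties(mockUser, newUser);
            addUsers.add(newUser);
        }
    }
}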
With this approach I ran 300,000 records through the test environment; the comparison ran for about 20 minutes and then failed with an OOM (out-of-memory) error.
2. Option 2: use a Bloom filter
That is: before the comparison starts, load the data on our side into a Bloom filter, then use the Bloom filter to test each record from the interface.
Example
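@Override
public void compareAndSave(List<User> users, List<MockUser> mockUsers) {
    List<User> addUsers = new ArrayList<>();
    List<User> updateUsers = new ArrayList<>();
    BloomFilter<String> bloomFilter = getUserNameBloomFilter(users);
    // Needed to fetch the existing record once the filter reports a possible match
    Map<String, User> originUserMap = getOriginUserMap(users);
    for (MockUser mockUser : mockUsers) {
        boolean isExist = bloomFilter.mightContain(mockUser.getUsername());
        // A Bloom filter can report false positives, so confirm against the map
        User user = isExist ? originUserMap.get(mockUser.getUsername()) : null;
        if (user != null) {
            // Update: copy the incoming fields but keep our primary key
            int id = user.getId();
            BeanUtils.copyProperties(mockUser, user);
            user.setId(id);
            updateUsers.add(user);
        } else {
            // Insert
            User newUser = new User();
            BeanUtils.copyProperties(mockUser, newUser);
            addUsers.add(newUser);
        }
    }
}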
With this approach I ran 300,000 records through the test environment, and the comparison took about 1 second.
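The helper getUserNameBloomFilter is not shown in this article. The BloomFilter<String> type and its mightContain method match Guava's com.google.common.hash.BloomFilter, although the article does not say which library is used. To make the idea concrete without depending on any library, here is a minimal hand-rolled sketch. It is not the author's helper and not a library API: the class name SimpleBloomFilter, the hashing scheme, and the factory method ofUsernames are all assumptions for illustration. It sizes the bit array and the number of hash functions from the expected number of entries n and the target false-positive probability p using the standard formulas.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.List;

/** Hypothetical, simplified Bloom filter for usernames (illustration only). */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int expectedInsertions, double fpp) {
        // Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes
        double ln2 = Math.log(2);
        this.numBits = Math.max(1,
                (int) Math.ceil(-expectedInsertions * Math.log(fpp) / (ln2 * ln2)));
        this.numHashes = Math.max(1,
                (int) Math.round((double) numBits / expectedInsertions * ln2));
        this.bits = new BitSet(numBits);
    }

    int getNumBits() { return numBits; }

    int getNumHashes() { return numHashes; }

    /** Sets the k bit positions derived from the value. */
    void put(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    /** False means "definitely absent"; true means "possibly present". */
    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: derive the i-th bit position from two base hashes.
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = fnv1a(value);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // 32-bit FNV-1a over the UTF-8 bytes, used as the second base hash.
    private static int fnv1a(String value) {
        int hash = 0x811C9DC5;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    /** Builds a filter from a list of usernames (assumed factory method). */
    static SimpleBloomFilter ofUsernames(List<String> usernames, double fpp) {
        SimpleBloomFilter filter =
                new SimpleBloomFilter(Math.max(1, usernames.size()), fpp);
        for (String name : usernames) {
            filter.put(name);
        }
        return filter;
    }
}
```

Two properties matter for the comparison. A username that was inserted is never reported as absent, so a false result can safely be treated as an insert. A true result only means "possibly present", so it still has to be confirmed against the real data. Asking for a smaller p increases both the bit count and the number of hash functions, which means more work per lookup.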
3. Option 3: list + map comparison
That is: before the comparison starts, store our data in a map whose key is the username and whose value is the user record, then traverse the interface data and look each record up in the map.
Example
@Override
public void compareAndSave(List<User> users, List<MockUser> mockUsers) {
    // Index our existing users by username for constant-time lookups
    Map<String, User> originUserMap = getOriginUserMap(users);
    List<User> addUsers = new ArrayList<>();
    List<User> updateUsers = new ArrayList<>();
    for (MockUser mockUser : mockUsers) {
        if (originUserMap.containsKey(mockUser.getUsername())) {
            // Update: copy the incoming fields but keep our primary key
            User user = originUserMap.get(mockUser.getUsername());
            int id = user.getId();
            BeanUtils.copyProperties(mockUser, user);
            user.setId(id);
            updateUsers.add(user);
        } else {
            // Insert
            User user = new User();
            BeanUtils.copyProperties(mockUser, user);
            addUsers.add(user);
        }
    }
}
With this approach I ran 300,000 records through the test environment, and the comparison took about 350 milliseconds.
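The helper getOriginUserMap is not shown in this article. One plausible way to build such a map is sketched below. The class names SketchUser and OriginUserMaps and the method byUsername are hypothetical stand-ins, not the author's code; the only assumption about the real User entity is that it exposes getId() and getUsername().

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for the article's User entity (only the fields needed here).
class SketchUser {
    private final int id;
    private final String username;

    SketchUser(int id, String username) {
        this.id = id;
        this.username = username;
    }

    int getId() { return id; }

    String getUsername() { return username; }
}

class OriginUserMaps {
    // Index existing users by username in a single pass over the list.
    // Collectors.toMap throws IllegalStateException on a duplicate key,
    // which surfaces any violation of the "username is unique" assumption.
    static Map<String, SketchUser> byUsername(List<SketchUser> users) {
        return users.stream()
                .collect(Collectors.toMap(SketchUser::getUsername, Function.identity()));
    }
}
```

With the map in hand, each incoming record needs one hash lookup instead of a scan over the whole list. A single get followed by a null check would also work in place of containsKey plus get, as long as the map never stores null values.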
Summary
Of the three options, the nested loop is the least efficient, and it risks running out of memory as the data volume grows. The Bloom filter carries a risk of false positives: it may report a username as present when it is not. The only way to reduce that risk is to lower the false-positive rate, which can be set through a parameter, but doing so also increases the lookup time. The map turned out to be the most efficient of the three; in essence it reduces the time complexity from O(n²) to O(n). It may still not be the optimal solution. A friend I discussed it with suggested a two-pointer technique, but since I have not studied algorithms in depth, I have not explored it further for this article.
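The article does not say what the two-pointer suggestion would look like, so the following is only one possible reading of it, written as an assumption: sort both sides by username, then walk the two sorted lists in lockstep. The class TwoPointerCompare and its Result holder are hypothetical names, and the sketch works on plain username strings rather than the User and MockUser entities to stay short. It assumes usernames are unique within each list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class TwoPointerCompare {
    /** Result holder: usernames to insert and usernames to update. */
    static final class Result {
        final List<String> toAdd = new ArrayList<>();
        final List<String> toUpdate = new ArrayList<>();
    }

    static Result compare(List<String> existing, List<String> incoming) {
        // Sort copies so the callers' lists are left untouched.
        List<String> a = new ArrayList<>(existing);
        List<String> b = new ArrayList<>(incoming);
        Collections.sort(a);
        Collections.sort(b);

        Result result = new Result();
        int i = 0; // pointer into the sorted existing usernames
        int j = 0; // pointer into the sorted incoming usernames
        while (j < b.size()) {
            if (i >= a.size()) {
                // Existing side exhausted: everything left is new.
                result.toAdd.add(b.get(j++));
                continue;
            }
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {
                // Present on both sides: update.
                result.toUpdate.add(b.get(j));
                i++;
                j++;
            } else if (cmp < 0) {
                // Existing-only username: nothing to do, advance.
                i++;
            } else {
                // Incoming-only username: insert.
                result.toAdd.add(b.get(j++));
            }
        }
        return result;
    }
}
```

The walk itself is linear, but the sorting costs O(n log n), so this is not asymptotically better than the map unless both sides already arrive sorted by username. Its main appeal would be avoiding the extra map in memory.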
Demo link
https://github.com/lyb-geek/springboot-learning/tree/master/springboot-comparedata