Foreword

A few days ago, a business requirement meant that one of our sub-projects had to consume user data from a sibling department. That department does not provide an incremental user data interface, so every synchronization pulls its full set of user data, which is on the order of tens of thousands of records. Because it is a full dump, we have to compare it with the data on our side (note: the username is unique). If a synchronized record does not exist on our side, we insert it; if it already exists, we update it. This article looks at how to do that comparison when the amount of data is relatively large.

Comparison logic

Because the username is unique, we can use it as the key for comparing and matching records.

Comparison implementation

1. Option 1: Two-level nested loop comparison

That is: loop over the full data from the interface and, for each record, loop over the full data in our database.

Example

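 @Override
    public void compareAndSave(List<User> users, List<MockUser> mockUsers) {
        List<User> addUsers = new ArrayList<>();
        List<User> updateUsers = new ArrayList<>();
        // outer loop: interface data; inner loop: our database data
        for (MockUser mockUser : mockUsers) {
            for (User user : users) {
                if(mockUser.getUsername().equals(user.getUsername())){
                    // update: copy the new values but keep our primary key
                    int id = user.getId();
                    BeanUtils.copyProperties(mockUser,user);
                    user.setId(id);
                    updateUsers.add(user);
                }else{
                    // NOTE: as written, this branch runs for every non-matching
                    // pair rather than once per unmatched mockUser, so addUsers
                    // can grow toward users.size() * mockUsers.size() entries
                    User newUser = new User();
                    BeanUtils.copyProperties(mockUser,newUser);
                    addUsers.add(newUser);
                }
            }
        }
        // persisting addUsers and updateUsers is not shown in this excerpt
    }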

With this approach, I load-tested 300,000 records in the test environment. After about 20 minutes of comparing, the process failed with an OOM (out of memory) error.
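In the loop above, the else branch runs for every non-matching pair, not once per unmatched user, so the add list grows far beyond the number of genuinely new users. The following is a minimal sketch of how a nested loop could add each unmatched record only once. It is not the code that was tested: `NestedLoopSketch`, its simplified `User` class (used for both sides instead of `User` and `MockUser`), and the `email` field are assumptions made for illustration. It is still O(n × m) in time, so it only addresses the list growth, not the slowness.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the article's beans.
class NestedLoopSketch {

    static class User {
        Integer id;        // null until the row exists in our database
        String username;   // unique key used for matching
        String email;      // example payload field (assumption)

        User(Integer id, String username, String email) {
            this.id = id;
            this.username = username;
            this.email = email;
        }
    }

    static class Result {
        final List<User> addUsers = new ArrayList<>();
        final List<User> updateUsers = new ArrayList<>();
    }

    // users: rows already in our database; incoming: full data from the interface
    static Result compare(List<User> users, List<User> incoming) {
        Result result = new Result();
        for (User in : incoming) {
            User matched = null;
            for (User user : users) {
                if (in.username.equals(user.username)) {
                    matched = user;
                    break; // usernames are unique, so stop at the first hit
                }
            }
            if (matched != null) {
                matched.email = in.email;          // copy new values, keep the id
                result.updateUsers.add(matched);
            } else {
                // decided once per incoming record, after the inner loop ends
                result.addUsers.add(new User(null, in.username, in.email));
            }
        }
        return result;
    }
}
```

With this shape, the add list holds at most one entry per incoming record.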

2. Option 2: Use a Bloom filter

That is: before the comparison starts, first put the data on our side into a Bloom filter, then use the Bloom filter to check each record from the interface.

Example

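 @Override
    public void compareAndSave(List<User> users,List<MockUser> mockUsers){
        List<User> addUsers = new ArrayList<>();
        List<User> updateUsers = new ArrayList<>();
        BloomFilter<String> bloomFilter = getUserNameBloomFilter(users);
        // originUserMap is not declared in the original snippet; it is assumed
        // here to be the username -> User map built as in Option 3
        Map<String,User> originUserMap = getOriginUserMap(users);
        for (MockUser mockUser : mockUsers) {
            boolean isExist = bloomFilter.mightContain(mockUser.getUsername());
            // a Bloom filter can return false positives, so confirm the hit
            User user = isExist ? originUserMap.get(mockUser.getUsername()) : null;
            if(user != null){
               // update: copy the new values but keep our primary key
               int id = user.getId();
               BeanUtils.copyProperties(mockUser,user);
               user.setId(id);
               updateUsers.add(user);
            }else{
                // insert
                User newUser = new User();
                BeanUtils.copyProperties(mockUser,newUser);
                addUsers.add(newUser);
            }
        }
        // persisting addUsers and updateUsers is not shown in this excerpt
    }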

With this approach, I load-tested 300,000 records in the test environment, and the comparison took about 1 second.
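To make the false-positive behavior of a Bloom filter concrete, here is a toy filter built on `java.util.BitSet`. It is only a sketch: `ToyBloomFilter`, its sizing parameters, and its hashing scheme are assumptions for illustration, and it is neither the `getUserNameBloomFilter` helper used above (which is not shown in this article) nor any library's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Toy Bloom filter for illustration only (hypothetical class, not a library API).
class ToyBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of bit positions set per element

    ToyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    void put(String value) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(value, i));
        }
    }

    // false means "definitely absent"; true only means "possibly present"
    boolean mightContain(String value) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    // double hashing: combine two hash values to derive the i-th position
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = fnv1a(value);
        return Math.floorMod(h1 + i * h2, size);
    }

    // 32-bit FNV-1a over the UTF-8 bytes, used as the second hash
    private static int fnv1a(String value) {
        int hash = 0x811C9DC5;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x01000193;
        }
        return hash;
    }
}
```

An inserted value always answers true, but a value that was never inserted can also answer true when its bit positions happen to be set by other values. In an extreme case such as an 8-bit filter that is completely full, every lookup answers true. This is why a positive answer should be confirmed against the real data before it is treated as an update.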

3. Option 3: Use list + map comparison

That is: before the comparison starts, first store our data in a map whose key is the username and whose value is the user record, then traverse the interface data and look each record up in the map.

Example

 @Override
    public void compareAndSave(List<User> users, List<MockUser> mockUsers) {
        Map<String,User> originUserMap = getOriginUserMap(users);
        List<User> addUsers = new ArrayList<>();
        List<User> updateUsers = new ArrayList<>();
        for (MockUser mockUser : mockUsers) {
             if(originUserMap.containsKey(mockUser.getUsername())){
                 User user = originUserMap.get(mockUser.getUsername());
                 int id = user.getId();
                 BeanUtils.copyProperties(mockUser,user);
                 user.setId(id);
                 updateUsers.add(user);
             }else{
                 User user = new User();
                 BeanUtils.copyProperties(mockUser,user);
                 addUsers.add(user);
             }
        }
    }

With this approach, I load-tested 300,000 records in the test environment, and the comparison took about 350 milliseconds.
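The `getOriginUserMap` helper is not shown in this article. One plausible way to build such a map, assuming usernames really are unique, is `Collectors.toMap`. The class `OriginUserMapSketch` and its reduced `User` type below are hypothetical stand-ins, not the project's actual code.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch of a username -> User index.
class OriginUserMapSketch {

    static class User {
        final int id;
        final String username;

        User(int id, String username) {
            this.id = id;
            this.username = username;
        }
    }

    // Builds the lookup map in a single pass over our own data.
    // Collectors.toMap throws IllegalStateException on a duplicate key,
    // so a violated uniqueness assumption fails loudly instead of silently.
    static Map<String, User> getOriginUserMap(List<User> users) {
        return users.stream()
                .collect(Collectors.toMap(u -> u.username, Function.identity()));
    }
}
```

Building the map costs one pass over our data, and each later lookup by username is a hash lookup instead of a scan of the whole list.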

Summary

Of these three options, the two-level loop is the least efficient, and it risks an OOM as the amount of data grows. A Bloom filter carries a risk of misjudgment (false positives). The only way to reduce that risk is to lower the false-positive rate, which can be set through a parameter, but doing so also increases the time each check takes. The map was the most efficient option in these tests; in essence it reduces the time complexity from O(n²) to O(n). However, it may not be the optimal solution. After I discussed it with a friend, he suggested that a two-pointer technique could also be used; because I have not studied algorithms in depth, I have not compared that approach here.
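As a rough illustration of the two-pointer idea mentioned above, and not something that was benchmarked for this article, both lists can be sorted by username and then walked in step. The names `TwoPointerSketch`, `User`, `Result`, and the `email` field are assumptions for the sketch, and usernames are assumed to be unique within each list. Sorting makes this O(n log n), so it is not expected to beat the map on time; its possible appeal is that it needs no extra lookup structure when the data already arrives sorted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: sort both sides by username, then merge-walk them.
class TwoPointerSketch {

    static class User {
        Integer id;        // null for records not yet in our database
        String username;
        String email;      // example payload field (assumption)

        User(Integer id, String username, String email) {
            this.id = id;
            this.username = username;
            this.email = email;
        }
    }

    static class Result {
        final List<User> addUsers = new ArrayList<>();
        final List<User> updateUsers = new ArrayList<>();
    }

    static Result compare(List<User> users, List<User> incoming) {
        // sort copies so the callers' lists are left untouched
        List<User> left = new ArrayList<>(users);
        List<User> right = new ArrayList<>(incoming);
        Comparator<User> byName = Comparator.comparing(u -> u.username);
        left.sort(byName);
        right.sort(byName);

        Result result = new Result();
        int i = 0; // pointer into our data
        for (User in : right) { // second pointer walks the interface data
            // skip our users whose username sorts before the incoming one
            while (i < left.size()
                    && left.get(i).username.compareTo(in.username) < 0) {
                i++;
            }
            if (i < left.size() && left.get(i).username.equals(in.username)) {
                User existing = left.get(i);
                existing.email = in.email;   // copy new values, keep the id
                result.updateUsers.add(existing);
                i++;
            } else {
                result.addUsers.add(new User(null, in.username, in.email));
            }
        }
        return result;
    }
}
```

Each pointer only ever moves forward, so after sorting the walk itself is a single linear pass over both lists.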

Demo link

https://github.com/lyb-geek/springboot-learning/tree/master/springboot-comparedata


linyb极客之路