How to efficiently compare two sets of data when the amount of data is relatively large

Foreword

A few days ago, one of our sub-projects needed to integrate user data from a sibling department. That department does not provide an incremental user-data interface, so every synchronization has to pull the full set of user data, which amounts to tens of thousands of records. Because the data is a full dump, we need to compare it with the data on our side (note: the username is unique). If a synchronized record does not exist on our side, we insert it; if it already exists, we update it. This article discusses how to do that comparison when the amount of data is relatively large.

Comparison logic

Because each username is unique, we can use it as the key for matching records between the two data sets.

Comparison implementation

1. Option 1: Nested-loop comparison

That is: for every record returned by the interface, loop over every record in our database and look for a matching username.

Example

    @Override
    public void compareAndSave(List<User> users, List<MockUser> mockUsers) {
        List<User> addUsers = new ArrayList<>();
        List<User> updateUsers = new ArrayList<>();
        for (MockUser mockUser : mockUsers) {
            for (User user : users) {
                if (mockUser.getUsername().equals(user.getUsername())) {
                    // update: keep our primary key, overwrite the other fields
                    int id = user.getId();
                    BeanUtils.copyProperties(mockUser, user);
                    user.setId(id);
                    updateUsers.add(user);
                } else {
                    // Caution: this branch runs for every non-matching pair,
                    // not once per mockUser, so addUsers fills with duplicates.
                    User newUser = new User();
                    BeanUtils.copyProperties(mockUser, newUser);
                    addUsers.add(newUser);
                }
            }
        }
    }

With this approach, I loaded 300,000 records in the test environment. The comparison ran for about 20 minutes and then failed with an OOM (out-of-memory) error.
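In the example above, the `else` branch sits inside the inner loop, so a new object is created for every non-matching pair. The following is a minimal sketch of a nested-loop variant that decides between insert and update only after the inner loop has finished, so each incoming record lands in exactly one list. The `Account` and `NestedLoopCompare` classes are hypothetical stand-ins for the article's `User`/`MockUser` types and are not part of the demo project. The comparison is still O(n²), so this only addresses the duplicate inserts, not the running time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal stand-in for the article's User / MockUser classes.
class Account {
    final Integer id;       // null for records that do not exist locally yet
    final String username;

    Account(Integer id, String username) {
        this.id = id;
        this.username = username;
    }
}

class NestedLoopCompare {
    final List<Account> toInsert = new ArrayList<>();
    final List<Account> toUpdate = new ArrayList<>();

    void compare(List<Account> local, List<Account> remote) {
        for (Account r : remote) {
            Account match = null;
            for (Account l : local) {
                if (r.username.equals(l.username)) {
                    match = l;
                    break; // usernames are unique, so stop at the first hit
                }
            }
            // Decide once per remote record, after the whole scan.
            if (match != null) {
                toUpdate.add(new Account(match.id, r.username)); // keep the local id
            } else {
                toInsert.add(new Account(null, r.username));
            }
        }
    }
}
```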

2. Option 2: Use Bloom Filter

That is: before the comparison starts, put the usernames from our side into a Bloom filter, then test each record from the interface against the filter to decide whether it already exists.

Example

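    @Override
    public void compareAndSave(List<User> users, List<MockUser> mockUsers) {
        List<User> addUsers = new ArrayList<>();
        List<User> updateUsers = new ArrayList<>();
        BloomFilter<String> bloomFilter = getUserNameBloomFilter(users);
        // The filter only answers "might exist"; the map supplies the record.
        // Assumption: originUserMap is built the same way as in Option 3.
        Map<String, User> originUserMap = getOriginUserMap(users);
        for (MockUser mockUser : mockUsers) {
            boolean mightExist = bloomFilter.mightContain(mockUser.getUsername());
            User user = mightExist ? originUserMap.get(mockUser.getUsername()) : null;
            if (user != null) {
                // update: keep our primary key, overwrite the other fields
                int id = user.getId();
                BeanUtils.copyProperties(mockUser, user);
                user.setId(id);
                updateUsers.add(user);
            } else {
                // insert: absent from the filter, or a false positive
                User newUser = new User();
                BeanUtils.copyProperties(mockUser, newUser);
                addUsers.add(newUser);
            }
        }
    }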

With this approach, I loaded 300,000 records in the test environment, and the comparison took about 1 second.
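To make the "might contain" semantics concrete, here is a toy sketch of a Bloom filter built on `java.util.BitSet`. It is not the library class used in the example above; `TinyBloomFilter`, `BloomCompareSketch`, and `shouldUpdate` are hypothetical names invented for illustration. The filter is deliberately tiny so that false positives are likely, which shows why a positive answer should be confirmed against the real data before treating a record as an update.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy Bloom filter, for illustration only.
class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    TinyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.reverse(h1) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    void put(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    // false means "definitely absent"; true only means "possibly present".
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}

class BloomCompareSketch {
    // Returns true if the username should be updated, false if it should be inserted.
    static boolean shouldUpdate(TinyBloomFilter filter,
                                Map<String, Integer> idByUsername,
                                String username) {
        if (!filter.mightContain(username)) {
            return false; // definitely not on our side
        }
        // A positive answer may be a false positive, so confirm it exactly.
        return idByUsername.containsKey(username);
    }

    public static void main(String[] args) {
        Map<String, Integer> idByUsername = new HashMap<>();
        idByUsername.put("alice", 1);
        idByUsername.put("bob", 2);

        TinyBloomFilter filter = new TinyBloomFilter(8, 1); // tiny on purpose
        for (String name : idByUsername.keySet()) {
            filter.put(name);
        }

        int falsePositives = 0;
        for (int i = 0; i < 100; i++) {
            if (filter.mightContain("user" + i)) {
                falsePositives++;
            }
        }
        System.out.println("false positives out of 100 unknown names: " + falsePositives);
        System.out.println("update alice? " + shouldUpdate(filter, idByUsername, "alice"));
    }
}
```

A Bloom filter never reports a stored key as absent, so the negative branch is safe to treat as an insert without any further check.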

3. Option 3: Use list + map comparison

That is: before the comparison starts, store our data in a map whose key is the username and whose value is the user record, then traverse the interface data and look each username up in the map.

Example

    @Override
    public void compareAndSave(List<User> users, List<MockUser> mockUsers) {
        // index our own users by username for constant-time lookups
        Map<String, User> originUserMap = getOriginUserMap(users);
        List<User> addUsers = new ArrayList<>();
        List<User> updateUsers = new ArrayList<>();
        for (MockUser mockUser : mockUsers) {
            User user = originUserMap.get(mockUser.getUsername());
            if (user != null) {
                // update: keep our primary key, overwrite the other fields
                int id = user.getId();
                BeanUtils.copyProperties(mockUser, user);
                user.setId(id);
                updateUsers.add(user);
            } else {
                // insert: username not present on our side
                User newUser = new User();
                BeanUtils.copyProperties(mockUser, newUser);
                addUsers.add(newUser);
            }
        }
    }

With this approach, I loaded 300,000 records in the test environment, and the comparison took about 350 milliseconds.
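The example calls `getOriginUserMap(users)` without showing it. One plausible way to build such a map, sketched here under the assumption that it simply indexes each user by username, is `Collectors.toMap`. The `LocalUser` and `UserIndex` names are hypothetical and stand in for the article's `User` class and helper method.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical minimal stand-in for the article's User class.
class LocalUser {
    final int id;
    final String username;

    LocalUser(int id, String username) {
        this.id = id;
        this.username = username;
    }

    String getUsername() {
        return username;
    }
}

class UserIndex {
    // Indexes users by username; the two-argument toMap throws
    // IllegalStateException if two users share a username.
    static Map<String, LocalUser> byUsername(List<LocalUser> users) {
        return users.stream()
                .collect(Collectors.toMap(LocalUser::getUsername, Function.identity()));
    }
}
```

Failing fast on a duplicate key fits the stated rule that usernames are unique: a duplicate would indicate bad data rather than something to merge silently.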

Summary

Of these three approaches, the nested loop is the least efficient, and the risk of OOM grows with the amount of data. A Bloom filter carries a risk of false positives. That risk can only be lowered, not eliminated, by configuring a smaller false-positive rate through the filter's parameters, and a smaller rate in turn costs more memory and more hash computations per lookup. The map is the most efficient of the three here: in essence it reduces the time complexity from O(n²) to O(n). Even so, it may not be the optimal solution. A friend I discussed this with suggested a two-pointer approach, but since I have not studied that algorithm in depth, this article does not demonstrate it.
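The two-pointer idea is not shown in the article, so the following is only one possible reading of it: a sort-merge comparison. Both username lists are sorted, then two indexes advance in step and classify each incoming username as an insert or an update. The `TwoPointerCompare` class is hypothetical, and the sketch compares bare usernames instead of full user objects to stay short. Sorting costs O(n log n), so this is not asymptotically faster than the map; it is mainly of interest when both sides already arrive sorted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sort-merge comparison over usernames.
class TwoPointerCompare {
    final List<String> toInsert = new ArrayList<>();
    final List<String> toUpdate = new ArrayList<>();

    void compare(List<String> localNames, List<String> remoteNames) {
        // Copy before sorting so the caller's lists are left untouched.
        List<String> local = new ArrayList<>(localNames);
        List<String> remote = new ArrayList<>(remoteNames);
        Collections.sort(local);
        Collections.sort(remote);

        int i = 0; // index into local
        int j = 0; // index into remote
        while (j < remote.size()) {
            if (i >= local.size()) {
                // local is exhausted: everything left in remote is new
                toInsert.add(remote.get(j));
                j++;
                continue;
            }
            int cmp = local.get(i).compareTo(remote.get(j));
            if (cmp == 0) {
                toUpdate.add(remote.get(j)); // present on both sides
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // username exists only locally; nothing to do
            } else {
                toInsert.add(remote.get(j)); // username exists only remotely
                j++;
            }
        }
    }
}
```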

Demo link

https://github.com/lyb-geek/springboot-learning/tree/master/springboot-comparedata

linyb极客之路