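The project needs to process a 180 GB gz file from upstream: read its contents, filter and deduplicate them, and pass the result downstream, still in gz format. Below are several approaches.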

1. Use the Linux less command directly to sidestep the compress/decompress step. In practice it was extremely slow, so I dropped it.

2. Java API GZIPOutputStream. Not tested yet; I expect it will not perform very well.
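For reference, approach 2 could look roughly like the sketch below. It is untested against the real 180 GB input and rests on assumptions that are not in the original: the data is line-oriented UTF-8 text, the filter is a simple placeholder predicate, and the set of distinct lines fits in memory (at this scale it may well not, in which case the dedup step would need an external sort or similar). The class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipFilterDedup {

    // Streams lines out of a gz file, keeps those accepted by `keep`,
    // drops repeats, and writes the survivors to a new gz file.
    // Returns the number of lines written.
    static long filterAndDedup(Path in, Path out, Predicate<String> keep) throws IOException {
        final int bufferSize = 1 << 16;      // 64 KiB buffers; an untuned guess
        Set<String> seen = new HashSet<>();  // assumption: distinct lines fit in memory
        long written = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                     new GZIPInputStream(Files.newInputStream(in), bufferSize),
                     StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                     new GZIPOutputStream(Files.newOutputStream(out), bufferSize),
                     StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Set.add returns false for a line that was already seen.
                if (keep.test(line) && seen.add(line)) {
                    writer.write(line);
                    writer.write('\n');
                    written++;
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java GzipFilterDedup in.gz out.gz
        // The non-empty-line filter is only a stand-in for the real rule.
        long n = filterAndDedup(Paths.get(args[0]), Paths.get(args[1]), line -> !line.isEmpty());
        System.out.println("lines written: " + n);
    }
}
```

Because the data is decompressed, filtered and recompressed in one pass, no uncompressed intermediate file is written to disk, which is the main point of trying this route.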

3. Single-threaded bash gzip/gunzip
gunzip took 16 minutes
gzip took 18 minutes

2022-08-01 13:21 started
2022-08-01 13:55 finished
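As a quick sanity check, the two timestamps can be compared with the per-step timings. The sketch below assumes the timestamps bracket the whole gunzip-then-gzip run; the class and method names are made up for illustration.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ElapsedCheck {

    // Whole minutes between two "yyyy-MM-dd HH:mm" style timestamps.
    static long minutesBetween(String start, String end) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm");
        return Duration.between(LocalDateTime.parse(start, fmt),
                                LocalDateTime.parse(end, fmt)).toMinutes();
    }

    public static void main(String[] args) {
        long wallClock = minutesBetween("2022-08-01 13:21", "2022-08-01 13:55");
        long stepSum = 16 + 18; // gunzip + gzip, from the timings above
        // prints "34 min wall clock, 34 min summed"
        System.out.println(wallClock + " min wall clock, " + stepSum + " min summed");
    }
}
```

The two figures agree, so a full single-threaded round trip costs about 34 minutes before any filtering or deduplication.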

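4. Multi-threaded bash gzip/gunzip
The bottleneck for this kind of I/O work should be the disk, so I did not expect multithreading to help much and decided to run a test. (Desensitized pseudocode)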

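import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CopyFileMain {

    // Runs an external command and returns its exit code, or -1 on failure.
    static Integer callShell(String... command) {
        try {
            // inheritIO() keeps the child from blocking on a full output pipe.
            Process p = new ProcessBuilder(command).inheritIO().start();
            return p.waitFor();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        int fileCount = 3;
        ExecutorService executorService = Executors.newFixedThreadPool(fileCount);

        List<Future<Integer>> list = new ArrayList<>();

        for (int i = 0; i < fileCount; i++) {
            // Placeholder names: each task decompresses its own file.
            String file = "xx" + i + ".gz";
            list.add(executorService.submit(() -> callShell("gunzip", file)));
        }

        // get() blocks until the task completes, so no busy-wait loop is needed.
        for (Future<Integer> future : list) {
            System.out.println("exit code: " + future.get());
        }
        System.out.println("finished");
        executorService.shutdown();
    }
}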

gunzip took 13 minutes
Preliminary conclusion: the bottleneck is disk I/O
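
One way to probe the disk-bound hypothesis further is to time a plain sequential read of the same gz file with no decompression at all. If that alone takes close to the gunzip time, the disk is the likely limit; if it is much faster, CPU-side decompression deserves another look. The sketch below is a rough measurement aid under those assumptions, not a benchmark: the class name is made up, and a second run may be served from the OS page cache and look unrealistically fast.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadThroughput {

    // Reads the file front to back without decompressing it
    // and returns the number of bytes read.
    static long readAll(Path file) throws IOException {
        byte[] buf = new byte[1 << 20]; // 1 MiB read buffer; an untuned guess
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java ReadThroughput xx.gz
        Path file = Paths.get(args[0]);
        long start = System.nanoTime();
        long bytes = readAll(file);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d bytes in %.1f s (%.1f MB/s)%n",
                bytes, seconds, bytes / 1e6 / seconds);
    }
}
```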


chen