Question about how the parallel command works

I don't quite understand the following passage from the official documentation of the parallel tool:
For better parallelism GNU parallel can distribute the arguments between all the parallel jobs when end of file is met.

Below GNU parallel reads the last argument when generating the second job. When GNU parallel reads the last argument, it spreads all the arguments for the second job over 4 jobs instead, as 4 parallel jobs are requested.

The first job will be the same as the --xargs example above, but the second job will be split into 4 evenly sized jobs, resulting in a total of 5 jobs:

cat num30000 | parallel --jobs 4 -m echo | wc -l
Output (if you run this under Bash on GNU/Linux):

5
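The passage clearly talks about splitting into 4 jobs, so why is the result 5 lines?
Second, does the passage mean that parallel first reads the whole file and then distributes its contents as arguments to the jobs? If the file is large, wouldn't reading all of it before distributing be slow? For example, when counting the lines of a very large file, reading the whole file first and then distributing the work (which is only line counting) across parallel jobs should take longer than simply running wc -l, shouldn't it?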

Even stranger, on my machine the result is 6 lines:

[10:01 sxuan@hulab ~]$ cat num30000 | parallel --jobs 4 -m echo | wc -l
6

Thanks!

1 answer

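-m passes multiple input lines as arguments to a single command, and the length of a command line is limited, so more than 4 processes end up being started to handle the input (--jobs 4 only caps how many run at the same time).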

> seq 1 30000 | parallel --jobs 4 -m echo | wc -l
5
> seq 1 100000 | parallel --jobs 4 -m echo | wc -l
8
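
To make the length-limit explanation concrete, here is a rough sketch that packs input lines greedily into command lines under a byte limit and counts how many are needed. The helper names `count_batches` and `expected_m_lines` are made up for illustration, and the limits 131072 and 65536 are assumed values, not numbers read from parallel itself. The second helper applies the rule from the quoted documentation: every batch but the last runs as one job, and the last batch is spread over all job slots.

```shell
# Hypothetical helper: count how many command lines are needed when the
# input lines are packed greedily under a byte limit (first argument).
count_batches() {
    awk -v limit="$1" '
        {
            need = length($0) + 1      # the argument plus one separator
            if (used > 0 && used + need > limit) { batches++; used = 0 }
            used += need
        }
        END { if (used > 0) batches++; print batches + 0 }
    '
}

# Simplified model of -m based on the quoted passage: all batches except
# the last are one job each; the last batch is split over all job slots.
# It ignores corner cases such as a last batch with fewer arguments than jobs.
expected_m_lines() {
    limit=$1
    jobs=$2
    batches=$(count_batches "$limit")
    if [ "$batches" -eq 0 ]; then
        echo 0
    else
        echo $(( batches - 1 + jobs ))
    fi
}

seq 1 30000  | expected_m_lines 131072 4   # prints 5
seq 1 100000 | expected_m_lines 131072 4   # prints 8
seq 1 30000  | expected_m_lines 65536 4    # prints 6
```

Under the assumed 131072-byte limit this simple model gives 5 and 8, matching the output above. With a smaller assumed limit such as 65536 it gives 6 for 30000 lines, which may be why the asker's machine printed 6, though the thread does not confirm the actual limit there.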

You can see the argument length limit with xargs --show-limits:

> xargs --show-limits
Your environment variables take up 892 bytes
POSIX upper limit on argument length (this system): 2094212
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2093320
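
As a rough check, the total size of the arguments can be compared with the system limit. This sketch only uses `seq`, `wc` and `getconf`; the value printed by `getconf ARG_MAX` varies by system, so no output is shown for it.

```shell
# Total bytes of the 30000 arguments (each number plus one separator).
seq 1 30000 | wc -c        # 168894

# Total bytes of the 100000 arguments.
seq 1 100000 | wc -c       # 588895

# System limit on the combined size of arguments and environment.
getconf ARG_MAX 2>/dev/null || echo "ARG_MAX not available"
```

Both totals are below the 2093320 bytes reported above, yet the 30000-line input is still split, so the limit parallel applies in practice appears to be lower than the figure from xargs. One possible cause, offered as an assumption rather than something this thread confirms, is that Linux also caps a single argument string at 131072 bytes.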