Python: how to optimize text-processing efficiency and memory usage?

I have a large file (795 GB) with 7 columns. Whenever columns 1, 2, 3, 6, and 7 are identical, the values in columns 4 and 5 should be summed.
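For example (made-up data; the real file is tab-separated), these two rows agree on columns 1, 2, 3, 6, and 7, so their columns 4 and 5 are summed into one row:

chr1    100    +    3    10    C    CG
chr1    100    +    2    5     C    CG

merges into

chr1    100    +    5    15    C    CG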
I wrote a simple version that does the job, but the server only has 200 GB of RAM and the data won't fit.
My code is as follows:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = ' author'
__author_email__ = 'xxx@icloud.com'


def add(line, anno):
    # Key on columns 1, 2, 3, 6, 7; accumulate columns 4 and 5.
    chr, position, strand, methy_read, all_read, methy_nt, nt = line.strip().split()
    key = (chr, position, strand, methy_nt, nt)
    if key in anno:
        # Element-wise sum; tuple(...) forces evaluation, since map() is lazy in Python 3.
        anno[key] = tuple(map(lambda x, y: x + y, anno[key], (int(methy_read), int(all_read))))
    else:
        anno[key] = (int(methy_read), int(all_read))
    return anno


dict1 = {}
with open('test.tab', 'r') as f:
    for line in f:
        add(line, dict1)

for key, value in dict1.items():
    # Keys and values are already tuples, so they can be unpacked directly.
    print(*key, *value, sep='\t')

2 answers

Reading the file with for line in f is lazy, generator-style processing, so under normal circumstances it doesn't use much memory. Unless the set of distinct (1, 2, 3, 6, 7) combinations is enormous, it's unlikely to exhaust memory. Of course, the data file itself could also be at fault, for example missing newline characters, which would make one "line" huge.
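If the distinct (1, 2, 3, 6, 7) combinations really are too numerous to fit in 200 GB, one standard workaround (not part of this answer's code, just a sketch) is to sort the file on those columns first and then merge adjacent rows in a single streaming pass, which needs almost no memory. The filename sorted.tab below is hypothetical, and the sort command in the comment assumes GNU sort, which does its own disk-based merging:

import itertools

# First sort on the key columns, e.g. with GNU sort (spills to disk on its own):
#   sort -t $'\t' -k1,3 -k6,7 test.tab > sorted.tab

def group_key(fields):
    # Columns 1, 2, 3, 6, 7 (0-based indices 0, 1, 2, 5, 6).
    return (fields[0], fields[1], fields[2], fields[5], fields[6])

with open('sorted.tab', 'r') as f:
    rows = (line.rstrip('\n').split('\t') for line in f)
    for k, group in itertools.groupby(rows, key=group_key):
        methy_sum = all_sum = 0
        for fields in group:
            methy_sum += int(fields[3])  # column 4
            all_sum += int(fields[4])    # column 5
        # Output keeps the original 7-column order.
        print(k[0], k[1], k[2], methy_sum, all_sum, k[3], k[4], sep='\t')

Note that itertools.groupby only merges consecutive rows, which is exactly why the sort step is required first.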

That said, the code probably still has problems; personally I don't think it's written very clearly, so I've reworked it below for reference.

import operator
from collections import defaultdict

def readTabFile(filename):
    # Missing keys default to (0, 0), so new and existing keys take the same path.
    anno = defaultdict(lambda: (0, 0))

    def add(line):
        chr, position, strand, methy_read, all_read, methy_nt, nt = line.strip().split()
        k = (chr, position, strand, methy_nt, nt)
        v = (int(methy_read), int(all_read))
        # Element-wise addition of the running totals and this row's counts.
        anno[k] = tuple(map(operator.add, anno[k], v))

    with open(filename, 'r') as f:
        for line in f:
            add(line)
    return anno


for k, v in readTabFile('test.tab').items():
    print(*k, *v, sep='\t')
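One further memory squeeze, not in the answer above but sometimes useful when the key set is large: the chromosome, strand, and nucleotide fields repeat constantly, and sys.intern makes all equal strings share a single object instead of storing millions of copies. A sketch of a drop-in replacement for the nested add() inside readTabFile above (sys is the only extra import; operator and anno come from the enclosing scope):

import sys

def add(line):
    chr, position, strand, methy_read, all_read, methy_nt, nt = line.strip().split()
    # Intern the highly repetitive fields so identical strings are stored once;
    # position varies too much to be worth interning.
    k = (sys.intern(chr), position, sys.intern(strand),
         sys.intern(methy_nt), sys.intern(nt))
    v = (int(methy_read), int(all_read))
    anno[k] = tuple(map(operator.add, anno[k], v))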