Performance of in vs. find in Python

Posted 2016-08-10

Updated 2016-08-11

if info_list[key] in content:

if content.find(info_list[key]) != -1:

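What is the performance difference between these two ways of searching for a substring?

bb.txt is a 5 GB file. The in version ran for 8 hours; the find version ran for 20 hours.

Yikes! Is there a faster way?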

#coding:utf-8
import os
import sys

def getList(filename) :
    fp = open(filename, 'r')
    info_list = {}
    for line in open(filename):
        line = fp.readline()
        tmp = line.strip('\n').split('\t')
        info_list[tmp[0]] = tmp[1]
    fp.close()
    return info_list


def matchname(info_list, input_file, output_file) : 
    fp = open(input_file, 'r')
    fw = open(output_file, 'a')
    for line in open(input_file):
        line = fp.readline()
        tmp = line.strip('\n').split('\t')
        content = tmp[2]
        for key in info_list:
            if info_list[key] in content:
            #if content.find(info_list[key]) != -1 :
                result = tmp[0] + '\t' + tmp[1] + '\t' +  info_list[key] + '\t'  + key + '\n'
                fw.write(result)
            else :
                continue
    fw.close()
    fp.close()

if __name__ == '__main__':
    # date=sys.argv[1]
    info_filename = 'aa.txt'
    content_filename = 'bb.txt'
    result_filename = 'final_output2.txt'
    info_list = getList(info_filename)
    matchname(info_list, content_filename, result_filename)
    print('done...')

a@bj-m-20a:~/study/python_learn$ cat aa.txt
aaa    赵六
bbb    赵四
ccc    李丽
ddd    吴小龙

a@bj-m-20a:~/study/python_learn$ cat bb.txt
001    0001    李丽你好啊李丽是个大美女
002    0002    赵四家的后厨赵六你是我的爱
003    0003    吴小龙上次数学考的不及格

a@bj-m-20a:~/study/python_learn$ cat final_output2.txt
001    0001    李丽    ccc
001    0001    李丽    ccc
002    0002    赵六    aaa
002    0002    赵四    bbb
003    0003    吴小龙    ddd
001    0001    李丽    ccc
002    0002    赵六    aaa
002    0002    赵四    bbb
003    0003    吴小龙    ddd

3 answers

dokelung

✓ Accepted

I tweaked your code a bit; this should be more concise:

(Revised once more following @依云's suggestion.)

def getinfo(filename) :
    """Read 'ID name' lines into a dict mapping ID -> name."""
    info = {}
    with open(filename, 'r') as f:
        for line in f:
            ID, name = line.strip().split()
            info[ID] = name
    return info


def matchname(info, input_file, output_file) : 
    # 'w' rather than 'a', so rerunning does not append duplicate rows
    with open(input_file, 'r') as reader, open(output_file, 'w') as writer:
        for line in reader:
            n1, n2, content = line.strip().split()
            # Write one output row for every name that occurs in this line's content
            for ID, name in info.items():
                if name in content:
                    print(n1, n2, name, ID, sep='\t', file=writer)


if __name__ == '__main__':
    info_filename = 'aa.txt'
    content_filename = 'bb.txt'
    result_filename = 'final_output2.txt'
    info = getinfo(info_filename)
    matchname(info, content_filename, result_filename)
    print('done')

(I'll come back later to add an explanation...)

Questions I've answered: Python-QA

依云

Posted 2016-08-11

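Of course in is faster than find: compared with find, it skips one attribute lookup and one function call, at the cost of one extra comparison operation: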

>>> def t():
...   return "abctestdef".find("testx")
... 
>>> import dis
>>> dis.dis(t)
  2           0 LOAD_CONST               1 ('abctestdef')
              3 LOAD_ATTR                0 (find)
              6 LOAD_CONST               2 ('testx')
              9 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             12 RETURN_VALUE
>>> def t():
...   return "test" in "abctestdef"
... 
>>> dis.dis(t)
  2           0 LOAD_CONST               1 ('test')
              3 LOAD_CONST               2 ('abctestdef')
              6 COMPARE_OP               6 (in)
              9 RETURN_VALUE
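
To put a number on that per-call overhead, timeit can run both forms side by side. This is only a minimal sketch: the SETUP string, the variable names, and the repeat count are my own choices rather than anything from the original post, and the timings will vary with the machine and the Python version.

```python
import timeit

# Hypothetical test strings, borrowed from the dis example above.
SETUP = 'haystack = "abctestdef"; needle = "testx"'
REPEATS = 200_000

# Time the operator form and the method form of the same substring test.
t_in = timeit.timeit('needle in haystack', setup=SETUP, number=REPEATS)
t_find = timeit.timeit('haystack.find(needle) != -1', setup=SETUP, number=REPEATS)

print(f"in:   {t_in:.4f} s for {REPEATS} calls")
print(f"find: {t_find:.4f} s for {REPEATS} calls")
```

The gap measured this way is a fixed cost per call, so it is worth timing your real data before drawing conclusions.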

If you want it even faster, you could consider using Rust :-)
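
Before switching languages, an algorithmic change may be worth trying: the loop in the question scans each line once per name, whereas an Aho-Corasick automaton scans each line once in total, however many names there are. The sketch below is my own illustration, not part of the original answers, and build_automaton and find_names are hypothetical helper names. It assumes every name is a non-empty string. In pure Python the per-character loop is slow, so with only a handful of names the built-in in will probably still win; it should pay off only when the name list is large, so please measure.

```python
from collections import deque

def build_automaton(names):
    """Build an Aho-Corasick automaton for the given non-empty names."""
    goto, fail, out = [{}], [0], [[]]   # state 0 is the root
    for name in names:
        state = 0
        for ch in name:                 # insert the name into the trie
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(name)
    # Breadth-first pass: compute failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_names(content, automaton):
    """Return the set of names occurring anywhere in content (one pass)."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in content:
        while state and ch not in goto[state]:
            state = fail[state]         # fall back to the longest usable suffix
        state = goto[state].get(ch, 0)
        found.update(out[state])
    return found

automaton = build_automaton(['赵六', '赵四', '李丽', '吴小龙'])
print(find_names('赵四家的后厨赵六你是我的爱', automaton))
```

The automaton is built once from aa.txt and reused for every line of bb.txt; mapping each found name back to its ID is then a dict lookup.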

Also, your code isn't written very well. For file operations, I'd recommend using with rather than closing files manually.
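
As a concrete illustration of the with suggestion, here is how getList from the question might look. Note that the original opens each file twice (once as fp and once more inside the for statement), so every file is read through two handles and the second one is never explicitly closed. This is a sketch under assumptions: the name get_list, the encoding='utf-8' argument, and the skipping of malformed lines are my additions, not the asker's.

```python
def get_list(filename):
    """Read tab-separated 'ID<TAB>name' lines into a dict of ID -> name."""
    info_list = {}
    # The with block closes the file automatically, even if an error occurs.
    # encoding='utf-8' is an assumption about how the data files are stored.
    with open(filename, 'r', encoding='utf-8') as fp:
        for line in fp:                      # iterate the one handle directly
            tmp = line.rstrip('\n').split('\t')
            if len(tmp) < 2:
                continue                     # assumed policy: skip blank or malformed lines
            info_list[tmp[0]] = tmp[1]
    return info_list
```

The same pattern applies to matchname: open the input and output files in a single with statement and iterate over the input handle itself.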