格式转换代码见下边,就是代码运行起来很慢,想看看大家是否有优化方案
Number of segment pairs = 182; number of pairwise comparisons = 40
'+' means given segment; '-' means reverse complement
Overlaps Containments No. of Constraints Supporting Overlap
******************* SCL279Contig1 ********************
whvs7e09.R+
24990481+
******************* SCL279Contig2 ********************
et|RFL_Contig5917+
24993123+ is in et|RFL_Contig5917+
whsctal27f01.R+ is in et|RFL_Contig5917+
whxn27054l15.R+ is in et|RFL_Contig5917+
whsctal3n06.F- is in et|RFL_Contig5917+
whthkles18l09.R+ is in et|RFL_Contig5917+
whsctal3n06.R+ is in et|RFL_Contig5917+
32771503+ is in whsctal3n06.R+
32678311+ is in et|RFL_Contig5917+
whxn27054l15.F-
DETAILED DISPLAY OF CONTIGS
******************* SCL279Contig1 ********************
. : . : . : . : . : . :
whvs7e09.R+ CCGCAGGGGTCGCGGCCGAGCAGGGCGGCGCCGGCGACGAGCTCTCCGCTCTGTTCAAGG
____________________________________________________________
consensus CCGCAGGGGTCGCGGCCGAGCAGGGCGGCGCCGGCGACGAGCTCTCCGCTCTGTTCAAGG
. : . : . : . : . : . :
whvs7e09.R+ AGTTCTCGCTAGACAGCAGCAGCACCTTCGCCGAGGCGCGGATCCGGGCCACCTTCTACC
____________________________________________________________
consensus AGTTCTCGCTAGACAGCAGCAGCACCTTCGCCGAGGCGCGGATCCGGGCCACCTTCTACC
. : . : . : . : . : . :
whvs7e09.R+ CCAAGTTCGAGAACGAGGAATCCGACCAGGAGTCAAGAACCCGGATGATTGAGATGGTGT
____________________________________________________________
consensus CCAAGTTCGAGAACGAGGAATCCGACCAGGAGTCAAGAACCCGGATGATTGAGATGGTGT
. : . : . : . : . : . :
whvs7e09.R+ CACAAGGATTAGCTACCATGGAGGTTACGCTCAAGCATTCAGGGTCTTTGTTCATGTATG
24990481+ TTCATGTATG
____________________________________________________________
consensus CACAAGGATTAGCTACCATGGAGGTTACGCTCAAGCATTCAGGGTCTTTGTTCATGTATG
. : . : . : . : . : . :
whvs7e09.R+ CTGGTAACCGTGGTGGTGCATATGCCAAGAACAGCTTTGGAAATATCTACACTGCTGTGG
24990481+ CTGGTAACCGTGGTGGTGCATATGCCAAGAACAGCTTTGGAAATATCTACACTGCTGTGG
____________________________________________________________
consensus CTGGTAACCGTGGTGGTGCATATGCCAAGAACAGCTTTGGAAATATCTACACTGCTGTGG
. : . : . : . : . : . :
whvs7e09.R+ GCGTTTTTGTTTTGGGTCGCTTGTTTCGTGAAGCTTGGGGGAGAGAAGCTCCTAAAATGC
24990481+ GCGTTTTTGTTTTGGGTCGCTTGTTTCGTGAAGCTTGGGGGAGAGAAGCTCCTAAAATGC
____________________________________________________________
consensus GCGTTTTTGTTTTGGGTCGCTTGTTTCGTGAAGCTTGGGGGAGAGAAGCTCCTAAAATGC
. : . : . : . : . : . :
whvs7e09.R+ AAGCGGAATTCAATGATTGTCTCGAGAAAAACCGAATAAGCATTTCAATGGAACTTGTCA
24990481+ AAGCGGAATTCAATGATTGTCTCGAGAAAAACCGAATAAGCATTTCAATGGAACTTGTCA
____________________________________________________________
consensus AAGCGGAATTCAATGATTGTCTCGAGAAAAACCGAATAAGCATTTCAATGGAACTTGTCA
. : . : . : . : . : . :
whvs7e09.R+ CGGCTGTATTAGGAGACCATGGGCAAAGGCCTAAGGATGATTATGCTGTGATTACAGCTG
24990481+ CGGCTGTATTAGGAGACCATGGGCAAAGGCCTAAGGATGATTATGCTGTGATTACAGCTG
____________________________________________________________
consensus CGGCTGTATTAGGAGACCATGGGCAAAGGCCTAAGGATGATTATGCTGTGATTACAGCTG
Number of segment pairs = 600; number of pairwise comparisons = 172
'+' means given segment; '-' means reverse complement
Overlaps Containments No. of Constraints Supporting Overlap
******************* SCL292Contig1 ********************
whsctal16b21.F+
9428019- is in whsctal16b21.F+
whxq28060f16.F+ is in whsctal16b21.F+
whxq29060f16.F+ is in whxq28060f16.F+
whxn14056i15.F+
whxq28060f16.R-
9362629- is in whxq28060f16.R-
whxq29060f16.R-
******************* SCL292Contig2 ********************
et|tplb0013k02+
whthls7l03.R+ is in et|tplb0013k02+
9561217+ is in et|tplb0013k02+
9561363+ is in et|tplb0013k02+
93033944+ is in 9561363+
14317289+ is in et|tplb0013k02+
32659187+ is in et|tplb0013k02+
32663705+ is in et|tplb0013k02+
93191970+ is in et|tplb0013k02+
32786704+ is in et|tplb0013k02+
whsctal16b21.R+ is in 32786704+
33217630+ is in et|tplb0013k02+
whxn14056i15.R+ is in 33217630+
55676375+ is in et|tplb0013k02+
93032669- is in et|tplb0013k02+
whv16n6d15.F- is in 93032669-
whv16n6d15.R+ is in 93032669-
DETAILED DISPLAY OF CONTIGS
******************* SCL292Contig1 ********************
. : . : . : . : . : . :
whsctal16b21.F+ GGCATACTATAGCATCATTGTGGTCTGGAAACATTGGAGGGCTATAATGAAAAAAAATAC
____________________________________________________________
consensus GGCATACTATAGCATCATTGTGGTCTGGAAACATTGGAGGGCTATAATGAAAAAAAATAC
. : . : . : . : . : . :
whsctal16b21.F+ TAAATTGAGTTGAAGTCCAAGGAATTAGTGCCATACAACAACTGAAACTTTCTGGTGCTA
9428019- AGTGCCATACAACAACTGAAACTTTCTGGTGCTA
whxq28060f16.F+ TCAAGTCCAAGGAATTAGTGCCATACAACAACTGAAACTTTCTGGTGCTA
whxq29060f16.F+ TCAAGTCCAAGGAATTAGTGCCATACAACAACTGAAACTTTCTGGTGCTA
whxn14056i15.F+ CCATACAACAACTGAAACTTTCTGGTGCTA
____________________________________________________________
consensus TAAATTGAGTTCAAGTCCAAGGAATTAGTGCCATACAACAACTGAAACTTTCTGGTGCTA
. : . : . : . : . : . :
whsctal16b21.F+ CTTACACCTGGGTCAGGCTCTTGCAGAGCTGGAGCAAATTTGTAGCTCAGCGTTGCAATG
9428019- CTTACACCTGGGTCAGGCTCTTGCAGAGCTGGAGCAAATTTGTAGCTCAGCGTTGCAATG
whxq28060f16.F+ CTTACACCTGGATCAGGCTCGTGCTGAGCTGGAGCAAATTTGTAGCTCAGCATTGCAATG
whxq29060f16.F+ CTTACACCTGGATCAGGCTCGTGCTGAGCTGGAGCAAATTTGTAGCTCAGCATTGCAATG
whxn14056i15.F+ CTTACACCTGGATCAGGCTCGTGCTGAGCTGGAGCAAATTTGTAGCTCAGCATTGCAATG
whxq28060f16.R- ATCAGGCTCGTGCTGAGCTGGAGCAAATTTGTAGCTCAGCATTGCAATG
whxq29060f16.R- ATCAGGCTCGTGCTGAGCTGGAGCAAATTTGTAGCTCAGCATTGCAATG
____________________________________________________________
consensus CTTACACCTGGATCAGGCTCGTGCTGAGCTGGAGCAAATTTGTAGCTCAGCATTGCAATG
需要的结果是:
SCL279Contig1 whvs7e09.R 24990481
SCL279Contig2 et|RFL_Contig5917 24993123 whsctal27f01.R whxn27054l15.R whsctal3n06.F whthkles18l09.R whsctal3n06.R 32771503 32678311 whxn27054l15.F-
SCL292Contig1 whsctal16b21.F 9428019 whxq28060f16.F whxq29060f16.F whxn14056i15.F whxq28060f16.R 9362629 whxq29060f16.R
SCL292Contig2 et|tplb0013k02 whthls7l03.R 9561217 9561363 93033944 14317289 32659187 32663705 93191970 32786704 whsctal16b21.R 33217630 whxn14056i15.R 55676375 93032669 whv16n6d15.F whv16n6d15.R
我写的代码
# encoding: utf-8
with open('1.txt', 'r') as f:
a = []
b = []
for num, line in enumerate(f):
if 'Overlaps' in line:
a.append(num)
if 'DETAILED' in line:
b.append(num)
f.seek(0, 0)
for i, j in zip(a, b):
lines = f.readlines()[i:j+1] #读取指定的行数
if '**' in ''.join(lines):
for i in range(len(lines)): #下边就是对获取的行数进行格式转换
new = lines[i].strip().split()
if '*******************' in new:
print '\n' + new[1],
elif len(new) == 1:
print new[0][:-1],
elif 'in' in new:
print new[0][:-1],
f.seek(0, 0)
结果大概有10万条,上述代码运行了6个小时还没运行完,哪位大侠有优化的好点子吗?谢谢指点
對於你的轉換細節我不了解,所以這邊我不多做考慮,純粹就你的代碼上的問題來提出一些可能增進效能的建議。
首先看的出來你採取的是二段式的處理手法。
先 parse 過整個檔案,找出包含關鍵字
'Overlaps'
和'DETAILED'
的行數,並將行序記錄下來。對於每一個由
'Overlaps'
和'DETAILED'
劃分出來的區間,進行文本的轉換處理。這種兩段式的處理手法本來就會造成效率的低落,因為大部分的轉換工作都可以一段式的完成,我們應該盡可能地邊讀邊處理。
不過兩段式的處理還不是造成跑不完的關鍵主因,主要還是你每處理一個 區間 就必須要重讀整個文件!
你會發現從頭到尾你讀了這個文件
len(a)+1
次。對於十萬行的文件來說,這實在是太傷了。那應該怎麼辦呢? 下面也許是一個可行的辦法:
這個代碼從頭到尾只讀了文件一次,對於區區十萬行的讀檔跟文件轉換,Python 絕對沒什麼問題,基本上也不需要任何平行的處理。
另外我在這篇中的回答可能對你也有幫助,你可以參考看看,加油。