python 文本格式转换代码优化

格式转换代码见下边,就是代码运行起来很慢,想看看大家是否有优化方案


Number of segment pairs = 182; number of pairwise comparisons = 40
'+' means given segment; '-' means reverse complement

Overlaps            Containments  No. of Constraints Supporting Overlap

******************* SCL279Contig1 ********************
whvs7e09.R+
24990481+
******************* SCL279Contig2 ********************
et|RFL_Contig5917+
                    24993123+ is in et|RFL_Contig5917+
                    whsctal27f01.R+ is in et|RFL_Contig5917+
                    whxn27054l15.R+ is in et|RFL_Contig5917+
                    whsctal3n06.F- is in et|RFL_Contig5917+
                    whthkles18l09.R+ is in et|RFL_Contig5917+
                    whsctal3n06.R+ is in et|RFL_Contig5917+
                    32771503+ is in whsctal3n06.R+
                    32678311+ is in et|RFL_Contig5917+
whxn27054l15.F-

DETAILED DISPLAY OF CONTIGS
******************* SCL279Contig1 ********************
                          .    :    .    :    .    :    .    :    .    :    .    :
whvs7e09.R+           CCGCAGGGGTCGCGGCCGAGCAGGGCGGCGCCGGCGACGAGCTCTCCGCTCTGTTCAAGG
                      ____________________________________________________________
consensus             CCGCAGGGGTCGCGGCCGAGCAGGGCGGCGCCGGCGACGAGCTCTCCGCTCTGTTCAAGG

                          .    :    .    :    .    :    .    :    .    :    .    :
whvs7e09.R+           AGTTCTCGCTAGACAGCAGCAGCACCTTCGCCGAGGCGCGGATCCGGGCCACCTTCTACC
                      ____________________________________________________________
consensus             AGTTCTCGCTAGACAGCAGCAGCACCTTCGCCGAGGCGCGGATCCGGGCCACCTTCTACC

                          .    :    .    :    .    :    .    :    .    :    .    :
whvs7e09.R+           CCAAGTTCGAGAACGAGGAATCCGACCAGGAGTCAAGAACCCGGATGATTGAGATGGTGT
                      ____________________________________________________________
consensus             CCAAGTTCGAGAACGAGGAATCCGACCAGGAGTCAAGAACCCGGATGATTGAGATGGTGT

                          .    :    .    :    .    :    .    :    .    :    .    :
whvs7e09.R+           CACAAGGATTAGCTACCATGGAGGTTACGCTCAAGCATTCAGGGTCTTTGTTCATGTATG
24990481+                                                               TTCATGTATG
                      ____________________________________________________________
consensus             CACAAGGATTAGCTACCATGGAGGTTACGCTCAAGCATTCAGGGTCTTTGTTCATGTATG

                          .    :    .    :    .    :    .    :    .    :    .    :
whvs7e09.R+           CTGGTAACCGTGGTGGTGCATATGCCAAGAACAGCTTTGGAAATATCTACACTGCTGTGG
24990481+             CTGGTAACCGTGGTGGTGCATATGCCAAGAACAGCTTTGGAAATATCTACACTGCTGTGG
                      ____________________________________________________________
consensus             CTGGTAACCGTGGTGGTGCATATGCCAAGAACAGCTTTGGAAATATCTACACTGCTGTGG

                          .    :    .    :    .    :    .    :    .    :    .    :
whvs7e09.R+           GCGTTTTTGTTTTGGGTCGCTTGTTTCGTGAAGCTTGGGGGAGAGAAGCTCCTAAAATGC
24990481+             GCGTTTTTGTTTTGGGTCGCTTGTTTCGTGAAGCTTGGGGGAGAGAAGCTCCTAAAATGC
                      ____________________________________________________________
consensus             GCGTTTTTGTTTTGGGTCGCTTGTTTCGTGAAGCTTGGGGGAGAGAAGCTCCTAAAATGC

                          .    :    .    :    .    :    .    :    .    :    .    :
whvs7e09.R+           AAGCGGAATTCAATGATTGTCTCGAGAAAAACCGAATAAGCATTTCAATGGAACTTGTCA
24990481+             AAGCGGAATTCAATGATTGTCTCGAGAAAAACCGAATAAGCATTTCAATGGAACTTGTCA
                      ____________________________________________________________
consensus             AAGCGGAATTCAATGATTGTCTCGAGAAAAACCGAATAAGCATTTCAATGGAACTTGTCA

                          .    :    .    :    .    :    .    :    .    :    .    :
whvs7e09.R+           CGGCTGTATTAGGAGACCATGGGCAAAGGCCTAAGGATGATTATGCTGTGATTACAGCTG
24990481+             CGGCTGTATTAGGAGACCATGGGCAAAGGCCTAAGGATGATTATGCTGTGATTACAGCTG
                      ____________________________________________________________
consensus             CGGCTGTATTAGGAGACCATGGGCAAAGGCCTAAGGATGATTATGCTGTGATTACAGCTG


Number of segment pairs = 600; number of pairwise comparisons = 172
'+' means given segment; '-' means reverse complement

Overlaps            Containments  No. of Constraints Supporting Overlap

******************* SCL292Contig1 ********************
whsctal16b21.F+
                    9428019- is in whsctal16b21.F+
                    whxq28060f16.F+ is in whsctal16b21.F+
                    whxq29060f16.F+ is in whxq28060f16.F+
whxn14056i15.F+
whxq28060f16.R-
                    9362629- is in whxq28060f16.R-
whxq29060f16.R-
******************* SCL292Contig2 ********************
et|tplb0013k02+
                    whthls7l03.R+ is in et|tplb0013k02+
                    9561217+ is in et|tplb0013k02+
                    9561363+ is in et|tplb0013k02+
                    93033944+ is in 9561363+
                    14317289+ is in et|tplb0013k02+
                    32659187+ is in et|tplb0013k02+
                    32663705+ is in et|tplb0013k02+
                    93191970+ is in et|tplb0013k02+
                    32786704+ is in et|tplb0013k02+
                    whsctal16b21.R+ is in 32786704+
                    33217630+ is in et|tplb0013k02+
                    whxn14056i15.R+ is in 33217630+
                    55676375+ is in et|tplb0013k02+
                    93032669- is in et|tplb0013k02+
                    whv16n6d15.F- is in 93032669-
                    whv16n6d15.R+ is in 93032669-

DETAILED DISPLAY OF CONTIGS
******************* SCL292Contig1 ********************
                          .    :    .    :    .    :    .    :    .    :    .    :
whsctal16b21.F+       GGCATACTATAGCATCATTGTGGTCTGGAAACATTGGAGGGCTATAATGAAAAAAAATAC
                      ____________________________________________________________
consensus             GGCATACTATAGCATCATTGTGGTCTGGAAACATTGGAGGGCTATAATGAAAAAAAATAC

                          .    :    .    :    .    :    .    :    .    :    .    :
whsctal16b21.F+       TAAATTGAGTTGAAGTCCAAGGAATTAGTGCCATACAACAACTGAAACTTTCTGGTGCTA
9428019-                                        AGTGCCATACAACAACTGAAACTTTCTGGTGCTA
whxq28060f16.F+                 TCAAGTCCAAGGAATTAGTGCCATACAACAACTGAAACTTTCTGGTGCTA
whxq29060f16.F+                 TCAAGTCCAAGGAATTAGTGCCATACAACAACTGAAACTTTCTGGTGCTA
whxn14056i15.F+                                     CCATACAACAACTGAAACTTTCTGGTGCTA
                      ____________________________________________________________
consensus             TAAATTGAGTTCAAGTCCAAGGAATTAGTGCCATACAACAACTGAAACTTTCTGGTGCTA

                          .    :    .    :    .    :    .    :    .    :    .    :
whsctal16b21.F+       CTTACACCTGGGTCAGGCTCTTGCAGAGCTGGAGCAAATTTGTAGCTCAGCGTTGCAATG
9428019-              CTTACACCTGGGTCAGGCTCTTGCAGAGCTGGAGCAAATTTGTAGCTCAGCGTTGCAATG
whxq28060f16.F+       CTTACACCTGGATCAGGCTCGTGCTGAGCTGGAGCAAATTTGTAGCTCAGCATTGCAATG
whxq29060f16.F+       CTTACACCTGGATCAGGCTCGTGCTGAGCTGGAGCAAATTTGTAGCTCAGCATTGCAATG
whxn14056i15.F+       CTTACACCTGGATCAGGCTCGTGCTGAGCTGGAGCAAATTTGTAGCTCAGCATTGCAATG
whxq28060f16.R-                  ATCAGGCTCGTGCTGAGCTGGAGCAAATTTGTAGCTCAGCATTGCAATG
whxq29060f16.R-                  ATCAGGCTCGTGCTGAGCTGGAGCAAATTTGTAGCTCAGCATTGCAATG
                      ____________________________________________________________
consensus             CTTACACCTGGATCAGGCTCGTGCTGAGCTGGAGCAAATTTGTAGCTCAGCATTGCAATG

需要的结果是:


SCL279Contig1 whvs7e09.R 24990481
SCL279Contig2 et|RFL_Contig5917 24993123 whsctal27f01.R whxn27054l15.R whsctal3n06.F  whthkles18l09.R whsctal3n06.R 32771503 32678311 whxn27054l15.F-
SCL292Contig1 whsctal16b21.F 9428019 whxq28060f16.F whxq29060f16.F whxn14056i15.F whxq28060f16.R 9362629 whxq29060f16.R 
SCL292Contig2 et|tplb0013k02 whthls7l03.R 9561217 9561363 93033944 14317289 32659187 32663705 93191970 32786704 whsctal16b21.R 33217630 whxn14056i15.R 55676375 93032669 whv16n6d15.F whv16n6d15.R

我写的代码


# encoding: utf-8

with open('1.txt', 'r') as f:
    a = []
    b = []
    for num, line in enumerate(f):
        if 'Overlaps' in line:
            a.append(num)
        if 'DETAILED' in line:
            b.append(num)
    f.seek(0, 0)
    for i, j in zip(a, b):
        lines = f.readlines()[i:j+1]  #读取指定的行数
        if '**' in ''.join(lines):
            for i in range(len(lines)): #下边就是对获取的行数进行格式转换
                new = lines[i].strip().split()
                if '*******************' in new:
                    print '\n' + new[1],
                elif len(new) == 1:
                    print new[0][:-1],
                elif 'in' in new:
                    print new[0][:-1],
        f.seek(0, 0)

结果大概有10万条,上述代码运行了6个小时还没运行完,哪位大侠有优化的好点子吗?谢谢指点

阅读 3.6k
3 个回答

對於你的轉換細節我不了解,所以這邊我不多做考慮,純粹就你的代碼上的問題來提出一些可能增進效能的建議。

首先看的出來你採取的是二段式的處理手法。

  1. 先 parse 過整個檔案,找出包含關鍵字 'Overlaps''DETAILED' 的行數,並將行序記錄下來。

  2. 對於每一個由 'Overlaps''DETAILED' 劃分出來的區間,進行文本的轉換處理。

這種兩段式的處理手法本來就會造成效率的低落,因為大部分的轉換工作都可以一段式的完成,我們應該盡可能地邊讀邊處理。

不過兩段式的處理還不是造成跑不完的關鍵主因,主要還是你每處理一個 區間 就必須要重讀整個文件!

    for i, j in zip(a, b): # 對每個區間
        lines = f.readlines()[i:j+1]  #读取指定的行数
        # 處理該區間

你會發現從頭到尾你讀了這個文件 len(a)+1 次。對於十萬行的文件來說,這實在是太傷了。

那應該怎麼辦呢? 下面也許是一個可行的辦法:

with open('1.txt', 'r') as f:
    flag = False  # 利用 flag 來判斷是否在區間內
    for line in f:  # 此處也不需要 num 了
        if 'Overlaps' in line:
            flag = True  # 進入區間
        if flag:
            new = line.strip().split()
            if '*******************' in new:
                print '\n' + new[1],
            elif len(new) == 1:
                print new[0][:-1],
            elif 'in' in new:
                print new[0][:-1],
        if 'DETAILED' in line:
            flag = False  # 區間結束

這個代碼從頭到尾只讀了文件一次,對於區區十萬行的讀檔跟文件轉換,Python 絕對沒什麼問題,基本上也不需要任何平行的處理。

另外我在這篇中的回答可能對你也有幫助,你可以參考看看,加油。

建议你将这10万条记录拆分为1万条1个文件,然后用线程跑,一般按照你PC的核数来设置,会节省很多时间。

用正则表达式。

推荐问题
宣传栏