如何对文件中的序列长度做过滤处理

有一个文本文件,内容包括序列id和其序列内容,以>开头的行表示id,其后面的几行表示序列内容,请教该如何过滤掉长度小于l的序列。部分文件如下:

>NNH52_c0_seq1
ATGAGCATTGCCTTACCTCCTCTAAATCATACATAATATTGCAGGCTCTCCCTTTTTACC
ACCTCGCCAATTACCTTTACAAAGAGGATTAAGAATTTAATCCTGAGCAACCACCTAGTT
TTATTAATGCACATTGACATAGCAGTATCTTTTTGCTCATAGAACAAGTTCAGATATTAT
TTGGCCTTTGATAA
>NNH262_c0_seq1
AATTCTCACGCCTCCAGTGAACCTTATAGTATTAATGACAGAGTTACCACCTCCGACAAA
GGAACGACGTTTCAAACCCTGCGTCTCTCTACCTTGTCAGGCGAGTAGGCGGAACGACTC
TTGGACATCCCACCTGTGATCACCG
>NNH774_c0_seq1
GGTAGCCCCGCCGTGGATGCGCCGCCATCTCTTCGTCTGTGTCTCCTTCCCTTCTCTCTC
TCTCTCTCCCCACTTGAAACTGGGTTGGCCCAGAAATCTTCCGTTCCGCTGAAAAGCTCC
TCTCTCTCTCTTCTCTCTCTCTCTCTTTCTCT
AGGCAG
>NNH4124_c0_seq1
ATGATATCCTTCGCCTACAACCATCAGTACCAGCAGATCGCCGCTGCAGGAGGGTACCCA
CTG

现在文件很大所以打算读一条序列处理一条序列,如果全部读入字典的话太占内存了,我自己写了个,但是感觉有点啰嗦,请教该如何处理,谢谢!
附上自己代码吧。。

def rmShort(in_file, out_file, length):
    cunt = defaultdict(str)
    with open(in_file) as f_in, open(out_file, 'w') as f_out:
        for line in f_in:
            if line.startswith('>'):
                try:  # 在读取后一条序列的时候处理前一条序列,所以刚刚读取第一行的时候会报错
                    seq = cunt.pop(id_)
                    if len(seq) > length:
                        f_out.write(id_)
                        f_out.write(seq)
                except:
                    pass
                finally:
                    id_ = line
            else:
                cunt[id_] += line
        for seq_id, seq in cunt.items():  # 对于最后一行,无法读取其后一行时处理它,故拿出来专门处理
            if len(seq) > length:
                f_out.write(seq_id)
                f_out.write(seq)
阅读 3.6k
3 个回答
def read_custom_file(in_file):
    with open(in_file, 'r') as f:
        seq = ''
        tag = False
        while True:
            pos_previous = f.tell()
            line = f.readline()
            if not line:
                yield seq
                break
            else:
                if line.startswith('>') and tag:
                    if len('seq') < 10:
                        # add your custom code here
                        pass
                    else:
                        yield seq
                    seq = ''
                    f.seek(pos_previous)
                    tag = False
                elif line.startswith('>'):
                    tag = True
                    seq += line
                else:
                    seq += line

for line in read_custom_file('sample.txt'):
    print(line)
for l in f.readlines():
    if len(l) > 1:
        # do...
        

做生信的吧? 何必重复造轮子

  1. Biopython
  2. cutadapt
from Bio import SeqIO

input_seq_iterator = SeqIO.parse("test.fasta", "fasta")
long_seq_iterator = (record for record in input_seq_iterator if len(record.seq) > 1)

SeqIO.write(long_seq_iterator, "long_seqs.fasta", "fasta")
  • 或者自己造个轮子
def fasta_parser(handle):
    # 跳过开头可能的注释, 空格之类内容
    while True:
        line = handle.readline()
        if not line:
            return  # 处理文件过早结束或者为空的情形
        if line.startswith(">"):
            break

    while True:
        if not line.startswith(">"):
            raise ValueError(
                "Tags in Fasta files should start with '>' character")
        tag = line.strip()
        seqs = ""
        line = handle.readline()
        # 处理换行序列
        while True:
            if not line:
                break
            if line.startswith(">"):
                break
            seqs += line.strip()  # 拼接
            line = handle.readline()

        # 去除序列中可能的空格及\r
        yield tag, seqs.replace(" ", "").replace("\r", "")

        if not line:
            return  # 文件读完, 停止迭代


with open("test.txt", "r") as fh:
    for i in fasta_parser(fh):
        if len(i[1]) > 10:
            print('\n'.join(i))
撰写回答
你尚未登录,登录后可以
  • 和开发者交流问题的细节
  • 关注并接收问题和回答的更新提醒
  • 参与内容的编辑和改进,让解决方法与时俱进
推荐问题