已知的基因组序列是由ACGT四个字母组成的一大串字符如“ATCGCTCGCTCGAGCAGTCAGTCA......”,求所有在500bp(即500个字母)内所有的出现了3次的连续9bp(即9个字母)的类型,在多个500bp窗口出现的重复的只记为一个。!

基因组中寻找特定要求序列，求一个计算速度快的方法

Sxuan

4921117

发布于
2016-09-12

更新于
2016-09-23

✓ 已被采纳

图片描述 1

ferstar

1.2k1111

发布于
2016-09-11

这个问题蛮有意思, 贴上我的代码(用到了生成器以及多线程)

from multiprocessing import cpu_count
from multiprocessing.dummy import Pool


def generate_index(n, step=5):
    for i in range(n):
        if i + step < n:
            yield (i, i + step)
        else:
            yield (i, None)
            break


def pick_up(st):
    for item in (st[i: j] for i, j in generate_index(len(st), 9) if j is not None):
        if st.count(item) == 3:
            result.add(item)


with open('gene.txt', 'r') as fh:
    src_lst = fh.readline().strip()
    sep_lst = [src_lst[i: j] for i, j in generate_index(len(src_lst), 500)]

result = set()
pool = Pool(cpu_count())
pool.map(pick_up, sep_lst)
pool.close()
pool.join()
print(result)

随便捏了点数据, 居然真找到这么几个:

{'TCGATCGAT', 'CGATCGATC', 'ATCGATCGA', 'GATCGATCG'}

liuhao2050

14019

发布于
2016-09-12

目测可以使用collections 中的 counter

忆先

4.2k2317

发布于
2016-09-11

每九个序列一读和之前的两组组九个序列做比较，如果不同就丢弃掉之前的那一组。

ans = []
s = list(s)
first, s = s[:9], s[9:]
second, s = s[:9], s[9:]
third, s = s[:9], s[9:]
while len(s) > 0:
    if first == second and second == third:
        ans.append(first)
    else:
        first, second, third, s = second, third, s[:9], s[9:]

FGFW

40217

发布于
2016-09-12

更新于
2016-09-12

julia有个基因处理方面的库
也可以试试。

https://github.com/OpenGene/O...

基因组中寻找特定要求序列，求一个计算速度快的方法

字节的 trae AI IDE 不支持类似 vscode 的 ssh remote 远程开发怎么办？

DataCap 中验证码无法显示，后台出现 NullPointerException 错误?

发现深拷贝和浅拷贝效果一致：请问一下有什么区别呢？

如何实现一个深拷贝函数？

Python 成员变量在多个子类实例间共享，如何避免？

为什么 Qwen2.5-Omni-7B 官方教程都报错 Cannot import available module of Qwen2_5OmniModel in modelscope ？

Spark-TTS-0.5B 的 requirements.txt 在哪里？