新手上路，请多包涵

我正在尝试为 .pdb 文件编写一个快速解析器（它们显示蛋白质结构）。我正在查看的蛋白质示例是 KRAS（常见于癌症），位于此处： http ://www.rcsb.org/pdb/files/3GFT.pdb

如果向下滚动足够远，您将看到如下所示的一行：ATOM 1 N MET A 1 63.645 97.355 31.526 1.00 33.80 N

第一个元素“原子”意味着这与蛋白质中的实际原子有关。 1 与一般计数有关，N 与原子类型有关，“MET”是残基的名称，“A”与链的类型有关，1（第二个“1”）是原子计数和那么接下来的 3 个数字是空间中的 xyz 位置。

我需要的输出是这样的（下面的“1”对应于原子计数，而不是一般计数）： MET A 1 63.645 97.355 31.526

使事情变得更复杂的是，有时原子数（在本例中为第二个“1”）为负数。在那些情况下，我想跳过该行并继续，直到我找到一个积极的条目，因为这些元素与找到位置所需的生物化学有关，而不是实际的蛋白质。更糟糕的是，有时你会得到这样一行：

原子 139 CA 艾尔 A 21 63.260 111.496 12.203 0.50 12.87 C

ATOM 140 CA 胆汁 A 21 63.275 111.495 12.201 0.50 12.17 C

虽然他们都提到残基 21，但生物化学不够精确，无法获得准确的位置，因此他们提供了两种选择。理想情况下，我会指定“1”、“2”或其他任何内容，但如果我只选择第一个选项就可以了。最后，对于我原始示例中的原子类型（“N”），我只想获得带有“CA”的那些行。

我是 python 的新手，我接受过生物统计方面的培训，所以我想知道最好的方法是什么？我是否使用 for 循环逐行解析？有没有办法在 python 中更快地完成它？我如何处理某些原子的双重条目？

我知道这有点问，但一些指导会很有帮助！我已经使用 R 编写了所有统计位，但现在我只需要以正确的格式获取我的文件！

谢谢！

原文由 user1357015 发布，翻译遵循 CC BY-SA 4.0 许可协议

python

阅读 871

2 个回答

得票最新

社区维基

发布于
2023-01-09

✓ 已被采纳

那是一个很长的描述。我不确定我是否正确理解了所有内容 :-) 如果字段（对于以 ATOM 开头的行）是固定的，您可以使用拆分和一些 comaprisons。我已经使用散列来查看是否已经看到该条目以消除您想要的重复项。希望这会给你一个开始，

 visited = {}
for line in open('3GFT.pdb'):
    list = line.split()
    id = list[0]
    if id == 'ATOM':
        type = list[2]
        if type == 'CA':
            residue = list[3]
            type_of_chain = list[4]
            atom_count = int(list[5])
            position = list[6:8]
            if atom_count >= 0:
                if type_of_chain not in visited:
                    visited[type_of_chain] = 1
                    print residue,type_of_chain,atom_count,' '.join(position)

会输出，

 MET A 1 62.935 97.579
GLY B 0 39.524 105.916
GLY C 0 67.295 110.376
MET D 1 59.311 124.106
GLY E 0 44.038 96.819
GLY F 0 44.187 123.590

原文由 dpp 发布，翻译遵循 CC BY-SA 3.0 许可协议

社区维基

发布于
2023-01-09

我有点惊讶没有人提到 BioPython 的 Bio.PDB 包。自己编写 PDB 解析器是一个相当严重的不必要的重新发明案例，我的意思是重新实现轮子。

BioPython 是一个有用的包集合，可用于处理其他类型的生物数据（例如蛋白质或核酸序列）。

更新

我通过示例添加了一些用于下载和读取 PDB 文件的函数：

 import os
import sys
import urllib.request

import Bio
import Bio.PDB
import Bio.SeqRecord

def download_read_pdb(pdbcode, datadir, keepfile=True):
    """
    Downloads a PDB file from the Internet and saves it in a data directory.
    Then it reads and returns the structure inside.
    :param pdbcode: The standard PDB ID e.g. '3ICB'
    :param datadir: The directory where the downloaded file will be saved
    :param keepfile: if False, then the downloaded file will be deleted (default: keep the downloaded file)
    :return: a Bio.PDB Structure object or None if something went wrong
    """
    pdbfilenm = download_pdb(pdbcode, datadir)
    if pdbfilenm is None:
        return None
    struct = read_pdb(pdbcode, pdbfilenm)
    if not keepfile:
        os.remove(pdbfilenm)
    return struct

def download_pdb(pdbcode, datadir, downloadurl="http://files.rcsb.org/download/"):
    """
    Downloads a PDB file from the Internet and saves it in a data directory.
    :param pdbcode: The standard PDB ID e.g. '3ICB' or '3icb'
    :param datadir: The directory where the downloaded file will be saved
    :param downloadurl: The base PDB download URL, cf.
        `https://www.rcsb.org/pages/download/http#structures` for details
        Note that the unencrypted HTTP protocol is used by default
        to avoid spurious OpenSSL errors...
    :return: the full path to the downloaded PDB file or None if something went wrong
    """
    pdbfn = pdbcode + ".pdb"
    url = downloadurl + pdbfn
    outfnm = os.path.join(datadir, pdbfn)
    try:
        urllib.request.urlretrieve(url, outfnm)
        return outfnm
    except Exception as err:
        # all sorts of things could have gone wrong...
        print(str(err), file=sys.stderr)
        return None

def read_pdb(pdbcode, pdbfilenm):
    """
    Read a PDB structure from a file.
    :param pdbcode: A PDB ID string
    :param pdbfilenm: The PDB file
    :return: a Bio.PDB.Structure object or None if something went wrong
    """
    try:
        pdbparser = Bio.PDB.PDBParser(QUIET=True)   # suppress PDBConstructionWarning
        struct = pdbparser.get_structure(pdbcode, pdbfilenm)
        return struct
    except Exception as err:
        print(str(err), file=sys.stderr)
        return None

def extract_seqrecords(pdbcode, struct):
    """
    Extracts the sequence records from a Bio.PDB structure.
    :param pdbcode: the PDB ID of the structure, needed to add a sequence ID to the result
    :param struct: a Bio.PDB.Structure object
    :return: a list of Bio.SeqRecord objects
    """
    ppb = Bio.PDB.PPBuilder()
    seqrecords = []
    for i, chain in enumerate(struct.get_chains()):
        # extract and store sequences as list of SeqRecord objects
        pps = ppb.build_peptides(chain)    # polypeptides
        seq = pps[0].get_sequence() # just take the first, hope there's no chain break
        seqid = pdbcode + chain.id
        seqrec = Bio.SeqRecord.SeqRecord(seq, id=seqid,
            description="Sequence #{}, {}".format(i+1, seqid))
        seqrecords.append(seqrec)
    return seqrecords

def get_calphas(struct):
    """
    Extracts the C-alpha atoms from a PDB structure.
    :param struct: A Bio.PDB.Structure object.
    :return: A list of Bio.PDB.Atom objects representing the C-alpha atoms in `struct`.
    """
    calphas = [ atom for atom in struct.get_atoms() if atom.get_fullname() == " CA " ]
    return calphas

我知道这是一个老问题，但也许有人仍然觉得这个指针有用。

原文由 András Aszódi 发布，翻译遵循 CC BY-SA 4.0 许可协议

查看全部 2 个回答

推荐问题

在 Python 中解析 .pdb 文件

字节的 trae AI IDE 不支持类似 vscode 的 ssh remote 远程开发怎么办？

DataCap 中验证码无法显示，后台出现 NullPointerException 错误?

发现深拷贝和浅拷贝效果一致：请问一下有什么区别呢？

如何实现一个深拷贝函数？

Python 成员变量在多个子类实例间共享，如何避免？

分解质因素的算法很难，理解不了。请问有哪位大佬可以进行解释一下呢？

为什么 Qwen2.5-Omni-7B 官方教程都报错 Cannot import available module of Qwen2_5OmniModel in modelscope ？

Stack Overflow 翻译