Abstract: Huawei and the research group of Professor Gao Yiqin, of Peking University's Biomedical Frontier Innovation Center (BIOPIC), Peking University's School of Chemistry and Molecular Engineering, and Shenzhen Bay Laboratory, have jointly launched an open source protein multiple sequence alignment (Protein MSA) data set.
This article is shared from the HUAWEI CLOUD community post "HUAWEI CLOUD and Peking University BIOPIC jointly released the open source data set for protein multiple sequence alignment", author: MKT Huang Buzheng.
Recently, Huawei and the research group of Professor Gao Yiqin, of Peking University's Biomedical Frontier Innovation Center (BIOPIC), Peking University's School of Chemistry and Molecular Engineering, and Shenzhen Bay Laboratory, jointly launched a protein multiple sequence alignment (Protein MSA) data set. The data set is intended to help researchers develop advanced AI models, deepen their understanding of protein structure, function, and evolution, and carry out protein design and engineering. It will be released on the HUAWEI CLOUD AI Gallery platform. The related code and data set description will be open sourced, and the data set will be regularly expanded and maintained on top of MindSpore, Huawei's full-scenario AI computing framework, with the aim of providing a high-quality data-sharing solution to industry, academic, and research teams around the world.
The open source Protein MSA data set fully covers the protein sequences in the latest release (February 2021) of the UniRef50 database. Using the search method regarded as the academic "gold standard", a full MSA search and alignment was carried out for about 50 million protein sequences, with an average MSA depth greater than 1,000. It is currently the world's largest open source protein MSA data set, with the most recent reference data and the widest coverage; the previous largest open source MSA data set contains 100,000 protein MSAs [1].
More than 440 million protein sequences are known, but it is difficult to understand the relationships between proteins from databases of individual sequences alone. The Protein MSA database is a large-scale "relational" database that annotates the relationships between protein sequences: for related sequences it records pairwise information such as similarity, evolutionary relationships, and the distribution of mutation sites. This information is extremely important for predicting protein structure and function.
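To make the kind of "relational" information described above concrete, here is a minimal Python sketch that computes pairwise sequence identity and the per-column residue distribution for a tiny alignment. The sequences in `TOY_MSA` and the function names are hypothetical and chosen only for illustration; they are not taken from the Protein MSA data set or its tooling.

```python
from collections import Counter

# Toy alignment with hypothetical sequences; '-' marks a gap.
TOY_MSA = [
    "MKTAYIA",
    "MKTAYLA",
    "MRT-YIA",
    "MKSAYIG",
]


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


def column_profiles(msa):
    """Per-column residue frequencies, ignoring gaps."""
    profiles = []
    for column in zip(*msa):
        counts = Counter(r for r in column if r != "-")
        total = sum(counts.values())
        profiles.append({r: n / total for r, n in counts.items()} if total else {})
    return profiles


def variable_columns(msa):
    """Indices of columns where more than one residue type is observed."""
    return [i for i, p in enumerate(column_profiles(msa)) if len(p) > 1]
```

In this toy alignment the first two rows differ at a single position, and `variable_columns(TOY_MSA)` picks out the columns where substitutions occur. Real MSAs in the data set are far deeper, so the same statistics would normally be computed with vectorized tools rather than plain Python loops.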
To better serve researchers across fields, the Protein MSA data set will be provided in multiple data formats. The original data set (nearly 30 TB) will be stored in the standard text formats of the UniRef database series [2] and the UniClust database [3], and will be split and compressed according to sequence length. To make the data directly usable by researchers in the AI field, the text-format data will also be converted into compressed floating-point tensors, with data interfaces provided for existing AI frameworks such as MindSpore.
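As an illustration of what converting a text-format MSA into a floating-point tensor can look like, the following sketch one-hot encodes a small alignment with NumPy. It is not the data set's actual conversion pipeline: the alphabet ordering, the extra "unknown" channel, the function names, and the assumption of plain aligned FASTA-style input with equal-length rows are all choices made here for the example (A3M-style lowercase insertion states, for instance, are not handled).

```python
import numpy as np

# Assumed encoding: 20 standard amino acids plus the gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
UNKNOWN = len(ALPHABET)  # extra channel for any other symbol (e.g. X, B, Z)

# Hypothetical three-row alignment used only for demonstration.
EXAMPLE = ">query\nMKT-Y\n>hit1\nMRTAY\n>hit2\nMKX-Y\n"


def parse_aligned_fasta(text):
    """Return the aligned sequences from FASTA-style text (headers start with '>')."""
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences


def msa_to_tensor(sequences):
    """One-hot encode an alignment as a float32 array of shape (depth, length, channels)."""
    if not sequences:
        raise ValueError("empty alignment")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all aligned sequences must have the same length")
    tensor = np.zeros((len(sequences), length, len(ALPHABET) + 1), dtype=np.float32)
    for i, seq in enumerate(sequences):
        for j, ch in enumerate(seq):
            tensor[i, j, INDEX.get(ch, UNKNOWN)] = 1.0
    return tensor


tensor = msa_to_tensor(parse_aligned_fasta(EXAMPLE))
```

An array like this could then be written in compressed form, for example with `np.savez_compressed`, and loaded by whichever framework is used for training. The storage layout actually used by the data set is documented in its open source description rather than here.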
Professor Gao Yiqin said: "We encourage and look forward to experts from bioinformatics, data science, and AI research coming together and collaborating to introduce, improve, or design new AI models that fully explore the 'secrets of nature' hidden in the Protein MSA data set."
From a scientific point of view, the quantity and quality of MSAs largely determine the prediction speed and accuracy of the most advanced structure prediction models, and the non-parametric algorithms that generate MSAs remain one of the main rate-limiting steps in many protein prediction methods. The Protein MSA database can therefore serve as pre-training material for these structure prediction models, allowing them to mine sequence information and even to generate new sequence features quickly. This has great potential value for addressing the highly variable sequences and orphan sequences encountered in protein research and design.
Releasing the database through the HUAWEI CLOUD AI Gallery platform ensures that users in China and abroad can access and download the data set. The platform also provides a data maintenance solution that can be continuously updated and expanded, along with support for downstream AI applications and deployment, combining the strengths of industry, academia, and research. In addition, Huawei and Professor Gao Yiqin's research group at Peking University jointly developed and open sourced MindSPONGE, the first molecular dynamics software developed in China. In the future, Huawei will work with more academic and scientific partners to create a new data-driven research model in broader scientific computing fields such as materials, biology, and medicine.
Attached:
Data set open source description: https://gitee.com/mindspore/mindscience/tree/master/MindSPONGE/protein_msa
Data set download address: https://marketplace.huaweicloud.com/markets/aihub/datasets/detail/?content_id=5802def2-5fbd-40da-85d8-a4541d1c6f1e
[1] AlQuraishi, Mohammed. "ProteinNet: a standardized data set for machine learning of protein structure." BMC Bioinformatics 20.1 (2019): 1-10.
[2] Suzek, B. E., Wang, Y., Huang, H., McGarvey, P. B., Wu, C. H., and the UniProt Consortium. "UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches." Bioinformatics 31.6 (2015): 926-932.
[3] Mirdita, M., von den Driesch, L., Galiez, C., Martin, M. J., Söding, J., and Steinegger, M. "Uniclust databases of clustered and deeply annotated protein sequences and alignments." Nucleic Acids Research (2016).