Abstract: This article starts from the life science industry's huge demand for computing power, shows the needs and pain points the industry currently faces at the infrastructure layer, and explains why high-performance computing on the cloud can greatly help life science companies develop rapidly.
Article | Alibaba Cloud Elastic High Performance Computing Team
The life sciences industry is entering a golden age. Advances in medicine and people's pursuit of health are rapidly becoming new momentum for the entire life sciences industry chain. High-performance computing (HPC) plays a very important role in life sciences research, and as the industry grows rapidly, its migration to the cloud has become an unstoppable trend.
An industry's urgent demand for cloud computing often goes hand in hand with its rapid growth, because the cloud is elastic and convenient. Traditional IT, with its long procurement, delivery, and deployment cycle, cannot keep up with the rapidly growing IT needs of a fast-growing industry.
This article starts with the life science industry's huge demand for computing power, shows the needs and pain points the industry currently faces at the infrastructure level, and explains why high-performance computing on the cloud can contribute greatly to the rapid development of life science companies.
1. Demand for computing power in life sciences: large scale, high performance, and diverse resource types
At present, the two most important scenarios in the life science industry are computer-aided drug design and gene sequencing.
1. Computer-aided drug development
Since the start of the 21st century, as diseases have grown more complex, the number of druggable targets has gradually decreased, the difficulty and cost of new drug research and development have risen significantly, and the global success rate of new drug R&D has trended clearly downward. Innovative drug R&D is the key to a pharmaceutical company's core competitiveness and sustainable development, yet it is a systematic undertaking with high investment, high technical demands, high risk, and a long cycle. Pharmaceutical companies have therefore begun turning to AI, big data, and other computing technologies to assist drug research and development.
The whole process of drug development
A new drug usually has to pass through drug discovery, preclinical research, clinical trials, and regulatory review before it is finally approved for marketing. In drug discovery steps such as target discovery and compound synthesis, and in preclinical research steps such as compound screening, the computing power of HPC is often needed to assist drug design and accelerate the R&D process.
For protein structure prediction during target discovery, there are prediction schemes based on molecular dynamics and plane-wave methods, as well as solutions based on AI for Science.
The former is a typical HPC application scenario. Mature software such as VASP and GROMACS is available, and simulation results are obtained through numerical calculation. In this scheme, the amount of computing resources required scales with the size of the simulated problem.
At the same time, solutions such as AlphaFold2 have gradually emerged in the industry. They use AI to model the relationship between protein sequence and structure, learning from known sequences and structures in order to predict the structure of a protein. With powerful algorithms and computing power, DeepMind has shortened the computing time from months to hours. As the number of model parameters grows, the demand for computing power grows with it.
AI prediction of protein three-dimensional structure
Similarly, during virtual compound screening, pharmaceutical companies often need to dock millions of molecules against protein structures. Each ligand molecule consumes computing resources to obtain its docking score, and the scores are used to select the molecules that go on to experimental verification of activity. Docking a massive ligand library against protein structures therefore requires enormous computing power. A single machine can hardly handle virtual screening at this scale, which is why HPC clusters are so important for large-scale virtual screening tasks.
Lead Discovery Process
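The way a large ligand library is split across a cluster can be illustrated with a minimal sketch. It assumes that a lower docking score means a better fit and uses a placeholder scoring function (`toy_score`, hypothetical) in place of a real docking program. On an actual cluster each batch would be submitted as its own job, whereas here the batches simply run one after another.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def chunk(ligands: Sequence[str], chunk_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of the ligand library, one slice per cluster job."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(ligands), chunk_size):
        yield ligands[start:start + chunk_size]


def screen(
    ligands: Sequence[str],
    score_fn: Callable[[str], float],
    chunk_size: int,
    top_k: int,
) -> List[Tuple[float, str]]:
    """Score every ligand batch by batch and keep the top_k lowest scores."""
    scored: List[Tuple[float, str]] = []
    for batch in chunk(ligands, chunk_size):
        # Stand-in for one cluster job: score every ligand in this batch.
        scored.extend((score_fn(ligand), ligand) for ligand in batch)
    scored.sort()  # assumption: a lower docking score means a better fit
    return scored[:top_k]


def toy_score(smiles: str) -> float:
    """Placeholder score (hypothetical): longer strings score lower."""
    return -float(len(smiles))


library = ["C", "CC", "CCC", "CCCC", "CCCCC"]
hits = screen(library, toy_score, chunk_size=2, top_k=2)
print(hits)
```

Because each batch is independent, the number of batches running at once can follow the number of nodes available, which is what makes this workload a natural fit for an elastic cluster.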
Across target discovery, compound screening, and compound synthesis, different computing modes, parameters, and software often require different computing resources. The introduction of AI in particular raises the bar for offering many resource types in varied configurations.
2. Gene sequencing
The gene sequencing workflow mainly consists of sample loading (on the sequencer), sequencing file generation, and sequence alignment and result analysis (on computers), after which the result data and reports are delivered to research and medical institutions. Sequence alignment and analysis are extremely time-consuming and involve a large amount of specialized biotechnology software. The performance of the computing resources and the degree of program optimization are therefore crucial to the efficiency of bioinformatics research and development.
Gene Sequencing Business Process
A typical WGS (human whole genome sequencing) workflow involves reference index construction, read alignment, sorting, deduplication, BQSR (base quality score recalibration), and variant calling. The methods are diverse and the workflow is complex. Different steps use different software and parameters, such as BWA and GATK, and different bioinformatics tools may differ in concurrency and performance. Different tasks therefore need computing resources of different types and scales: not only elastic computing resources, but also diverse instance configurations.
Next Generation Gene Sequencing WGS Sequencing Process
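To make the point about per-step resource diversity concrete, the sketch below picks the smallest instance type that satisfies each step. It rests on assumed figures: the step names, CPU and memory numbers, and instance type names are all hypothetical and chosen only for illustration. They are not measurements of BWA or GATK, nor real instance specifications.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Resources:
    cpus: int
    mem_gb: int


# Hypothetical per-step requirements for a WGS workflow.
STEP_NEEDS: Dict[str, Resources] = {
    "align": Resources(cpus=32, mem_gb=64),
    "sort": Resources(cpus=8, mem_gb=128),
    "dedup": Resources(cpus=4, mem_gb=64),
    "bqsr": Resources(cpus=8, mem_gb=32),
    "call": Resources(cpus=16, mem_gb=64),
}

# Hypothetical instance catalog; the names are placeholders.
INSTANCE_TYPES: Dict[str, Resources] = {
    "compute-large": Resources(cpus=32, mem_gb=64),
    "memory-large": Resources(cpus=16, mem_gb=128),
    "general-medium": Resources(cpus=16, mem_gb=64),
    "general-small": Resources(cpus=8, mem_gb=32),
}


def pick_instance(need: Resources, catalog: Dict[str, Resources]) -> Optional[str]:
    """Return the smallest instance type that satisfies the need, or None."""
    fitting = [
        (spec.cpus, spec.mem_gb, name)
        for name, spec in catalog.items()
        if spec.cpus >= need.cpus and spec.mem_gb >= need.mem_gb
    ]
    # Smallest by CPU count, then memory, then name as a tie-breaker.
    return min(fitting)[2] if fitting else None


plan = {step: pick_instance(need, INSTANCE_TYPES) for step, need in STEP_NEEDS.items()}
for step, instance in plan.items():
    print(f"{step}: {instance}")
```

With a single fixed node type, every step would have to run on hardware sized for the most demanding step. A per-step choice like this one is only possible when several instance configurations are available.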
2. Pain points and challenges faced by life sciences at the infrastructure level
In the past, most life science companies built their own on-premises IDC (data center) facilities. In general, the IT infrastructure of these enterprises faces three major problems: a fixed resource scale, a long construction period, and high hardware operation and maintenance costs. They show up as follows:
1. Fixed resources, unable to meet the needs of business growth and resource diversity
1.1 The scale of computing power is fixed, which affects the speed of business growth
When a traditional IDC is built, the resource scale is usually fixed in the initial plan, so the task throughput of the whole cluster is fixed as well. New drug R&D and sequencing workloads are cyclical, and different R&D phases and tasks have different resource requirements. As a result, tasks queue for resources during peak periods, while resources sit idle during troughs. Handling such workloads calls for elastic computing resources.
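The peak-and-trough effect can be reproduced with a small simulation. The sketch below is a toy model under stated assumptions: the demand series and the fixed capacity are made-up numbers in node-hours per period, and unfinished work simply carries over to the next period.

```python
from typing import List, Tuple


def simulate_fixed(demand: List[int], capacity: int) -> Tuple[int, int, int]:
    """Return (peak_backlog, final_backlog, idle) for a fixed-size cluster.

    demand and capacity are node-hours per period; unfinished work
    carries over to the next period as backlog.
    """
    backlog = 0
    idle = 0
    peak_backlog = 0
    for arriving in demand:
        work = backlog + arriving
        done = min(work, capacity)
        idle += capacity - done      # capacity paid for but not used
        backlog = work - done        # work still waiting in the queue
        peak_backlog = max(peak_backlog, backlog)
    return peak_backlog, backlog, idle


# Hypothetical cyclical workload: two quiet periods, a two-period peak, two quiet periods.
demand = [20, 20, 100, 100, 20, 20]
peak, final, idle = simulate_fixed(demand, capacity=50)
print(f"fixed cluster: peak backlog={peak}, final backlog={final}, idle={idle}")

# An idealized elastic cluster provisions exactly what each period needs.
elastic_node_hours = sum(demand)
fixed_node_hours = 50 * len(demand)
print(f"node-hours provisioned: fixed={fixed_node_hours}, elastic={elastic_node_hours}")
```

In this toy run the fixed cluster provisions more node-hours than the idealized elastic one, yet it still ends with work in the queue, because its idle capacity falls in the quiet periods rather than at the peak.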
1.2 Fixed resource allocation, unable to meet the needs of resource diversity
Because the computing resources of an on-premises IDC are planned up front and the available configurations are limited, traditional sequencing pipelines often run every step on the same kind of resource and cannot change this flexibly, which wastes a large amount of computing capacity. As noted earlier, however, the computing resources actually required are both elastic and diverse.
1.3 The storage capacity is fixed, which cannot meet the ever-increasing storage demands of users
As storage keeps growing, bioinformatics enterprises face heavy pressure from the operation and maintenance of on-premises storage equipment and from the cost of procuring it. Obtaining an efficient, secure, stable, cost-effective, and sustainable storage solution is another major problem for life science enterprises.
Take protein structure research as an example. There are generally three methods for determining protein structure: X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy (cryo-EM). With cryo-EM, the electron microscope data for a single sample is typically about 10 TB, so an enterprise's local data volume reaches the petabyte (PB) level. Bioinformatics research data likewise includes large amounts of reference library data, sample data, and intermediate files; the full-process data for a single human whole genome sequencing run is about 1 TB. Given the periodic and specialized nature of bioinformatics data, the local storage of a typical bioinformatics enterprise also reaches the petabyte level.
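A back-of-the-envelope calculation shows how quickly these per-sample figures add up. The sketch below uses the two sizes quoted above (about 10 TB per cryo-EM sample and about 1 TB per whole genome run) and assumes decimal units, that is 1 PB = 1,000 TB. The sample counts in the usage lines are hypothetical.

```python
TB_PER_PB = 1000            # assumption: decimal units, 1 PB = 1,000 TB
TB_PER_WGS_RUN = 1          # from the text: about 1 TB per whole genome run
TB_PER_CRYO_EM_SAMPLE = 10  # from the text: about 10 TB per cryo-EM sample


def samples_per_petabyte(tb_per_sample: int) -> int:
    """How many samples of a given size fit into one petabyte."""
    return TB_PER_PB // tb_per_sample


def total_storage_tb(wgs_runs: int, cryo_em_samples: int) -> int:
    """Total raw storage in TB for a mix of WGS runs and cryo-EM samples."""
    return wgs_runs * TB_PER_WGS_RUN + cryo_em_samples * TB_PER_CRYO_EM_SAMPLE


print(samples_per_petabyte(TB_PER_WGS_RUN))         # genomes per PB
print(samples_per_petabyte(TB_PER_CRYO_EM_SAMPLE))  # cryo-EM samples per PB

# Hypothetical yearly workload: 500 genomes and 60 cryo-EM samples.
print(total_storage_tb(wgs_runs=500, cryo_em_samples=60) / TB_PER_PB, "PB")
```

Under these assumptions, roughly a hundred cryo-EM samples or a thousand genomes are enough to fill a petabyte, before any replicas or backups are counted.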
2. Long construction period affects business growth
2.1 The delivery cycle is long and cannot meet users' need for resources that are ready to use immediately
Traditional IDC construction generally goes through project approval, bidding, procurement, and delivery, and often takes several months or even a year. During project approval, the enterprise has to estimate the scale of future business and commit to a resource construction plan. For a fast-growing business, such a long construction period becomes a bottleneck.
2.2 Hardware is refreshed slowly and cannot meet users' constantly growing resource requirements
With traditional IDC construction, enterprises often find it difficult to obtain hardware of the latest architecture quickly, even though such hardware can often speed up the business considerably.
For example, NVIDIA states that the A100, built on the Ampere architecture, can deliver up to a 20-fold speedup over the previous Volta generation for single-precision training, which is a great boost for AI-accelerated protein structure prediction.
For WGS, developing heterogeneous acceleration solutions based on GPUs or FPGAs also involves a great deal of hardware selection and verification. When building an on-premises IDC, an enterprise has to track the release schedules of CPU, GPU, and FPGA products, choose suitable hardware specifications, and anticipate how its business architecture will evolve. This is a huge challenge for life science enterprises of every kind.
3. High operation and maintenance costs
Operating and maintaining an on-premises IDC also requires considerable staff. Beyond managing cluster computing resources, scheduling computing tasks, and managing user permissions, the stability of the computing resources matters a great deal: hardware failures in particular can seriously delay the business. If a task is terminated by downtime partway through a calculation and no checkpoint was taken, it has to be recomputed from the beginning. On-premises storage also needs disaster recovery to avoid data loss from hardware failures. Resource management, resource stability, data disaster recovery, and related work all require a dedicated operations team, which adds hidden costs.
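The cost of losing a job without a checkpoint can be seen in a small sketch of checkpoint and resume. This is a generic illustration rather than the mechanism of any particular scheduler or application: the job is a stand-in loop, the state is a small JSON file, and the `fail_at` parameter is a hypothetical hook used only to simulate a node failure.

```python
import json
import os
from typing import Dict, Optional


def load_checkpoint(path: str) -> Dict[str, int]:
    """Load saved state if a checkpoint exists, else start from scratch."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(state: Dict[str, int], path: str) -> None:
    """Write to a temporary file and rename it, so a crash never leaves a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run_job(
    total_steps: int,
    ckpt_path: str,
    checkpoint_every: int = 5,
    fail_at: Optional[int] = None,
) -> int:
    """Sum the step indices 0..total_steps-1, resuming from the last checkpoint."""
    state = load_checkpoint(ckpt_path)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]   # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, ckpt_path)
    return state["total"]
```

If a run fails at step 7 with a checkpoint every 5 steps, the next run resumes from step 5 and repeats only two steps instead of starting over. How often to checkpoint is a trade-off between the I/O overhead and the amount of work lost on failure.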
Because the infrastructure provided by traditional IDCs suffers from limited and inelastic resources, long delivery cycles, slow hardware upgrades, and high operation and maintenance costs, more and more life science companies are turning to more flexible, stable, and cost-effective cloud-based high-performance computing solutions to accelerate business innovation.
3. Alibaba Cloud E-HPC Life Science Series Solutions
Alibaba Cloud believes that high-performance computing on the cloud is the best way to build and use HPC. For the life science industry, Alibaba Cloud provides high-performance computing public cloud solutions, hybrid cloud solutions, large-memory instance performance optimization solutions, containerized solutions, pharmaceutical AI solutions, and more. Together they cover the needs of the industry's different scenarios and offer the following advantages:
(1) Rich computing power, purchased on demand: Alibaba Cloud operates 27 public cloud regions and 84 availability zones on four continents. Automatic scaling on the cloud supports scheduling across data centers to supply the computing resources that large-scale parallel jobs require. Resource types can also be configured flexibly per scheduler queue, supporting heterogeneous computing power in multiple specifications as well as large-memory, high-frequency, and other CPU instance specifications.
(2) Elastic scaling, lower cost, and higher efficiency: Alibaba Cloud's Elastic High Performance Computing (E-HPC) platform can create and delete computing nodes dynamically, lets users configure scaling policies flexibly, and charges according to actual load. Preemptible instances can cost as little as 10% of the standard price. This lowers customers' usage costs and improves the quality and speed of their work.
(3) Minimal operation and maintenance, so enterprises can focus on their core business: E-HPC is fully compatible with HPC workloads and builds clusters automatically. It provides job performance analysis, locates hot spots at the cluster, instance, and process levels, supports visual job reports, and breaks down consumption by user, task, queue, and other dimensions.
(4) New technologies whose benefits arrive quickly: At the IaaS layer, Alibaba Cloud keeps rolling out the latest computing power. At the SaaS and PaaS layers, hundreds of third-party partners integrate with Alibaba Cloud, so life science companies can obtain the technical services they need quickly. Alibaba Cloud's rich ecosystem and continuously evolving technical capabilities help enterprises access full-process technical services and the latest technology.
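The elastic scaling and preemptible pricing described in point (2) can be illustrated with a short sketch. It is not the E-HPC scaling algorithm, which the article does not describe. It shows one simple policy (size the cluster to the pending cores, within limits) and one cost formula, assuming that "as low as 10%" means 10% of the regular pay-as-you-go price. All parameter names and numbers are hypothetical.

```python
import math


def target_nodes(pending_cores: int, cores_per_node: int,
                 min_nodes: int, max_nodes: int) -> int:
    """Nodes needed to run all pending work, clamped to the allowed range."""
    needed = math.ceil(pending_cores / cores_per_node)
    return max(min_nodes, min(max_nodes, needed))


def relative_cost(preemptible_fraction: float, preemptible_price_ratio: float) -> float:
    """Cost relative to running everything at the regular price (1.0 = no saving).

    preemptible_fraction: share of node-hours served by preemptible instances.
    preemptible_price_ratio: preemptible price divided by regular price.
    """
    if not 0.0 <= preemptible_fraction <= 1.0:
        raise ValueError("preemptible_fraction must be between 0 and 1")
    regular_share = 1.0 - preemptible_fraction
    return regular_share + preemptible_fraction * preemptible_price_ratio


print(target_nodes(pending_cores=200, cores_per_node=16, min_nodes=0, max_nodes=100))
print(target_nodes(pending_cores=0, cores_per_node=16, min_nodes=0, max_nodes=100))
print(round(relative_cost(preemptible_fraction=0.8, preemptible_price_ratio=0.10), 2))
```

With no pending work the policy scales to the minimum, so an empty queue costs nothing when `min_nodes` is zero. The cost formula ignores work repeated after a preemption, so real savings would be somewhat smaller.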
Alibaba Cloud high-performance computing is widely used in industrial simulation (CAD/CAE), chip design (EDA), biomedical materials, energy exploration, public services, and other industries.
Shenshi Technology uses an elastic-supply cost optimization strategy, combined with preemptible instance pricing, to obtain massive resources at 30% of the cost. In addition, the automated operation and maintenance features of E-HPC reduce Shenshi Technology's operation and maintenance costs and improve the efficiency of cluster management.
Shengting Medical, a life science and medical company, improved on the data reliability, operation and maintenance cost, and efficiency of its traditional IDC cluster by moving to the cloud, and the efficiency of gene alignment and analysis increased by 70%. By combining Slurm workflow dependencies with automatic scaling, Alibaba Cloud's high-performance computing team also cut the waste of idle computing resources and effectively lowered the cost of use.
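As a sketch of what Slurm workflow dependencies look like, the snippet below generates a submission script in which each step waits for its predecessors through `sbatch --dependency=afterok:<jobid>`, capturing job IDs with `--parsable`. The step names, script file names, and CPU counts are hypothetical, and the article does not describe the actual setup used in this case.

```python
from typing import List, Set, Tuple

# (step name, batch script, CPUs per task, names of steps it depends on)
Step = Tuple[str, str, int, List[str]]


def build_submit_script(steps: List[Step]) -> str:
    """Return a bash script that submits the steps as a Slurm dependency chain."""
    lines = ["#!/bin/bash", "set -euo pipefail"]
    seen: Set[str] = set()
    for name, script, cpus, deps in steps:
        missing = [d for d in deps if d not in seen]
        if missing:
            raise ValueError(f"step {name!r} depends on undefined steps: {missing}")
        cmd = ["sbatch", "--parsable", f"--job-name={name}", f"--cpus-per-task={cpus}"]
        if deps:
            # afterok: start only after every listed job has finished successfully.
            job_ids = ":".join(f"${{jid_{d}}}" for d in deps)
            cmd.append(f"--dependency=afterok:{job_ids}")
        cmd.append(script)
        joined = " ".join(cmd)
        # --parsable makes sbatch print the job ID, which is captured in a shell variable.
        lines.append(f"jid_{name}=$({joined})")
        seen.add(name)
    return "\n".join(lines) + "\n"


workflow: List[Step] = [
    ("align", "align.sh", 32, []),
    ("sort", "sort.sh", 8, ["align"]),
    ("call", "call.sh", 16, ["sort"]),
]
print(build_submit_script(workflow))
```

A dependent job stays pending until its predecessors succeed. One plausible reading of the case above is that an autoscaler which counts only runnable jobs can then avoid starting nodes for later steps too early; that reading is an assumption, not something the article spells out.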
For more details on the solutions and cases, visit the "Alibaba Cloud Life Science Best Practices" page: https://developer.aliyun.com/topic/life_science_best_practice
Copyright statement: The content of this article is contributed by Alibaba Cloud's real-name registered users. The copyright belongs to the original author. The Alibaba Cloud developer community does not own the copyright and does not assume the corresponding legal responsibility. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find any content suspected of plagiarism in this community, fill out the infringement complaint form to report it. Once verified, this community will delete the allegedly infringing content immediately.