1
Camacho C, Boratyn GM, Joukov V, Vera Alvarez R, Madden TL. ElasticBLAST: accelerating sequence search via cloud computing. BMC Bioinformatics 2023; 24:117. [PMID: 36967390; PMCID: PMC10040096; DOI: 10.1186/s12859-023-05245-9] [Citation(s) in RCA: 20] [Impact Index Per Article: 20.0] [Received: 01/04/2023] [Accepted: 03/21/2023] [Indexed: 03/28/2023]
Abstract
BACKGROUND Biomedical researchers use alignments produced by BLAST (Basic Local Alignment Search Tool) to categorize their query sequences. Producing such alignments is an essential bioinformatics task that is well suited for the cloud. The cloud can perform many calculations quickly as well as store and access large volumes of data. Bioinformaticians can also use it to collaborate with other researchers, sharing their results, datasets and even their pipelines on a common platform. RESULTS We present ElasticBLAST, a cloud native application to perform BLAST alignments in the cloud. ElasticBLAST can handle anywhere from a few to many thousands of queries and run the searches on thousands of virtual CPUs (if desired), deleting resources when it is done. It uses cloud native tools for orchestration and can request discounted instances, lowering cloud costs for users. It is supported on Amazon Web Services and Google Cloud Platform. It can search BLAST databases that are user provided or from the National Center for Biotechnology Information. CONCLUSION We show that ElasticBLAST is a useful application that can efficiently perform BLAST searches for the user in the cloud, demonstrating that with two examples. At the same time, it hides much of the complexity of working in the cloud, lowering the threshold to move work to the cloud.
Affiliation(s)
- Christiam Camacho
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Grzegorz M. Boratyn
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Victor Joukov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Roberto Vera Alvarez
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Thomas L. Madden
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
2
Camacho C, Boratyn GM, Joukov V, Alvarez RV, Madden TL. ElasticBLAST: Accelerating Sequence Search via Cloud Computing. bioRxiv 2023:2023.01.04.522777. [PMID: 36789435; PMCID: PMC9928022; DOI: 10.1101/2023.01.04.522777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/06/2023]
Abstract
Background Biomedical researchers use alignments produced by BLAST (Basic Local Alignment Search Tool) to categorize their query sequences. Producing such alignments is an essential bioinformatics task that is well suited for the cloud. The cloud can perform many calculations quickly as well as store and access large volumes of data. Bioinformaticians can also use it to collaborate with other researchers, sharing their results, datasets and even their pipelines on a common platform. Results We present ElasticBLAST, a cloud native application to perform BLAST alignments in the cloud. ElasticBLAST can handle anywhere from a few to many thousands of queries and run the searches on thousands of virtual CPUs (if desired), deleting resources when it is done. It uses cloud native tools for orchestration and can request discounted instances, lowering cloud costs for users. It is supported on Amazon Web Services and Google Cloud Platform. It can search BLAST databases that are user provided or from the National Center for Biotechnology Information. Conclusion We show that ElasticBLAST is a useful application that can efficiently perform BLAST searches for the user in the cloud, demonstrating that with two examples. At the same time, it hides much of the complexity of working in the cloud, lowering the threshold to move work to the cloud.
Affiliation(s)
- Christiam Camacho
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
- Grzegorz M. Boratyn
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
- Victor Joukov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
- Roberto Vera Alvarez
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
3
Garzón W, Benavides L, Gaignard A, Redon R, Südholt M. A taxonomy of tools and approaches for distributed genomic analyses. Informatics in Medicine Unlocked 2022. [DOI: 10.1016/j.imu.2022.101024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022]
4
Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022; 4:727216. [PMID: 35118375; PMCID: PMC8805145; DOI: 10.3389/fdata.2021.727216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Accepted: 12/13/2021] [Indexed: 11/22/2022]
Abstract
Background Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, the sequenced genome often misses several highly repetitive regions. Moreover, many non-model species have no complete genomes. With the recent advances of next-generation sequencing (NGS) techniques, large-scale sequence reads for any species can be rapidly generated. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms on the market (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. This has posed a big data analysis challenge for traditional stand-alone tools to merge short read pairs and identify SSRs from large-scale data. Results In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problems of merging short read pairs and mining SSRs, respectively, in a big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of fast read pairs merging and SSRs mining from very large-scale DNA sequence data. Conclusions The excellent performance of BigFiRSt is mainly attributable to its use of Hadoop to merge read pairs and mine SSRs in parallel across distributed computing clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era.
Affiliation(s)
- Jinxiang Chen
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Fuyi Li
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
- Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
- Miao Wang
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Junlong Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Tatiana T. Marquez-Lago
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- André Leier
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Jerico Revote
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Shuqin Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Quanzhong Liu
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Quanzhong Liu
- Jiangning Song
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
- *Correspondence: Jiangning Song
5
Elisseev V, Gardiner LJ, Krishna R. Scalable in-memory processing of omics workflows. Comput Struct Biotechnol J 2022; 20:1914-1924. [PMID: 35521547; PMCID: PMC9052061; DOI: 10.1016/j.csbj.2022.04.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2022] [Revised: 04/11/2022] [Accepted: 04/11/2022] [Indexed: 11/17/2022]
Affiliation(s)
- Vadim Elisseev
- IBM Research Europe, Hartree Centre, Daresbury Laboratory, Keckwick Lane, Warrington WA4 4AD, Cheshire, UK
- Wrexham Glyndwr University, Mold Rd, Wrexham LL11 2AW, Wales, UK
- Laura-Jayne Gardiner
- IBM Research Europe, Hartree Centre, Daresbury Laboratory, Keckwick Lane, Warrington WA4 4AD, Cheshire, UK
- Ritesh Krishna
- IBM Research Europe, Hartree Centre, Daresbury Laboratory, Keckwick Lane, Warrington WA4 4AD, Cheshire, UK
6
The Design and Implementation of an Improved Lightweight BLASTP on CUDA GPU. Symmetry (Basel) 2021. [DOI: 10.3390/sym13122385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
In the field of computational biology, sequence alignment is a very important methodology. BLAST is a very common tool for performing sequence alignment in bioinformatics, provided by the National Center for Biotechnology Information (NCBI) in the USA. The BLAST server receives tens of thousands of queries every day on average. Among the procedures of BLAST, the hit detection process, whose core architecture is a lookup table, is the most time-consuming. In recent work, a lightweight BLASTP on CUDA GPU with a hybrid query-index table was proposed for servicing query sequences shorter than 512, which effectively improved query efficiency. According to the reported protein sequence length distribution, about 90% of sequences have a length equal to or smaller than 1024. In this paper, we propose an improved lightweight BLASTP to speed up hit detection for longer query sequences. The maximum supported sequence length is enlarged from 512 to 1024. As a result, one more bit is required to encode each sequence position. To meet this requirement, an extended hybrid query-index table (EHQIT) is proposed to accommodate three sequence positions in a four-byte table entry, making only one memory access sufficient to retrieve all the position information as long as the number of hits is equal to or smaller than three. Moreover, if there are more than three hits for a possible word, all the position information is stored in contiguous table entries, which eliminates branch divergence and reduces the memory space needed for pointers to the overflow buffer. A square symmetric scoring matrix, Blosum62, is used to determine the relative score made by matching two characters in a sequence alignment. The experimental results show that for queries shorter than 512 our improved lightweight BLASTP outperforms the original lightweight BLASTP with an average speedup of 1.2. When the number of hit overflows increases, the speedup can be as high as two. For queries shorter than 1024, our improved lightweight BLASTP can provide speedups ranging from 1.56 to 3.08 over CUDA-BLAST. In short, the improved lightweight BLASTP can replace the original one because it supports longer query sequences and provides better performance.
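The bit budget behind the statement that three sequence positions fit in a four-byte entry can be sketched as follows. This is only an illustration under stated assumptions: 10 bits per position (enough for positions 0 to 1023), with the two remaining bits used here as a hit count. The field layout and the names `pack_entry` and `unpack_entry` are hypothetical and are not taken from the paper's EHQIT format.

```python
POS_BITS = 10                    # 2**10 = 1024 distinct positions
POS_MASK = (1 << POS_BITS) - 1   # 0x3FF
MAX_HITS = 3                     # 3 * 10 = 30 bits, leaving 2 spare bits in 32


def pack_entry(positions):
    """Pack up to three positions (0..1023) into one 32-bit integer.

    Assumed layout: bits 30-31 hold the hit count, bits 0-29 the positions.
    """
    if len(positions) > MAX_HITS:
        raise ValueError("more than three hits would need overflow entries")
    entry = len(positions) << (MAX_HITS * POS_BITS)
    for slot, pos in enumerate(positions):
        if not 0 <= pos <= POS_MASK:
            raise ValueError(f"position {pos} does not fit in {POS_BITS} bits")
        entry |= pos << (slot * POS_BITS)
    return entry


def unpack_entry(entry):
    """Recover the list of positions stored by pack_entry."""
    count = entry >> (MAX_HITS * POS_BITS)
    return [(entry >> (slot * POS_BITS)) & POS_MASK for slot in range(count)]
```

With this layout a single 32-bit read yields every position when a word has at most three hits, which is the property the abstract highlights.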
7
Zhu F, Liu M, Wang F, Qiu D, Li R, Dai C. Automatic measurement of fetal femur length in ultrasound images: a comparison of random forest regression model and SegNet. Mathematical Biosciences and Engineering 2021; 18:7790-7805. [PMID: 34814276; DOI: 10.3934/mbe.2021387] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/13/2023]
Abstract
The aim of this work is the preliminary clinical validation and accuracy evaluation of our automatic algorithms for assessing the progression of fetal femur length (FL) in ultrasound images, comparing a random forest regression model with a SegNet model in terms of accuracy and robustness. In this study, we proposed a traditional machine learning method that detects the endpoints of the femur based on a random forest regression model, and a deep learning method based on SegNet for automatic FL measurement, which used skeletonization processing and an improved fully convolutional network. The automatic measurement results of the two methods were then evaluated quantitatively and qualitatively against the results marked by doctors. In total, 436 ultrasound fetal femur images were evaluated by the two methods. Compared with the doctors' manual annotations, the femur length measured by the method based on the random forest regression model differed by 1.23 ± 4.66 mm and that measured by the method based on SegNet differed by 0.46 ± 2.82 mm. The distance-based evaluation indicator was significantly lower than in the previous literature. The SegNet-based method performed better in cases of femoral end adhesion, low contrast, and noise interference similar in shape to the femur. The SegNet-based method achieves promising performance compared with the random forest regression model and can improve the accuracy and robustness of fetal femur length measurement in ultrasound images.
Affiliation(s)
- Fengcheng Zhu
- Department of Gynaecology and Obstetrics, the First Affiliated Hospital of Jinan University, Guangzhou, China
- Mengyuan Liu
- Department of Gynaecology and Obstetrics, the First Affiliated Hospital of Jinan University, Guangzhou, China
- Feifei Wang
- Anesthesiology Department, the First Affiliated Hospital of Jinan University, Guangzhou, China
- Di Qiu
- Department of Gynaecology and Obstetrics, the First Affiliated Hospital of Jinan University, Guangzhou, China
- Ruiman Li
- Department of Gynaecology and Obstetrics, the First Affiliated Hospital of Jinan University, Guangzhou, China
- Chenyang Dai
- Department of Gynaecology and Obstetrics, the First Affiliated Hospital of Jinan University, Guangzhou, China
8
Dash S, Rahman SR, Hines HM, Feng WC. iBLAST: Incremental BLAST of new sequences via automated e-value correction. PLoS One 2021; 16:e0249410. [PMID: 33886589; PMCID: PMC8062096; DOI: 10.1371/journal.pone.0249410] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/05/2020] [Accepted: 03/17/2021] [Indexed: 11/19/2022]
Abstract
Search results from local alignment search tools use statistical scores that are sensitive to the size of the database to report the quality of the result. For example, NCBI BLAST reports the best matches using similarity scores and expect values (i.e., e-values) calculated against the database size. Given the astronomical growth in genomics data throughout a genomic research investigation, sequence databases grow as new sequences are continuously being added to these databases. As a consequence, the results (e.g., best hits) and associated statistics (e.g., e-values) for a specific set of queries may change over the course of a genomic investigation. Thus, to update the results of a previously conducted BLAST search to find the best matches on an updated database, scientists must currently rerun the BLAST search against the entire updated database, which translates into irrecoverable and, in turn, wasted execution time, money, and computational resources. To address this issue, we devise a novel and efficient method to redeem past BLAST searches by introducing iBLAST. iBLAST leverages previous BLAST search results to conduct the same query search but only on the incremental (i.e., newly added) part of the database, recomputes the associated critical statistics such as e-values, and combines these results to produce updated search results. Our experimental results and fidelity analyses show that iBLAST delivers search results that are identical to NCBI BLAST at a substantially reduced computational cost, i.e., iBLAST performs (1 + δ)/δ times faster than NCBI BLAST, where δ represents the fraction of database growth. We then present three different use cases to demonstrate that iBLAST can enable efficient biological discovery at a much faster speed with a substantially reduced computational cost.
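The (1 + δ)/δ figure follows from searching only the newly added fraction δ of the database instead of the whole updated database of relative size 1 + δ. The sketch below illustrates that arithmetic and a deliberately simplified e-value correction. It assumes that an e-value scales linearly with the size of the database searched, which ignores the effective-length adjustments a real BLAST implementation applies. The function names, the `(subject_id, evalue)` hit tuples, and the merge step are hypothetical and do not reproduce iBLAST's actual procedure.

```python
def incremental_speedup(delta):
    """Speedup stated in the abstract: (1 + delta) / delta, for delta > 0."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return (1 + delta) / delta


def rescale_evalue(evalue, searched_size, total_size):
    """Simplified correction: assume e-values grow linearly with database size."""
    return evalue * total_size / searched_size


def merge_hits(old_hits, new_hits, old_size, added_size, top_n=10):
    """Combine hits from the old database and the increment, best e-value first.

    Hits are (subject_id, evalue) pairs; each e-value is rescaled from the
    size it was computed against to the size of the updated database.
    """
    total = old_size + added_size
    merged = [(sid, rescale_evalue(e, old_size, total)) for sid, e in old_hits]
    merged += [(sid, rescale_evalue(e, added_size, total)) for sid, e in new_hits]
    return sorted(merged, key=lambda hit: hit[1])[:top_n]
```

For a database that has grown by 10% (δ = 0.1), `incremental_speedup(0.1)` evaluates to roughly 11, and the benefit shrinks toward 2 as δ approaches 1.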
Affiliation(s)
- Sajal Dash
- National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
- Department of Computer Science, Virginia Tech, Blacksburg, VA, United States of America
- Sarthok Rasique Rahman
- Department of Biology, The Pennsylvania State University, University Park, PA, United States of America
- Department of Biological Sciences, The University of Alabama, Tuscaloosa, AL, United States of America
- Heather M. Hines
- Department of Biology, The Pennsylvania State University, University Park, PA, United States of America
- Department of Entomology, The Pennsylvania State University, University Park, PA, United States of America
- Wu-chun Feng
- Department of Computer Science, Virginia Tech, Blacksburg, VA, United States of America
- Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, United States of America
- Department of Biomedical Engineering and Mechanics, Virginia Tech, Blacksburg, VA, United States of America
- Health Sciences, Virginia Tech, Blacksburg, VA, United States of America
9
Pal S, Mondal S, Das G, Khatua S, Ghosh Z. Big data in biology: The hope and present-day challenges in it. Gene Reports 2020. [DOI: 10.1016/j.genrep.2020.100869] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/23/2022]
10
Malik G, Agarwal T, Raj U, Sundararajan VS, Bandapalli OR, Suravajhala P. Hypothetical Proteins as Predecessors of Long Non-coding RNAs. Curr Genomics 2020; 21:531-535. [PMID: 33214769; PMCID: PMC7604745; DOI: 10.2174/1389202921999200611155418] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/09/2020] [Revised: 04/28/2020] [Accepted: 05/16/2020] [Indexed: 02/07/2023]
Abstract
Hypothetical proteins [HPs] are transcripts predicted to be expressed in an organism, but for which no evidence exists in gene banks. On the other hand, long non-coding RNAs [lncRNAs] are transcripts longer than 200 bases that might be present in the 5’ UTR or intergenic regions of genes. With the known unknown [KU] regions in genomes rapidly accumulating in gene banks, there is a need to understand the role of open reading frames in the context of annotation. In this commentary, we emphasize that HPs could indeed be the predecessors of lncRNAs.
Affiliation(s)
- Girik Malik, Tanu Agarwal, Utkarsh Raj, Vijayaraghava Seshadri Sundararajan, Obul Reddy Bandapalli, Prashanth Suravajhala
- 1 Khoury College of Computer Sciences, Northeastern University, 360 Huntington Ave., Boston, MA 02115, USA; 2 Bioclues.org, Kukatpally, Hyderabad, 500072, India; 3 Labrynthe Pvt. Ltd., New Delhi, India; 4 NIIT University, NH8, Delhi-Jaipur Highway, District Alwar, Neemrana, Rajasthan 301705, India; 5 Hopp Children's Cancer Center [KiTZ], Heidelberg, Germany; 6 Division of Pediatric Neuro Oncology, German Cancer Research Center [DKFZ], German Cancer Consortium [DKTK], Heidelberg, Germany; 7 Heidelberg University, Medical Faculty, Heidelberg, Germany; 8 Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, Jaipur 302021, RJ, India
11
Chen W, Yao C, Guo Y, Wang Y, Xue Z. pmTM-align: scalable pairwise and multiple structure alignment with Apache Spark and OpenMP. BMC Bioinformatics 2020; 21:426. [PMID: 32993484; PMCID: PMC7526426; DOI: 10.1186/s12859-020-03757-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/27/2019] [Accepted: 09/16/2020] [Indexed: 12/18/2022]
Abstract
BACKGROUND Structure comparison can provide useful information to identify functional and evolutionary relationship between proteins. With the dramatic increase of protein structure data in the Protein Data Bank, computation time quickly becomes the bottleneck for large scale structure comparisons. To more efficiently deal with informative multiple structure alignment tasks, we propose pmTM-align, a parallel protein structure alignment approach based on mTM-align/TM-align. pmTM-align contains two stages to handle pairwise structure alignments with Spark and the phylogenetic tree-based multiple structure alignment task on a single computer with OpenMP. RESULTS Experiments with the SABmark dataset showed that parallelization along with data structure optimization provided considerable speedup for mTM-align. The Spark-based structure alignments achieved near ideal scalability with large datasets, and the OpenMP-based construction of the phylogenetic tree accelerated the incremental alignment of multiple structures and metrics computation by a factor of about 2-5. CONCLUSIONS pmTM-align enables scalable pairwise and multiple structure alignment computing and offers more timely responses for medium to large-sized input data than existing alignment tools such as mTM-align.
Affiliation(s)
- Weiya Chen
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China
- Chun Yao
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China
- Yingzhong Guo
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China
- Yan Wang
- School of Life Science, Huazhong University of Science and Technology, Wuhan, China
- Zhidong Xue
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China
12
Hansen AW, Murugan M, Li H, Khayat MM, Wang L, Rosenfeld J, Andrews BK, Jhangiani SN, Coban Akdemir ZH, Sedlazeck FJ, Ashley-Koch AE, Liu P, Muzny DM, Davis EE, Katsanis N, Sabo A, Posey JE, Yang Y, Wangler MF, Eng CM, Sutton VR, Lupski JR, Boerwinkle E, Gibbs RA. A Genocentric Approach to Discovery of Mendelian Disorders. Am J Hum Genet 2019; 105:974-986. [PMID: 31668702; PMCID: PMC6849092; DOI: 10.1016/j.ajhg.2019.09.027] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 06/06/2019] [Accepted: 09/27/2019] [Indexed: 12/20/2022]
Abstract
The advent of inexpensive, clinical exome sequencing (ES) has led to the accumulation of genetic data from thousands of samples from individuals affected with a wide range of diseases, but for whom the underlying genetic and molecular etiology of their clinical phenotype remains unknown. In many cases, detailed phenotypes are unavailable or poorly recorded and there is little family history to guide study. To accelerate discovery, we integrated ES data from 18,696 individuals referred for suspected Mendelian disease, together with relatives, in an Apache Hadoop data lake (Hadoop Architecture Lake of Exomes [HARLEE]) and implemented a genocentric analysis that rapidly identified 154 genes harboring variants suspected to cause Mendelian disorders. The approach did not rely on case-specific phenotypic classifications but was driven by optimization of gene- and variant-level filter parameters utilizing historical Mendelian disease-gene association discovery data. Variants in 19 of the 154 candidate genes were subsequently reported as causative of a Mendelian trait and additional data support the association of all other candidate genes with disease endpoints.
Affiliation(s)
- Adam W Hansen
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Mullai Murugan
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- He Li
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Michael M Khayat
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Liwen Wang
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Jill Rosenfeld
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- B Kim Andrews
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Shalini N Jhangiani
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Zeynep H Coban Akdemir
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
- Allison E Ashley-Koch
- Duke Molecular Physiology Institute, Duke University Medical Center, Durham, NC 27710, USA; Department of Medicine, Duke University Medical Center, Durham, NC 27710, USA
| | - Pengfei Liu
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Donna M Muzny
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Erica E Davis
- Pediatric Genetic and translational Medicine Center (P-GeM), Stanley Manne Children's Research Institute, Chicago, IL 60611, USA; Department of Pediatrics, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - Nicholas Katsanis
- Pediatric Genetic and translational Medicine Center (P-GeM), Stanley Manne Children's Research Institute, Chicago, IL 60611, USA; Department of Pediatrics, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - Aniko Sabo
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Jennifer E Posey
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Yaping Yang
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Michael F Wangler
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Christine M Eng
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - V Reid Sutton
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Texas Children's Hospital, Houston, TX 77030, USA
| | - James R Lupski
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA; Texas Children's Hospital, Houston, TX 77030, USA; Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Eric Boerwinkle
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA; School of Public Health, UTHealth, Houston, TX 77030, USA
| | - Richard A Gibbs
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA.
| |
|
13
|
Shi L, Meng X, Tseng E, Mascagni M, Wang Z. SpaRC: scalable sequence clustering using Apache Spark. Bioinformatics 2018; 35:760-768. [DOI: 10.1093/bioinformatics/bty733] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 02/20/2018] [Revised: 07/18/2018] [Accepted: 08/21/2018] [Indexed: 01/08/2023] Open
Affiliation(s)
- Lizhen Shi
- Department of Computer Science, School of Computer Science, Florida State University, Tallahassee, FL, USA
- Xiandong Meng
- US Department of Energy, Joint Genome Institute, Walnut Creek, CA, USA
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Michael Mascagni
- Department of Computer Science, School of Computer Science, Florida State University, Tallahassee, FL, USA
- Zhong Wang
- US Department of Energy, Joint Genome Institute, Walnut Creek, CA, USA
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- School of Natural Sciences, University of California at Merced, Merced, CA, USA
|
14
|
Guo R, Zhao Y, Zou Q, Fang X, Peng S. Bioinformatics applications on Apache Spark. Gigascience 2018; 7:5067872. [PMID: 30101283 PMCID: PMC6113509 DOI: 10.1093/gigascience/giy098] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 04/11/2018] [Accepted: 07/28/2018] [Indexed: 11/13/2022] Open
Abstract
With the rapid development of next-generation sequencing technology, ever-increasing quantities of genomic data pose a tremendous challenge to data processing. Therefore, there is an urgent need for highly scalable and powerful computational systems. Among the state-of-the-art parallel computing platforms, Apache Spark is a fast, general-purpose, in-memory, iterative computing framework for large-scale data processing that ensures high fault tolerance and high scalability by introducing the resilient distributed dataset abstraction. In terms of performance, Spark can be up to 100 times faster in terms of memory access and 10 times faster in terms of disk access than Hadoop. Moreover, it provides advanced application programming interfaces in Java, Scala, Python, and R. It also supports some advanced components, including Spark SQL for structured data processing, MLlib for machine learning, GraphX for computing graphs, and Spark Streaming for stream computing. We surveyed Spark-based applications used in next-generation sequencing and other biological domains, such as epigenetics, phylogeny, and drug discovery. The results of this survey are used to provide a comprehensive guideline allowing bioinformatics researchers to apply Spark in their own fields.
Affiliation(s)
- Runxin Guo
- College of Computer, National University of Defense Technology, No.109, Deya Road, Kaifu District, Changsha, 410073, China
- Yi Zhao
- Institute of Computing Technology, Chinese Academy of Sciences, No.6, South Road of the Academy of Sciences, Haidian District, Beijing, 100190, China
- Quan Zou
- School of Computer Science and Technology, No.135, Yaguan Road, Jinnan District, Tianjin University, Tianjin, 300050, China
- Xiaodong Fang
- BGI Genomics, BGI-Shenzhen, No.21, Mingzhu Road, Yantian District, Shenzhen, 518083, China
- Shaoliang Peng
- College of Computer, National University of Defense Technology, No.109, Deya Road, Kaifu District, Changsha, 410073, China; College of Computer Science and Electronic Engineering & National Supercomputer Centre in Changsha, Hunan University, No.252, Shannan Road, Yuelu District, Changsha, 410082, China
|