1
Sun F, Li H, Sun D, Fu S, Gu L, Shao X, Wang Q, Dong X, Duan B, Xing F, Wu J, Xiao M, Zhao F, Han JDJ, Liu Q, Fan X, Li C, Wang C, Shi T. Single-cell omics: experimental workflow, data analyses and applications. Sci China Life Sci 2024. PMID: 39060615. DOI: 10.1007/s11427-023-2561-0.
Abstract
Cells are the fundamental units of biological systems and exhibit unique developmental trajectories and molecular features. Exploring how genomes orchestrate the formation and maintenance of each cell, and control the cellular phenotypes of various organisms, is both captivating and intricate. Since the inception of the first single-cell RNA sequencing technology, single-cell sequencing technologies have advanced rapidly. They have expanded horizontally to the single-cell genome, epigenome, proteome, and metabolome, and vertically to the integration of multiple omics data and the incorporation of additional information such as spatial scRNA-seq and CRISPR screening. Single-cell omics represents a groundbreaking advancement in the biomedical field, offering profound insights into complex diseases, including cancers. Here, we comprehensively summarize recent advances in single-cell omics technologies, with a specific focus on methodology. This overview aims to guide researchers in selecting appropriate methods for single-cell sequencing and the related data analysis.
Affiliation(s)
- Fengying Sun
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China
- Haoyan Li
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- Dongqing Sun
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
- Shaliu Fu
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China
- Lei Gu
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
- Xin Shao
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314103, China
- Qinqin Wang
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
- Xin Dong
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
- Bin Duan
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China
- Feiyang Xing
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
- Jun Wu
- Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China
- Minmin Xiao
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China
- Fangqing Zhao
- Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, 100101, China
- Jing-Dong J Han
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Center for Quantitative Biology (CQB), Peking University, Beijing, 100871, China
- Qi Liu
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China
- Xiaohui Fan
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314103, China
- Zhejiang Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou, 310006, China
- Chen Li
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
- Chenfei Wang
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
- Tieliu Shi
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China
- Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China
- Key Laboratory of Advanced Theory and Application in Statistics and Data Science-MOE, School of Statistics, East China Normal University, Shanghai, 200062, China
2
Lee GY, Ham S, Lee SJV. Brief guide to RNA sequencing analysis for nonexperts in bioinformatics. Mol Cells 2024; 47:100060. PMID: 38614390. PMCID: PMC11091515. DOI: 10.1016/j.mocell.2024.100060.
Abstract
Transcriptome analysis is widely used in current biological research but remains challenging for many experimental scientists. Here, we present a brief but broad guide to transcriptome analysis, focusing on RNA sequencing, and provide lists of publicly available datasets, tools, and R packages for practical transcriptome analysis. This work will help biologists perform key transcriptomic analyses with minimal expertise in bioinformatics.
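As a toy illustration of one core step such guides cover (not code from the paper): depth normalization via counts-per-million (CPM) followed by a log2 fold-change between two samples. The gene names and counts below are hypothetical, and the pseudocount of 1 CPM is one common convention among several.

```python
import math

# Hypothetical gene-level read counts for two samples: gene -> (sample1, sample2).
counts = {
    "geneA": (500, 1500),
    "geneB": (100, 90),
    "geneC": (0, 40),
}

def cpm(count, library_size):
    """Counts-per-million: normalizes a raw count for sequencing depth."""
    return count * 1_000_000 / library_size

# Library size = total reads per sample.
lib1 = sum(c[0] for c in counts.values())
lib2 = sum(c[1] for c in counts.values())

for gene, (c1, c2) in counts.items():
    # Pseudocount of 1 CPM avoids log2(0) for genes unexpressed in one sample.
    lfc = math.log2((cpm(c2, lib2) + 1) / (cpm(c1, lib1) + 1))
    print(gene, round(lfc, 2))
```

Real analyses would use dedicated R packages (as the guide recommends) with more robust normalization and statistical testing; this sketch only shows the arithmetic being normalized for.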
Affiliation(s)
- Gee-Yoon Lee
- Department of Biological Sciences, Korea Advanced Institute of Science and Technology, Yuseong-gu 34141, Daejeon, South Korea
- Seokjin Ham
- Department of Biological Sciences, Korea Advanced Institute of Science and Technology, Yuseong-gu 34141, Daejeon, South Korea
- Seung-Jae V Lee
- Department of Biological Sciences, Korea Advanced Institute of Science and Technology, Yuseong-gu 34141, Daejeon, South Korea
3
Ekim B, Sahlin K, Medvedev P, Berger B, Chikhi R. Efficient mapping of accurate long reads in minimizer space with mapquik. Genome Res 2023; 33:1188-1197. PMID: 37399256. PMCID: PMC10538364. DOI: 10.1101/gr.277679.123.
Abstract
DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps (fundamental bottlenecks in read mapping) for both the human and maize genomes with [Formula: see text] sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a [Formula: see text] speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a [Formula: see text] speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic [Formula: see text] pseudochaining algorithm, which improves upon the long-standing [Formula: see text] bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
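The k-min-mer idea described above can be sketched in a few lines: sample minimizers from a sequence, group k consecutive minimizers into a seed, and index only the seeds that occur exactly once in the reference. The toy below uses lexicographic minimizers and illustrative parameters (k-mer length 5, window 4, k = 3 minimizers per seed); mapquik itself works with hashed minimizers and tuned parameters, so this is a conceptual sketch, not the tool's implementation.

```python
from collections import Counter

def minimizers(seq, k=5, w=4):
    """Return (pos, kmer) minimizers: the lexicographically smallest
    k-mer in each window of w consecutive k-mers, deduplicated."""
    kmers = [(i, seq[i:i+k]) for i in range(len(seq) - k + 1)]
    picked = []
    for i in range(len(kmers) - w + 1):
        m = min(kmers[i:i+w], key=lambda x: x[1])
        if not picked or picked[-1] != m:
            picked.append(m)
    return picked

def kminmers(mins, k=3):
    """Group k consecutive minimizers into k-min-mer seeds."""
    return [tuple(m[1] for m in mins[i:i+k]) for i in range(len(mins) - k + 1)]

def unique_index(ref):
    """Index only the k-min-mers that occur once in the reference,
    so each indexed seed anchors an unambiguous alignment position."""
    kmm = kminmers(minimizers(ref))
    counts = Counter(kmm)
    return {x for x in kmm if counts[x] == 1}
```

Because a k-min-mer spans several minimizers, it acts as a much longer (gapped) seed than a single contiguous k-mer, which is what lets the method keep sensitivity while discarding repetitive matches.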
Affiliation(s)
- Bariş Ekim
- Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
- Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, SE-106 91 Stockholm, Sweden
- Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
- Rayan Chikhi
- Department of Computational Biology, Institut Pasteur, 75015 Paris, France
4
|
Garzón W, Benavides L, Gaignard A, Redon R, Südholt M. A taxonomy of tools and approaches for distributed genomic analyses. Informatics in Medicine Unlocked 2022. DOI: 10.1016/j.imu.2022.101024.
5
Wei ZG, Fan XG, Zhang H, Zhang XD, Liu F, Qian Y, Zhang SW. kngMap: Sensitive and Fast Mapping Algorithm for Noisy Long Reads Based on the K-Mer Neighborhood Graph. Front Genet 2022; 13:890651. PMID: 35601495. PMCID: PMC9117619. DOI: 10.3389/fgene.2022.890651.
Abstract
With the rapid development of single-molecule sequencing (SMS) technologies such as PacBio single-molecule real-time and Oxford Nanopore sequencing, output read lengths are continuously increasing, which has dramatic potential for cutting-edge genomic applications. Mapping these reads to a reference genome is often the most fundamental and computing-intensive step for downstream analysis. However, long reads contain more sequencing errors and span the breakpoints of structural variants (SVs) more frequently than short reads, leaving many reads unaligned or only partially aligned by most state-of-the-art mappers. As a result, these methods usually produce local mapping results for a query read rather than a whole end-to-end alignment. We introduce kngMap, a novel k-mer neighborhood graph-based mapper specifically designed to align long, noisy SMS reads to a reference sequence. By benchmarking exhaustive experiments on both simulated and real-life SMS datasets against ten other popular SMS mapping tools (e.g., BLASR, BWA-MEM, and minimap2), we demonstrated that kngMap has higher sensitivity, aligning more reads and bases to the reference genome; meanwhile, kngMap can produce consecutive alignments for whole reads and span different categories of SVs within them. kngMap is implemented in C++ and supports multi-threading; the source code can be downloaded for free at https://github.com/zhang134/kngMap for academic use.
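Mappers of this family share a common skeleton: collect exact k-mer anchors shared by read and reference, then keep the longest colinear chain of anchors before base-level alignment. The sketch below illustrates that seed-and-chain skeleton with a quadratic dynamic program; it is a generic illustration, not kngMap's k-mer neighborhood graph algorithm, and the k-mer length of 4 is purely for the toy example.

```python
def anchors(read, ref, k=4):
    """Exact k-mer matches between read and reference, as (read_pos, ref_pos)."""
    index = {}
    for j in range(len(ref) - k + 1):
        index.setdefault(ref[j:j+k], []).append(j)
    return [(i, j) for i in range(len(read) - k + 1)
            for j in index.get(read[i:i+k], [])]

def chain(anch):
    """Longest colinear chain of anchors (O(n^2) DP):
    both read and reference coordinates must strictly increase."""
    anch = sorted(anch)
    if not anch:
        return []
    best = [1] * len(anch)   # best[i]: longest chain ending at anchor i
    prev = [-1] * len(anch)  # backpointer for reconstruction
    for i in range(len(anch)):
        for j in range(i):
            if anch[j][0] < anch[i][0] and anch[j][1] < anch[i][1] \
                    and best[j] + 1 > best[i]:
                best[i], prev[i] = best[j] + 1, j
    i = max(range(len(anch)), key=lambda x: best[x])
    out = []
    while i != -1:
        out.append(anch[i])
        i = prev[i]
    return out[::-1]
```

A chain covering the whole read corresponds to the end-to-end alignment the abstract contrasts with local-only results; noisy reads break some anchors, which is why graph-based seeding helps.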
Affiliation(s)
- Ze-Gang Wei
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Xing-Guo Fan
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Hao Zhang
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Xiao-Dan Zhang
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Fei Liu
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Yu Qian
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Shao-Wu Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi’an, China
- *Correspondence: Yu Qian; Shao-Wu Zhang
6
Edgar RC, Taylor B, Lin V, Altman T, Barbera P, Meleshko D, Lohr D, Novakovsky G, Buchfink B, Al-Shayeb B, Banfield JF, de la Peña M, Korobeynikov A, Chikhi R, Babaian A. Petabase-scale sequence alignment catalyses viral discovery. Nature 2022; 602:142-147. PMID: 35082445. DOI: 10.1038/s41586-021-04332-2.
Abstract
Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, which (at the time of writing) exceeds 20 petabases and is growing exponentially. Here we developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase and identified well over 10^5 novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude. We characterized novel viruses related to coronaviruses, hepatitis delta virus and huge phages, and analysed their environmental reservoirs. To catalyse the ongoing revolution of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics.
Affiliation(s)
- Brie Taylor
- Independent researcher, Vancouver, British Columbia, Canada
- Victor Lin
- Independent researcher, Seattle, WA, USA
- Pierre Barbera
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Dmitry Meleshko
- Center for Algorithmic Biotechnology, St Petersburg State University, St Petersburg, Russia
- Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, New York, NY, USA
- Gherman Novakovsky
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, British Columbia, Canada
- Benjamin Buchfink
- Computational Biology Group, Max Planck Institute for Biology, Tübingen, Germany
- Basem Al-Shayeb
- Department of Plant and Microbial Biology, University of California, Berkeley, Berkeley, CA, USA
- Jillian F Banfield
- Department of Earth and Planetary Science, University of California, Berkeley, Berkeley, CA, USA
- Marcos de la Peña
- Instituto de Biología Molecular y Celular de Plantas, Universidad Politécnica de Valencia-CSIC, Valencia, Spain
- Anton Korobeynikov
- Center for Algorithmic Biotechnology, St Petersburg State University, St Petersburg, Russia
- Department of Statistical Modelling, St Petersburg State University, St Petersburg, Russia
- Rayan Chikhi
- G5 Sequence Bioinformatics, Department of Computational Biology, Institut Pasteur, Paris, France
- Artem Babaian
- Independent researcher, Vancouver, British Columbia, Canada
7
Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022; 4:727216. PMID: 35118375. PMCID: PMC8805145. DOI: 10.3389/fdata.2021.727216.
Abstract
Background: Simple sequence repeats (SSRs) are short tandem repeats of nucleotide sequences. SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs, but sequenced genomes often miss several highly repetitive regions, and many non-model species have no complete genome at all. With recent advances in next-generation sequencing (NGS), large-scale sequence reads for any species can be generated rapidly, and a number of methods have been proposed to identify thousands of SSR loci from large read sets of non-model species. Because the most commonly used NGS platforms (e.g., Illumina) generally produce short paired-end reads, merging overlapping read pairs has become a common preprocessing step prior to SSR identification. This poses a big-data challenge for traditional stand-alone tools that merge short read pairs and identify SSRs from large-scale data.
Results: We present a new Hadoop-based software program, BigFiRSt, that addresses this problem using big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented on top of two state-of-the-art stand-alone tools, FLASH and PERF, which address read-pair merging and SSR mining, respectively, in a big-data manner. Comprehensive benchmarking experiments show that BigFiRSt dramatically reduces the execution times of read-pair merging and SSR mining on very large-scale DNA sequence data.
Conclusions: The performance of BigFiRSt derives mainly from Hadoop's parallel and distributed computing on clusters. We anticipate BigFiRSt will be a valuable tool in the coming era of biological big data.
Affiliation(s)
- Jinxiang Chen
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Fuyi Li
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
- Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
- Miao Wang
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Junlong Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Tatiana T. Marquez-Lago
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- André Leier
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Jerico Revote
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Shuqin Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Quanzhong Liu
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Jiangning Song
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
- *Correspondence: Jiangning Song
8
Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Technology dictates algorithms: recent developments in read alignment. Genome Biol 2021; 22:249. PMID: 34446078. PMCID: PMC8390189. DOI: 10.1186/s13059-021-02443-7.
Abstract
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
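Most of the surveyed aligners ultimately reduce to variants of one scoring recurrence. The sketch below shows semi-global alignment, the flavor read mapping needs: the read must align end to end, while leading and trailing gaps in the reference are free. The unit match/mismatch/gap scores are illustrative, not taken from the survey.

```python
def semiglobal(read, ref, match=1, mismatch=-1, gap=-1):
    """Semi-global alignment score: the read aligns end-to-end,
    but may start and end anywhere in the reference (free end gaps)."""
    m, n = len(read), len(ref)
    prev = [0] * (n + 1)            # row 0: leading reference gaps are free
    for i in range(1, m + 1):
        cur = [prev[0] + gap]       # column 0: gaps in the read are penalized
        for j in range(1, n + 1):
            s = match if read[i - 1] == ref[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,   # match/mismatch (diagonal)
                           prev[j] + gap,     # gap in the read
                           cur[j - 1] + gap)) # gap in the reference
        prev = cur
    return max(prev)                # trailing reference gaps are free
```

This O(mn) dynamic program is exactly what seeding and chaining (the techniques the survey catalogs) exist to avoid running over whole genomes: in practice it is applied only to small candidate regions found by the seeds.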
Affiliation(s)
- Mohammed Alser
- Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
- Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
- Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Jeremy Rotman
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Dhrithi Deshpande
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
- Kodi Taraszka
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Huwenbo Shi
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Pelin Icer Baykal
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Harry Taegyun Yang
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Bioinformatics Interdepartmental Ph.D. Program, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Victor Xue
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Sergey Knyazev
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Benjamin D Singer
- Division of Pulmonary and Critical Care Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Department of Biochemistry & Molecular Genetics, Northwestern University Feinberg School of Medicine, Chicago, USA
- Simpson Querrey Institute for Epigenetics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Brunilda Balliu
- Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
- David Koslicki
- Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16801, USA
- Biology Department, Pennsylvania State University, University Park, PA, 16801, USA
- The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16801, USA
- Pavel Skums
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
- Can Alkan
- Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
- Bilkent-Hacettepe Health Sciences and Technologies Program, Ankara, Turkey
- Onur Mutlu
- Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
- Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
- Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
9
Pablo RGJ, Roberto DP, Victor SU, Isabel GR, Paul C, Elizabeth OR. Big data in the healthcare system: a synergy with artificial intelligence and blockchain technology. J Integr Bioinform 2021; 19:jib-2020-0035. PMID: 34412176. PMCID: PMC9135137. DOI: 10.1515/jib-2020-0035.
Abstract
Over recent decades, big data has facilitated and improved daily work in medical research and clinical practice; the strategy for getting to this point is understanding how to organize and analyze the data in order to accomplish the final goal: improving the healthcare system in terms of costs and benefits, quality of life, and patient outcomes. The main objective of this review is to illustrate the state of the art of big data in healthcare, including its features and architecture. We also demonstrate the applications and principal mechanisms of big data in the latest technologies, namely blockchain and artificial intelligence, recognizing their benefits and limitations. Medical education and digital anatomy are largely unexplored fields that may be profitable to investigate, as we propose here. The healthcare system can be revolutionized using these technologies. Thus, we explain the basis of these systems as applied to the medical arena, in order to encourage physicians, nurses, biotechnologists, and other healthcare professionals to get involved and create a more efficient and effective system.
Affiliation(s)
- Reyes-González Juan Pablo
- Department of Radiology, Hospital Angeles del Pedregal, Mexico City, Mexico
- Department of Technology Innovation, hdm.world, Florida, USA
- Soto-Ulloa Victor
- Department of Technology Innovation, hdm.world, Florida, USA
- Emergency Department, Hospital General #48, Instituto Mexicano del Seguro Social, Mexico City, México
- Galvan-Remigio Isabel
- Department of Technology Innovation, hdm.world, Florida, USA
- College of Medicine, Universidad Nacional Autonoma de Mexico, Mexico City, Mexico
- Castillo Paul
- Division of Pediatric Hematology Oncology, Department of Pediatrics, University of Florida, Gainesville, FL, USA
- Ogando-Rivas Elizabeth
- Department of Technology Innovation, hdm.world, Florida, USA
- Department of Neurosurgery, Brain Tumor Immunotherapy Program, McKnight Brain Institute, University of Florida, Gainesville, FL, USA
10
Koppad S, B A, Gkoutos GV, Acharjee A. Cloud Computing Enabled Big Multi-Omics Data Analytics. Bioinform Biol Insights 2021; 15:11779322211035921. PMID: 34376975. PMCID: PMC8323418. DOI: 10.1177/11779322211035921.
Abstract
High-throughput experiments enable researchers to explore complex multifactorial diseases through large-scale analysis of omics data. Challenges for such high-dimensional data sets include storage, analyses, and sharing. Recent innovations in computational technologies and approaches, especially in cloud computing, offer a promising, low-cost, and highly flexible solution in the bioinformatics domain. Cloud computing is rapidly proving increasingly useful in molecular modeling, omics data analytics (eg, RNA sequencing, metabolomics, or proteomics data sets), and for the integration, analysis, and interpretation of phenotypic data. We review the adoption of advanced cloud-based and big data technologies for processing and analyzing omics data and provide insights into state-of-the-art cloud bioinformatics applications.
Collapse
Affiliation(s)
- Saraswati Koppad
- Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India
| | - Annappa B
- Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India
| | - Georgios V Gkoutos
- Institute of Cancer and Genomic Sciences and Centre for Computational Biology, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK.,Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK.,NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospitals Birmingham, Birmingham, UK.,MRC Health Data Research UK (HDR UK), London, UK.,NIHR Experimental Cancer Medicine Centre, Birmingham, UK.,NIHR Biomedical Research Centre, University Hospitals Birmingham, Birmingham, UK
| | - Animesh Acharjee
- Institute of Cancer and Genomic Sciences and Centre for Computational Biology, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK.,Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK.,NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospitals Birmingham, Birmingham, UK
| |
Collapse
|
11
|
Robinson T, Harkin J, Shukla P. Hardware Acceleration of Genomics Data Analysis: Challenges and Opportunities. Bioinformatics 2021; 37:1785-1795. [PMID: 34037688 PMCID: PMC8317111 DOI: 10.1093/bioinformatics/btab017] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2020] [Revised: 11/03/2020] [Accepted: 05/24/2021] [Indexed: 12/11/2022] Open
Abstract
The significant decline in the cost of genome sequencing has dramatically changed the typical bioinformatics pipeline for analysing sequencing data: where sequencing itself was traditionally the computational challenge, it is now secondary to genomic data analysis. Short read alignment (SRA) is a ubiquitous process within every modern bioinformatics pipeline in the field of genomics and is often regarded as the principal computational bottleneck. Many hardware and software approaches have been proposed to address this acceleration challenge. However, previous attempts to increase throughput using many-core processing strategies have enjoyed limited success, mainly due to a dependence on global memory for each computational block. The limited scalability and high energy costs of many-core SRA implementations pose a significant constraint in maintaining acceleration. The Networks-on-Chip (NoC) hardware interconnect mechanism has advanced the scalability of many-core computing systems and, more recently, has demonstrated potential in SRA implementations by efficiently integrating multiple computational blocks such as pre-alignment filtering and sequence alignment, while minimising memory latency and global memory access. This paper provides a state-of-the-art review of current hardware acceleration strategies for genomic data analysis, and it establishes the challenges and opportunities of utilising NoCs as a critical building block in next-generation sequencing (NGS) technologies for advancing the speed of analysis.
Collapse
Affiliation(s)
- Tony Robinson
- School of Computing, Engineering and Intelligent Systems, Ulster University, Magee Campus, Derry/Londonderry, BT48 7JL, UK
| | - Jim Harkin
- School of Computing, Engineering and Intelligent Systems, Ulster University, Magee Campus, Derry/Londonderry, BT48 7JL, UK
| | - Priyank Shukla
- Northern Ireland Centre for Stratified Medicine, Biomedical Sciences Research Institute, Ulster University, C-TRIC Building, Altnagelvin Area Hospital, Derry/Londonderry, BT47 6SB, UK
| |
Collapse
|
12
|
GPrimer: a fast GPU-based pipeline for primer design for qPCR experiments. BMC Bioinformatics 2021; 22:220. [PMID: 33926379 PMCID: PMC8082839 DOI: 10.1186/s12859-021-04133-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2020] [Accepted: 04/14/2021] [Indexed: 11/10/2022] Open
Abstract
Background Design of valid high-quality primers is essential for qPCR experiments. MRPrimer is a powerful MapReduce-based pipeline that combines primer design for target sequences with homology tests on off-target sequences. It takes an entire sequence DB as input and returns all feasible and valid primer pairs existing in the DB. Owing to the effectiveness of primers designed by MRPrimer in qPCR analysis, it has been widely used for developing many online design tools and building primer databases. However, the computational speed of MRPrimer is too slow to deal with sequence DBs whose sizes are growing exponentially, and thus must be improved. Results We develop a fast GPU-based pipeline for primer design (GPrimer) that takes the same input and returns the same output as MRPrimer. MRPrimer consists of seven MapReduce steps, two of which are very time-consuming. GPrimer significantly improves the speed of those two steps by exploiting the computational power of GPUs. In particular, it designs data structures for coalesced memory access in the GPU and for workload balancing among GPU threads, and copies the data structures between main memory and GPU memory in a streaming fashion. For the human RefSeq DB, GPrimer achieves a speedup of 57 times across all steps and of 557 times for the most time-consuming step using a single machine with 4 GPUs, compared with MRPrimer running on a cluster of six machines. Conclusions We propose a GPU-based pipeline for primer design that takes an entire sequence DB as input and returns all feasible and valid primer pairs in the DB at once, without an additional step using BLAST-like tools. The software is available at https://github.com/qhtjrmin/GPrimer.git.
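The single-primer constraints that pipelines like GPrimer evaluate at scale can be illustrated in miniature. The sketch below is not GPrimer's GPU implementation: the function names (`gc_content`, `wallace_tm`, `passes_single_filters`) and the threshold values are illustrative assumptions, and real pipelines use more accurate thermodynamic models and also test cross-hybridization against off-target sequences.

```python
def gc_content(seq):
    """Fraction of G/C bases in a primer candidate."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(seq):
    """Rough melting temperature via the Wallace rule: 2*(A+T) + 4*(G+C)."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

def passes_single_filters(seq, gc_range=(0.4, 0.6), tm_range=(58, 62)):
    """Apply typical single-primer constraints used in qPCR design."""
    lo, hi = gc_range
    return lo <= gc_content(seq) <= hi and tm_range[0] <= wallace_tm(seq) <= tm_range[1]

candidates = ["ATGCGTACGTTAGCCTAGAT", "AAAAAAAAAATTTTTTTTTT"]
valid = [c for c in candidates if passes_single_filters(c)]
```

GPrimer's contribution is running such per-candidate filters (and the far costlier pairwise and homology checks) over entire sequence databases on GPUs, where laying the candidates out for coalesced memory access matters.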
Collapse
|
13
|
Zou Y, Zhu Y, Li Y, Wu FX, Wang J. Parallel computing for genome sequence processing. Brief Bioinform 2021; 22:6210355. [PMID: 33822883 DOI: 10.1093/bib/bbab070] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2020] [Revised: 01/26/2021] [Accepted: 02/10/2021] [Indexed: 01/08/2023] Open
Abstract
The rapid increase of genome data brought by gene sequencing technologies poses a massive challenge to data processing. To solve the problems caused by enormous data and complex computing requirements, researchers have proposed many methods and tools, which can be divided into three types: big data storage, efficient algorithm design and parallel computing. The purpose of this review is to investigate popular parallel programming technologies for genome sequence processing. Three common parallel computing models are introduced according to their hardware architectures, each of which is classified into two or three types and further analyzed in terms of its features. Then, parallel computing for genome sequence processing is discussed through four common applications: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing, and pattern detection and searching. For each application, its background is first introduced, and then a list of tools and algorithms is summarized in terms of principle, hardware platform and computing efficiency. The programming model of each hardware platform and application provides a reference for researchers to choose high-performance computing tools. Finally, we discuss the limitations and future trends of parallel computing technologies.
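The data-parallel pattern underlying many of the reviewed tools, scattering independent per-read computations across workers and gathering the results, can be sketched with Python's standard library. This is an illustrative toy, not any of the tools surveyed; it uses threads for brevity, whereas CPU-bound genomic workloads would typically use processes, GPUs, or a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def gc_fraction(read):
    """GC fraction of one read -- an independent, embarrassingly parallel task."""
    return (read.count("G") + read.count("C")) / len(read)

def parallel_gc(reads, workers=4):
    """Scatter reads across workers; map() preserves input order in the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gc_fraction, reads))

reads = ["ACGT", "GGGG", "ATAT"]
fractions = parallel_gc(reads)
```

The same scatter/compute/gather shape recurs at every scale discussed in the review, from SIMD lanes to MapReduce clusters; only the partitioning granularity and the communication cost change.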
Collapse
Affiliation(s)
- You Zou
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yuejie Zhu
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yaohang Li
- Department of Computer Science, Old Dominion University, USA
| | - Fang-Xiang Wu
- College of Engineering and the Department of Computer Science at the University of Saskatchewan, Saskatoon, Canada
| | - Jianxin Wang
- School of Computer Science and Engineering at Central South University, Changsha, Hunan, China
| |
Collapse
|
14
|
Pal S, Mondal S, Das G, Khatua S, Ghosh Z. Big data in biology: The hope and present-day challenges in it. GENE REPORTS 2020. [DOI: 10.1016/j.genrep.2020.100869] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
15
|
AlJame M, Ahmad I. DNA short read alignment on apache spark. APPLIED COMPUTING AND INFORMATICS 2020. [DOI: 10.1016/j.aci.2019.04.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
Abstract
The evolution of technologies has unleashed a wealth of challenges by generating massive amounts of data. Recently, biological data has increased exponentially, introducing several computational challenges. DNA short-read alignment is an important problem in bioinformatics, and the exponential growth in the number of short reads has increased the need for an ideal platform to accelerate the alignment process. Apache Spark is a cluster-computing framework that provides data parallelism and fault tolerance. In this article, we propose a Spark-based algorithm, called Spark-DNAligning, to accelerate the DNA short-read alignment problem. Spark-DNAligning exploits Apache Spark's performance optimizations such as broadcast variables, join after partitioning, caching, and in-memory computations. Spark-DNAligning is evaluated in terms of performance by comparing it with the SparkBWA tool and a MapReduce-based algorithm called CloudBurst. All experiments are conducted on Amazon Web Services (AWS). Results demonstrate that Spark-DNAligning outperforms both tools, providing speedups in the range of 101–702 times when aligning gigabytes of short reads to the human genome. Empirical evaluation reveals that Apache Spark offers promising solutions to the DNA short-read alignment problem.
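The broadcast-and-map pattern that Spark-DNAligning builds on can be sketched without Spark at all. The following is a hypothetical, drastically simplified exact-match aligner: the reference seed index plays the role of a Spark broadcast variable shared by all workers, and the per-read loop corresponds to the map stage. Real aligners handle mismatches, reverse complements, and multiple seeds per read.

```python
from collections import defaultdict

K = 4  # seed length (illustrative)

def index_reference(ref):
    """Build a seed index of the reference; in Spark this would be broadcast once."""
    idx = defaultdict(list)
    for i in range(len(ref) - K + 1):
        idx[ref[i:i + K]].append(i)
    return idx

def map_read(read, idx):
    """Map step: propose alignment positions from the read's first seed."""
    return [(read, pos) for pos in idx.get(read[:K], [])]

def align(reads, ref):
    idx = index_reference(ref)
    hits = []
    for read in reads:  # in Spark: reads.flatMap(lambda r: map_read(r, bcast.value))
        for read_, pos in map_read(read, idx):
            if ref[pos:pos + len(read_)] == read_:  # verify a full exact match
                hits.append((read_, pos))
    return hits
```

Broadcasting the index avoids re-shipping the reference with every task, which is one of the Spark optimizations the abstract credits for the speedup.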
Collapse
|
16
|
Fernandes M, Decouchant J, Volp M, Couto FM, Esteves-Verissimo P. DNA-SeAl: Sensitivity Levels to Optimize the Performance of Privacy-Preserving DNA Alignment. IEEE J Biomed Health Inform 2020; 24:907-915. [DOI: 10.1109/jbhi.2019.2914952] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
|
17
|
Morrow AK, He GZ, Nothaft FA, Tu ET, Paschall J, Yosef N, Joseph AD. Mango: Exploratory Data Analysis for Large-Scale Sequencing Datasets. Cell Syst 2019; 9:609-613.e3. [PMID: 31812694 DOI: 10.1016/j.cels.2019.11.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2018] [Revised: 12/04/2018] [Accepted: 11/04/2019] [Indexed: 11/25/2022]
Abstract
The decreasing cost of DNA sequencing over the past decade has led to an explosion of sequencing datasets, leaving us with petabytes of data to analyze. However, current sequencing visualization tools are designed to run on single machines, which limits their scalability and interactivity on modern genomic datasets. Here, we leverage the scalability of Apache Spark to provide Mango, a Jupyter notebook and genome browser that remove scalability and interactivity constraints by using multi-node compute clusters for interactive analysis over terabytes of sequencing data. We demonstrate the scalability of the Mango tools by performing quality control analyses on 10 terabytes of 100 high-coverage sequencing samples from the Simons Genome Diversity Project, enabling interactive genomic exploration of multi-sample datasets that surpass the computational limitations of single-node visualization tools. Mango is freely available for download with full documentation at https://bdg-mango.readthedocs.io/en/latest/.
Collapse
Affiliation(s)
- Alyssa Kramer Morrow
- Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA.
| | - George Zhixuan He
- Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA; Harvard Law School, 1563 Massachusetts Avenue, Cambridge, MA 02138, USA; Google, 355 Main St, Cambridge, MA 02142, USA
| | - Frank Austin Nothaft
- Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA; Databricks, Inc., 160 Spear Street, 13th Floor, San Francisco, CA 94105, USA
| | - Eric Tongching Tu
- Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA; The Boeing Company, 1950 E Imperial Hwy, El Segundo, CA 90245-2701, USA
| | - Justin Paschall
- Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA
| | - Nir Yosef
- Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA; Center for Computational Biology, University of California Berkeley, 108 Stanley Hall, Berkeley, CA 94720-3220, USA
| | - Anthony Douglas Joseph
- Electrical Engineering and Computer Science Department, University of California Berkeley, 465 Soda Hall, Berkeley, CA 94720-1776, USA; Center for Computational Biology, University of California Berkeley, 108 Stanley Hall, Berkeley, CA 94720-3220, USA; Unite Genomics, Inc., 1301 Marina Village Pkwy, Suite 320, Alameda, CA 94501, USA
| |
Collapse
|
18
|
Lightbody G, Haberland V, Browne F, Taggart L, Zheng H, Parkes E, Blayney JK. Review of applications of high-throughput sequencing in personalized medicine: barriers and facilitators of future progress in research and clinical application. Brief Bioinform 2019; 20:1795-1811. [PMID: 30084865 PMCID: PMC6917217 DOI: 10.1093/bib/bby051] [Citation(s) in RCA: 88] [Impact Index Per Article: 17.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2018] [Revised: 05/01/2018] [Indexed: 12/28/2022] Open
Abstract
There has been an exponential growth in the performance and output of sequencing technologies (omics data), with full genome sequencing now producing gigabases of reads on a daily basis. These data may hold the promise of personalized medicine, leading to routinely available sequencing tests that can guide patient treatment decisions. In the era of high-throughput sequencing (HTS), computational considerations, data governance and clinical translation are the greatest rate-limiting steps. To ensure that the analysis, management and interpretation of such extensive omics data is exploited to its full potential, key factors, including sample sourcing, technology selection and computational expertise and resources, need to be considered, leading to an integrated set of high-performance tools and systems. This article provides an up-to-date overview of the evolution of HTS and the accompanying tools, infrastructure and data management approaches that are emerging in this space, which, if used within a multidisciplinary context, may ultimately facilitate the development of personalized medicine.
Collapse
Affiliation(s)
- Gaye Lightbody
- School of Computing, Ulster University, Newtownabbey, UK
| | - Valeriia Haberland
- MRC Integrative Epidemiology Unit, Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
| | - Fiona Browne
- School of Computing, Ulster University, Newtownabbey, UK
| | | | - Huiru Zheng
- School of Computing, Ulster University, Newtownabbey, UK
| | - Eileen Parkes
- Centre for Cancer Research & Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University, Belfast, UK
| | - Jaine K Blayney
- Centre for Cancer Research & Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University, Belfast, UK
| |
Collapse
|
19
|
Senol Cali D, Kim JS, Ghose S, Alkan C, Mutlu O. Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions. Brief Bioinform 2019; 20:1542-1559. [PMID: 29617724 PMCID: PMC6781587 DOI: 10.1093/bib/bby017] [Citation(s) in RCA: 108] [Impact Index Per Article: 21.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2017] [Revised: 02/06/2018] [Indexed: 02/06/2023] Open
Abstract
Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. 
We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Moreover, the bottlenecks we have identified can help developers improve the current tools or build new ones that are both accurate and fast, overcoming the high error rates of nanopore sequencing technology.
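The read-to-read overlap finders compared above (GraphMap, Minimap) rely on sampling k-mers; Minimap in particular samples minimizers. A minimal illustrative sketch of minimizer selection follows, not Minimap's implementation: real tools hash k-mers rather than comparing them lexicographically, and also consider reverse complements.

```python
def minimizers(seq, k=3, w=4):
    """Return (position, kmer) minimizers: in each window of w consecutive
    k-mers, keep the smallest one. Adjacent windows usually share a minimizer,
    so the sequence is summarized by far fewer k-mers than it contains."""
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        chosen.add(min(window))  # lexicographic min; real tools use a hash
    return sorted((pos, km) for km, pos in chosen)
```

Two reads that share minimizers become overlap candidates, which is the cheap filter that lets tools like Minimap avoid all-vs-all base-level comparison.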
Collapse
Affiliation(s)
- Damla Senol Cali
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Jeremie S Kim
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Department of Computer Science, Systems Group, ETH Zürich, Zürich, Switzerland
| | - Saugata Ghose
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Bilkent, Ankara, Turkey
| | - Onur Mutlu
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Department of Computer Science, Systems Group, ETH Zürich, Zürich, Switzerland
| |
Collapse
|
20
|
Abstract
Enterprise cloud bus (ECBS) is a multi-agent-based abstraction-layer framework responsible for publishing and discovery of services in an inter-cloud environment. Our work focuses on a Hadoop-based service discovery (HBSD) model that addresses the challenges of automatic web service discovery patterns. It has been observed that an RDBMS can handle data sizes only up to a few terabytes and fails to scale beyond that, whereas Apache Hadoop can be used for parallel processing of massive datasets. This article provides a novel Hadoop-based service discovery (HBSD) approach that can handle the vast amount of data generated by heterogeneous cloud services. The novelty of the proposed architecture is that it coordinates cloud participants, automates service registration patterns, reconfigures discovered services and focuses on aggregating heterogeneous services from inter-cloud environments. Moreover, this article presents a novel and efficient algorithm (HBSDMCA) for finding the appropriate service as per the user's requirements, providing higher QoS for user requests for web services.
Collapse
Affiliation(s)
- Gitosree Khan
- B. P. Poddar Institute of Management and Technology, Kolkata, India
| | | | - Sabnam Sengupta
- B. P. Poddar Institute of Management and Technology, Kolkata, India
| |
Collapse
|
21
|
Huang L, Krüger J, Sczyrba A. Analyzing large scale genomic data on the cloud with Sparkhit. Bioinformatics 2019; 34:1457-1465. [PMID: 29253074 PMCID: PMC5925781 DOI: 10.1093/bioinformatics/btx808] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2017] [Accepted: 12/14/2017] [Indexed: 12/26/2022] Open
Abstract
Motivation The increasing amount of next-generation sequencing data poses a fundamental challenge for large-scale genomic analytics. Existing tools use different distributed computational platforms to scale out bioinformatics workloads. However, these tools do not scale efficiently and incur heavy runtime overheads when pre-processing large amounts of data. To address these limitations, we have developed Sparkhit: a distributed bioinformatics framework built on top of the Apache Spark platform. Results Sparkhit integrates a variety of analytical methods. It is implemented in the Spark extended MapReduce model. It runs 92-157 times faster than MetaSpark on metagenomic fragment recruitment and 18-32 times faster than Crossbow on data pre-processing. We analyzed 100 terabytes of data across four genomic projects in the cloud in 21 h, including the run times of cluster deployment and data downloading. Furthermore, our application on the entire Human Microbiome Project shotgun sequencing data was completed in 2 h, presenting an approach to easily associate large amounts of public datasets with reference data. Availability and implementation Sparkhit is freely available at: https://rhinempi.github.io/sparkhit/. Contact asczyrba@cebitec.uni-bielefeld.de. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Liren Huang
- Faculty of Technology, Bielefeld University, Bielefeld 33615, Germany.,Center for Biotechnology - CeBiTec, Bielefeld University, Bielefeld 33615, Germany.,Computational Methods for the Analysis of the Diversity and Dynamics of Genomes, Bielefeld University, Bielefeld 33615, Germany
| | - Jan Krüger
- Faculty of Technology, Bielefeld University, Bielefeld 33615, Germany.,Center for Biotechnology - CeBiTec, Bielefeld University, Bielefeld 33615, Germany
| | - Alexander Sczyrba
- Faculty of Technology, Bielefeld University, Bielefeld 33615, Germany.,Center for Biotechnology - CeBiTec, Bielefeld University, Bielefeld 33615, Germany.,Computational Methods for the Analysis of the Diversity and Dynamics of Genomes, Bielefeld University, Bielefeld 33615, Germany
| |
Collapse
|
22
|
Choi I, Ponsero AJ, Bomhoff M, Youens-Clark K, Hartman JH, Hurwitz BL. Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons. Gigascience 2019; 8:5266304. [PMID: 30597002 PMCID: PMC6354030 DOI: 10.1093/gigascience/giy165] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2018] [Accepted: 12/17/2018] [Indexed: 11/23/2022] Open
Abstract
Background Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content. Results We developed a tool called Libra that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra uses a scalable Hadoop framework for massive metagenome comparisons, Cosine Similarity for calculating the distance using sequence composition and abundance while normalizing for sequencing depth, and a web-based implementation in iMicrobe (http://imicrobe.us) that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community. Conclusions A comparison of Libra to equivalent tools using both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, reveals that methods commonly implemented to reduce compute time for large datasets, such as data reduction, read count normalization, and presence/absence distance metrics, greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
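Libra's core distance computation, cosine similarity over k-mer abundance vectors, can be sketched in a few lines. This toy version, with hypothetical helper names, operates on single sequences in memory rather than Hadoop-scale read sets, but the depth-normalization property is the same: scaling all counts by a constant leaves the similarity unchanged.

```python
import math
from collections import Counter

def kmer_profile(seq, k=3):
    """Abundance vector of all k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_similarity(p, q):
    """Cosine of the angle between two k-mer abundance vectors.
    Dividing by the vector norms normalizes for sequencing depth,
    so doubling every count (2x coverage) does not change the result."""
    dot = sum(p[km] * q[km] for km in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)
```

This is why the abstract contrasts Libra with presence/absence metrics: keeping full abundances preserves resolution, while the norm division still corrects for unequal sequencing depth between samples.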
Collapse
Affiliation(s)
- Illyoung Choi
- Department of Computer Science, University of Arizona, 1040 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Alise J Ponsero
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Matthew Bomhoff
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Ken Youens-Clark
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA
| | - John H Hartman
- Department of Computer Science, University of Arizona, 1040 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Bonnie L Hurwitz
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA.,BIO5 Institute, University of Arizona, 1657 E. Helen Street, Tucson, Arizona, 85719, USA
| |
Collapse
|
23
|
Abstract
Background: One of the pivotal challenges in today's genomic research domain is the fast processing of voluminous data, such as that generated by high-throughput next-generation sequencing technologies. On the other hand, BLAST (Basic Local Alignment Search Tool), a long-established and renowned tool in bioinformatics, has shown to be incredibly slow in this regard. Objective: To improve the performance of BLAST in the processing of voluminous data, we have applied a novel memory-aware technique to BLAST for faster parallel processing. Method: We use a master-worker model alongside a memory-aware technique in which the master partitions the whole data set into equal chunks, one chunk per worker, and each worker further splits and formats its allocated chunk according to the size of its memory. Each worker then searches its split data one by one through a list of queries. Results: We chose a list of queries of different lengths to run intensive searches in a huge database called UniProtKB/TrEMBL. Our experiments show a 20 percent improvement in performance when workers used our proposed memory-aware technique compared to when they were not memory-aware. Comparatively, experiments show an even higher performance improvement, approximately 50 percent, when we applied our memory-aware technique to mpiBLAST. Conclusion: We have shown that memory-awareness in formatting a bulky database when running BLAST can improve performance significantly, while preventing unexpected crashes in low-memory environments. Even though distributed computing attempts to mitigate search time by partitioning and distributing database portions, our memory-aware technique alleviates the negative effects of page faults on performance.
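The two-level partitioning described in the Method section, equal chunks from the master followed by memory-sized splits at each worker, can be sketched abstractly. The function names and the use of plain byte sizes instead of actual sequence records are illustrative assumptions, not the authors' code.

```python
def partition(db_size, workers):
    """Master: split the database into one (nearly) equal chunk per worker."""
    base, rem = divmod(db_size, workers)
    return [base + (1 if i < rem else 0) for i in range(workers)]

def memory_aware_splits(chunk_size, worker_memory):
    """Worker: split its chunk into pieces no larger than its own memory,
    so each piece can be formatted and searched without page faults."""
    pieces = []
    remaining = chunk_size
    while remaining > 0:
        piece = min(worker_memory, remaining)
        pieces.append(piece)
        remaining -= piece
    return pieces
```

A worker with 4 GB of memory handed a 9 GB chunk would search three pieces (4, 4, and 1 GB) sequentially instead of thrashing on one oversized piece, which is the effect the 20 percent improvement is attributed to.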
Collapse
Affiliation(s)
- Majid Hajibaba
- Department of Electrical Engineering and Information Technology, Iranian Research Organization for Science and Technology, Tehran, Iran
| | - Mohsen Sharifi
- School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
| | - Saeid Gorgin
- Department of Electrical Engineering and Information Technology, Iranian Research Organization for Science and Technology, Tehran, Iran
| |
Collapse
|
24
|
Choi J, Park Y, Kim S, Chae H. Cloud-BS: A MapReduce-based bisulfite sequencing aligner on cloud. J Bioinform Comput Biol 2018; 16:1840028. [PMID: 30567473 DOI: 10.1142/s0219720018400280] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
In recent years, there have been many studies utilizing DNA methylome data to answer fundamental biological questions. Bisulfite sequencing (BS-seq) has enabled genome-wide measurement of absolute DNA methylation levels at single-nucleotide resolution. However, due to the ambiguity introduced by bisulfite treatment, the alignment process, especially in large-scale epigenetic research, is still considered a huge burden. We present Cloud-BS, an efficient BS-seq aligner designed for parallel execution in a distributed environment. Utilizing the Apache Hadoop framework, Cloud-BS splits sequencing reads into multiple blocks and transfers them to distributed nodes. By designing each alignment procedure as separate map and reduce tasks, with an internal key-value structure optimized for the MapReduce programming model, the algorithm significantly improves alignment performance without sacrificing mapping accuracy. In addition, Cloud-BS minimizes the innate burden of configuring a distributed environment by providing a pre-configured cloud image. Cloud-BS shows significantly improved bisulfite alignment performance compared to other existing BS-seq aligners. We believe our algorithm facilitates large-scale methylome data analysis. The algorithm is freely available at https://paryoja.github.io/Cloud-BS/ .
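The bisulfite ambiguity that Cloud-BS contends with comes from C-to-T conversion of unmethylated cytosines. The standard workaround, aligning in a reduced three-letter alphabet and then recovering methylation calls from the original bases, can be sketched as follows. This is an illustrative toy with hypothetical function names, not Cloud-BS's MapReduce implementation; real aligners also handle the G-to-A strand and inexact matches.

```python
def c_to_t(seq):
    """Collapse the bisulfite-ambiguous base: treat every C as T."""
    return seq.replace("C", "T")

def bisulfite_map(read, ref):
    """Locate a bisulfite-treated read by exact search in C->T space.
    Returns the reference offset of the first hit, or -1."""
    return c_to_t(ref).find(c_to_t(read))

def methylation_calls(read, ref, pos):
    """At reference C positions: a read C means methylated (protected from
    conversion), a read T means unmethylated (converted by bisulfite)."""
    calls = []
    for i, rbase in enumerate(ref[pos:pos + len(read)]):
        if rbase == "C":
            calls.append((pos + i, "methylated" if read[i] == "C" else "unmethylated"))
    return calls
```

Because every read must be searched in converted space (and typically against two converted references), BS-seq alignment multiplies the work of ordinary alignment, which is why distributing it over Hadoop nodes pays off.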
Affiliation(s)
- Joungmin Choi, Division of Computer Science, Sookmyung Women's University, 100 Cheongpa-ro 47-gil, 04310 Seoul, Republic of Korea
- Yoonjae Park, Department of Electrical and Computer Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, 08826 Seoul, Republic of Korea
- Sun Kim, Department of Computer Science and Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, 08826 Seoul, Republic of Korea
- Heejoon Chae, Division of Computer Science, Sookmyung Women's University, 100 Cheongpa-ro 47-gil, 04310 Seoul, Republic of Korea

25
Banegas-Luna AJ, Imbernón B, Llanes Castro A, Pérez-Garrido A, Cerón-Carrasco JP, Gesing S, Merelli I, D'Agostino D, Pérez-Sánchez H. Advances in distributed computing with modern drug discovery. Expert Opin Drug Discov 2018; 14:9-22. [PMID: 30484337 DOI: 10.1080/17460441.2019.1552936] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/17/2022]
Abstract
INTRODUCTION Computational chemistry dramatically accelerates the drug discovery process, and high-performance computing (HPC) can be used to speed up the most expensive calculations. Supporting a local HPC infrastructure is both costly and time-consuming, so many research groups are moving from in-house solutions to remote distributed computing platforms. AREAS COVERED The authors focus on the use of distributed technologies, solutions, and infrastructures to gain access to HPC capabilities, software tools, and datasets to run the complex simulations required in computational drug discovery (CDD). EXPERT OPINION The use of computational tools can decrease the time to market of new drugs. HPC has a crucial role in handling the complex algorithms and large volumes of data required to achieve specificity and avoid undesirable side effects. Distributed computing environments have clear advantages over in-house solutions in terms of cost and sustainability. The use of infrastructures relying on virtualization reduces set-up costs. Distributed computing resources can be difficult to access, although web-based solutions are becoming increasingly available. There is a trade-off between cost-effectiveness and accessibility in using on-demand computing resources rather than free/academic resources. Graphics processing unit computing, with its outstanding parallel computing power, is becoming increasingly important.
Affiliation(s)
- Antonio Jesús Banegas-Luna, Bioinformatics and High Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), Murcia, Spain
- Baldomero Imbernón, Bioinformatics and High Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), Murcia, Spain
- Antonio Llanes Castro, Bioinformatics and High Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), Murcia, Spain
- Alfonso Pérez-Garrido, Bioinformatics and High Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), Murcia, Spain
- José Pedro Cerón-Carrasco, Bioinformatics and High Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), Murcia, Spain
- Sandra Gesing, Center for Research Computing, University of Notre Dame, Notre Dame, IN, USA
- Ivan Merelli, Institute for Biomedical Technologies, National Research Council of Italy, Segrate (Milan), Italy
- Daniele D'Agostino, Institute for Applied Mathematics and Information Technologies "E. Magenes", National Research Council of Italy, Genoa, Italy
- Horacio Pérez-Sánchez, Bioinformatics and High Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), Murcia, Spain

26
Soe S, Park Y, Chae H. BiSpark: a Spark-based highly scalable aligner for bisulfite sequencing data. BMC Bioinformatics 2018; 19:472. [PMID: 30526492 PMCID: PMC6288881 DOI: 10.1186/s12859-018-2498-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 12/16/2017] [Accepted: 11/16/2018] [Indexed: 04/06/2024] Open
Abstract
BACKGROUND Bisulfite sequencing is one of the major high-resolution DNA methylation measurement methods. Because sodium bisulfite selectively converts unmethylated cytosines, processing bisulfite-treated sequencing reads requires additional, computationally demanding steps, and the lack of an efficient aligner designed for bisulfite-treated sequencing has become a bottleneck of large-scale DNA methylome analyses. RESULTS In this study, we present BiSpark, a highly scalable, efficient, and load-balanced bisulfite aligner designed for processing large volumes of bisulfite sequencing data. We implemented the BiSpark algorithm on Apache Spark, a memory-optimized distributed data processing framework, to achieve maximum data-parallel efficiency. The algorithm supports redistribution of imbalanced data to minimize delays in large-scale distributed environments. CONCLUSIONS Experiments on methylome datasets show that BiSpark significantly outperforms other state-of-the-art bisulfite sequencing aligners in alignment speed and in scalability with respect to dataset size and the number of computing nodes, while providing highly consistent and comparable mapping results. AVAILABILITY The BiSpark software package and source code are available at https://github.com/bhi-kimlab/BiSpark/ .
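The load-balancing idea the abstract mentions, redistributing skewed data before alignment, can be sketched in plain Python. BiSpark itself does this with Spark RDD repartitioning; the helper below is only an illustration of the effect.

```python
# Pure-Python sketch of rebalancing skewed read partitions so that every
# worker receives an almost equal share (illustrative; not the BiSpark API).
from math import ceil

def rebalance(partitions, n_workers):
    """Flatten skewed partitions and re-split them into even chunks."""
    reads = [r for part in partitions for r in part]
    size = ceil(len(reads) / n_workers)
    return [reads[i:i + size] for i in range(0, len(reads), size)]

skewed = [["r1", "r2", "r3", "r4", "r5"], ["r6"], []]  # one hot partition
balanced = rebalance(skewed, 3)
# balanced -> three chunks of two reads each
```

Without this step, the worker holding the hot partition would dominate the makespan of the alignment stage.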
Affiliation(s)
- Seokjun Soe, Department of Computer Science and Engineering, Seoul National University, Seoul, Republic of Korea
- Yoonjae Park, Seoul National University, Seoul, Republic of Korea
- Heejoon Chae, Division of Computer Science, Sookmyung Women's University, Seoul, Republic of Korea

27
Alnasir JJ, Shanahan HP. The application of Hadoop in structural bioinformatics. Brief Bioinform 2018; 21:96-105. [PMID: 30462158 DOI: 10.1093/bib/bby106] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 07/27/2018] [Revised: 09/20/2018] [Accepted: 10/05/2018] [Indexed: 11/13/2022] Open
Abstract
This paper reviews the use of the Hadoop platform in structural bioinformatics applications. For structural bioinformatics, Hadoop provides a new framework for analysing large fractions of the Protein Data Bank, which is key for high-throughput studies of, for example, protein-ligand docking, clustering of protein-ligand complexes and structural alignment. Specifically, we review a number of high-throughput analyses implemented with Hadoop reported in the literature, and their scalability. We find that these deployments for the most part call known executables from MapReduce rather than rewriting the algorithms. Scalability varies between deployments, and direct comparisons of Hadoop with other batch schedulers on the same platform are generally absent from the literature, although there is some evidence that Message Passing Interface implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of interfacing with and configuring a resource to use Hadoop. This will improve over time as interfaces to Hadoop (e.g. Spark) improve, usage of cloud platforms (e.g. Azure and Amazon Web Services (AWS)) increases and standardised approaches such as workflow languages (i.e. Workflow Definition Language, Common Workflow Language and Nextflow) are taken up.
Affiliation(s)
- Jamie J Alnasir, Institute of Cancer Research, Old Brompton Road, London, United Kingdom
- Hugh P Shanahan, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom

28
Mondal S, Maji RK, Ghosh Z, Khatua S. ParStream-seq: An improved method of handling next generation sequence data. Genomics 2018; 111:1641-1650. [PMID: 30448525 DOI: 10.1016/j.ygeno.2018.11.014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/31/2018] [Revised: 11/11/2018] [Accepted: 11/12/2018] [Indexed: 10/27/2022]
Abstract
The exponential growth of next generation sequencing (NGS) data poses challenges for its storage as well as its efficient and fast analysis. Storing the entire dataset for a particular experiment and aligning it to the reference genome is an essential step in any quantitative analysis of NGS data. Here, we introduce the streaming access technique 'ParStream-seq', which splits bulk sequence data accessed from a remote repository into short, manageable packets and then executes their alignment in parallel on each compute core. The optimal packet size with a fixed number of reads is determined in the stream so as to maximize system utilization. Results show a reduction in execution time and an improvement in memory footprint. Overall, this streaming access technique overcomes the hurdle of storing the entire volume of sequence data for an experiment prior to its analysis.
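The packet-streaming idea above can be sketched with a generator that consumes a read stream in fixed-size packets, so alignment can begin before the whole dataset is stored locally. The names below are illustrative, not taken from the ParStream-seq implementation.

```python
# Sketch of streaming a remote read source in fixed-size packets, each of
# which would be dispatched to a compute core for alignment.
from itertools import islice

def packets(read_stream, reads_per_packet):
    """Yield successive packets of at most `reads_per_packet` reads."""
    it = iter(read_stream)
    while packet := list(islice(it, reads_per_packet)):
        yield packet

def align_packet(packet):
    """Placeholder for dispatching one packet to a compute core."""
    return len(packet)  # e.g., number of reads aligned in this packet

stream = (f"read{i}" for i in range(10))  # stands in for a remote repository
aligned = [align_packet(p) for p in packets(stream, 4)]
# three packets: 4, 4, and 2 reads
```

In the real system the packet size is tuned on the fly to maximize utilization; here it is fixed for clarity.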
Affiliation(s)
- Sudip Mondal, Department of Computer Science and Engineering, University of Calcutta, Kolkata, India
- Zhumur Ghosh, Bioinformatics Center, Bose Institute, Kolkata, India
- Sunirmal Khatua, Department of Computer Science and Engineering, University of Calcutta, Kolkata, India

29
Dharanipragada P, Seelam SR, Parekh N. SeqVItA: Sequence Variant Identification and Annotation Platform for Next Generation Sequencing Data. Front Genet 2018; 9:537. [PMID: 30487811 PMCID: PMC6247818 DOI: 10.3389/fgene.2018.00537] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 07/25/2018] [Accepted: 10/23/2018] [Indexed: 12/20/2022] Open
Abstract
The current trend in clinical data analysis is to understand how individuals respond to therapies and drug interactions based on their genetic makeup. This has led to a paradigm shift in healthcare; caring for patients is now 99% information and 1% intervention. Reduced costs of next generation sequencing (NGS) technologies have made it possible to take genetic profiling to the clinical setting. This requires not just fast and accurate algorithms for variant detection, but also a knowledge-base for variant annotation and prioritization to facilitate tailored therapeutics based on an individual's genetic profile. Here we show that it is possible to provide fast and easy access to all available information about a variant and its impact on the gene, its protein product, associated pathways and drug-variant interactions by integrating previously reported knowledge from various databases. With this objective, we have developed a pipeline, Sequence Variants Identification and Annotation (SeqVItA), that provides an end-to-end solution for small sequence variant detection, annotation and prioritization on a single platform. With parallelization of the variant detection step and numerous resources incorporated to infer functional impact, clinical relevance and drug-variant associations, SeqVItA will benefit the clinical and research communities alike. Its open-source platform and modular framework allow easy customization of the workflow depending on the data type (single, paired, or pooled samples), variant type (germline and somatic), and variant annotation and prioritization. We compare the performance of SeqVItA on simulated data and carry out detection, interpretation and analysis of somatic variants on real data (24 liver cancer patients). We demonstrate the efficacy of the annotation module in facilitating personalized medicine based on a patient's mutational landscape. SeqVItA is freely available at https://bioinf.iiit.ac.in/seqvita.
Affiliation(s)
- Prashanthi Dharanipragada, Center for Computational Natural Science and Bioinformatics, International Institute of Information Technology, Hyderabad, India
- Sampreeth Reddy Seelam, Center for Computational Natural Science and Bioinformatics, International Institute of Information Technology, Hyderabad, India
- Nita Parekh, Center for Computational Natural Science and Bioinformatics, International Institute of Information Technology, Hyderabad, India

30
Parallel MapReduce: Maximizing Cloud Resource Utilization and Performance Improvement Using Parallel Execution Strategies. BIOMED RESEARCH INTERNATIONAL 2018; 2018:7501042. [PMID: 30417014 PMCID: PMC6207866 DOI: 10.1155/2018/7501042] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/12/2018] [Accepted: 09/30/2018] [Indexed: 01/05/2023]
Abstract
MapReduce is the preferred cloud computing framework for large-scale data analysis and application processing. MapReduce frameworks currently in place suffer performance degradation because they adopt sequential processing approaches with little modification, and thus underutilize cloud resources. To overcome this drawback and reduce costs, we introduce a Parallel MapReduce (PMR) framework in this paper. We design a novel parallel execution strategy for Map and Reduce worker nodes. Our strategy enables further performance improvement and efficient utilization of cloud resources by executing Map and Reduce functions in the multicore environments available on computing nodes. We explain in detail the makespan modeling and working principle of the PMR framework. The performance of PMR is compared with Hadoop through experiments on three biomedical applications. Experiments on the BLAST, CAP3, and DeepBind applications report makespan time reductions of 38.92%, 18.00%, and 34.62%, respectively, for the PMR framework relative to Hadoop. The results show that the proposed PMR cloud computing platform is robust, cost-effective, and scalable, and sufficiently supports diverse applications on public and private cloud platforms, with good agreement between the theoretical makespan model and the experimental values investigated.
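The benefit of overlapping Map and Reduce execution can be shown with a toy makespan model: a sequential framework inserts a barrier between the two phases, while an overlapped one schedules all tasks on the shared worker pool. This ignores map-to-reduce data dependencies, so it is only an upper-bound illustration, not the paper's actual makespan model; the task times and functions below are ours.

```python
# Toy makespan model: sequential MapReduce (barrier between phases) versus
# overlapped execution on the same worker pool. Greedy longest-first scheduling.

def _greedy(times, workers):
    """Makespan of greedily packing task times onto `workers` workers."""
    loads = [0.0] * workers
    for t in sorted(times, reverse=True):
        loads[loads.index(min(loads))] += t
    return max(loads)

def sequential_makespan(map_times, reduce_times, workers):
    """All map tasks finish before any reduce task starts."""
    return _greedy(map_times, workers) + _greedy(reduce_times, workers)

def overlapped_makespan(map_times, reduce_times, workers):
    """Map and reduce tasks share the pool with no phase barrier."""
    return _greedy(map_times + reduce_times, workers)

maps, reduces = [4, 2], [4, 2]
seq = sequential_makespan(maps, reduces, workers=2)  # 4 + 4 = 8 time units
ovl = overlapped_makespan(maps, reduces, workers=2)  # 12 units over 2 workers
```

Here the barrier leaves one worker idle at the end of each phase, which is exactly the underutilization the PMR strategy targets.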
31
Guo R, Zhao Y, Zou Q, Fang X, Peng S. Bioinformatics applications on Apache Spark. Gigascience 2018; 7:5067872. [PMID: 30101283 PMCID: PMC6113509 DOI: 10.1093/gigascience/giy098] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 04/11/2018] [Accepted: 07/28/2018] [Indexed: 11/13/2022] Open
Abstract
With the rapid development of next-generation sequencing technology, ever-increasing quantities of genomic data pose a tremendous challenge to data processing. Therefore, there is an urgent need for highly scalable and powerful computational systems. Among the state-of-the-art parallel computing platforms, Apache Spark is a fast, general-purpose, in-memory, iterative computing framework for large-scale data processing that ensures high fault tolerance and high scalability by introducing the resilient distributed dataset abstraction. In terms of performance, Spark can be up to 100 times faster in terms of memory access and 10 times faster in terms of disk access than Hadoop. Moreover, it provides advanced application programming interfaces in Java, Scala, Python, and R. It also supports some advanced components, including Spark SQL for structured data processing, MLlib for machine learning, GraphX for computing graphs, and Spark Streaming for stream computing. We surveyed Spark-based applications used in next-generation sequencing and other biological domains, such as epigenetics, phylogeny, and drug discovery. The results of this survey are used to provide a comprehensive guideline allowing bioinformatics researchers to apply Spark in their own fields.
Affiliation(s)
- Runxin Guo, College of Computer, National University of Defense Technology, No.109, Deya Road, Kaifu District, Changsha, 410073, China
- Yi Zhao, Institute of Computing Technology, Chinese Academy of Sciences, No.6, South Road of the Academy of Sciences, Haidian District, Beijing, 100190, China
- Quan Zou, School of Computer Science and Technology, Tianjin University, No.135, Yaguan Road, Jinnan District, Tianjin, 300050, China
- Xiaodong Fang, BGI Genomics, BGI-Shenzhen, No.21, Mingzhu Road, Yantian District, Shenzhen, 518083, China
- Shaoliang Peng, College of Computer, National University of Defense Technology, No.109, Deya Road, Kaifu District, Changsha, 410073, China; College of Computer Science and Electronic Engineering & National Supercomputer Centre in Changsha, Hunan University, No.252, Shannan Road, Yuelu District, Changsha, 410082, China

32
HSRA: Hadoop-based spliced read aligner for RNA sequencing data. PLoS One 2018; 13:e0201483. [PMID: 30063721 PMCID: PMC6067734 DOI: 10.1371/journal.pone.0201483] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 04/19/2018] [Accepted: 07/16/2018] [Indexed: 01/18/2023] Open
Abstract
Nowadays, the analysis of transcriptome sequencing (RNA-seq) data has become the standard method for quantifying the levels of gene expression. In RNA-seq experiments, the mapping of short reads to a reference genome or transcriptome is considered a crucial step that remains as one of the most time-consuming. With the steady development of Next Generation Sequencing (NGS) technologies, unprecedented amounts of genomic data introduce significant challenges in terms of storage, processing and downstream analysis. As cost and throughput continue to improve, there is a growing need for new software solutions that minimize the impact of increasing data volume on RNA read alignment. In this work we introduce HSRA, a Big Data tool that takes advantage of the MapReduce programming model to extend the multithreading capabilities of a state-of-the-art spliced read aligner for RNA-seq data (HISAT2) to distributed memory systems such as multi-core clusters or cloud platforms. HSRA has been built upon the Hadoop MapReduce framework and supports both single- and paired-end reads from FASTQ/FASTA datasets, providing output alignments in SAM format. The design of HSRA has been carefully optimized to avoid the main limitations and major causes of inefficiency found in previous Big Data mapping tools, which cannot fully exploit the raw performance of the underlying aligner. On a 16-node multi-core cluster, HSRA is on average 2.3 times faster than previous Hadoop-based tools. Source code in Java as well as a user’s guide are publicly available for download at http://hsra.dec.udc.es.
33
Lladós J, Cores F, Guirado F. Scalable Consistency in T-Coffee Through Apache Spark and Cassandra Database. J Comput Biol 2018; 25:894-906. [PMID: 30004242 DOI: 10.1089/cmb.2018.0084] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022] Open
Abstract
Next-generation sequencing, also known as high-throughput sequencing, has increased the volume of genetic data processed by sequencers. In the bioinformatic scientific area, highly rated multiple sequence alignment tools, such as MAFFT, ProbCons, and T-Coffee (TC), use the probabilistic consistency as a prior step to the progressive alignment stage to improve the final accuracy. However, such methods are severely limited by the memory required to store the consistency information. Big data processing and persistence techniques are used to manage and store the huge amount of information that is generated. Although these techniques have significant advantages, few biological applications have adopted them. In this article, a novel approach named big data tree-based consistency objective function for alignment evaluation (BDT-Coffee) is presented. BDT-Coffee is based on the integration of consistency information through Cassandra database in TC, previously generated by the MapReduce processing paradigm, to enable large data sets to be processed with the aim of improving the performance and scalability of the original algorithm.
Affiliation(s)
- Jordi Lladós, INSPIRES Research Center, Universitat de Lleida, Lleida, Spain
- Fernando Cores, INSPIRES Research Center, Universitat de Lleida, Lleida, Spain

34
Abstract
Biomedical research has become a digital data-intensive endeavor, relying on secure and scalable computing, storage, and network infrastructure, which has traditionally been purchased, supported, and maintained locally. For certain types of biomedical applications, cloud computing has emerged as an alternative to locally maintained traditional computing approaches. Cloud computing offers users pay-as-you-go access to services such as hardware infrastructure, platforms, and software for solving common biomedical computational problems. Cloud computing services offer secure on-demand storage and analysis and are differentiated from traditional high-performance computing by their rapid availability and scalability of services. As such, cloud services are engineered to address big data problems and enhance the likelihood of data and analytics sharing, reproducibility, and reuse. Here, we provide an introductory perspective on cloud computing to help the reader determine its value to their own research.
35
Sun X, Gao J, Jin P, Eng C, Burchard EG, Beaty TH, Ruczinski I, Mathias RA, Barnes K, Wang F, Qin ZS. Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files. Gigascience 2018; 7:4995263. [PMID: 29762754 PMCID: PMC6007233 DOI: 10.1093/gigascience/giy052] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/12/2017] [Revised: 02/06/2018] [Accepted: 05/05/2018] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND Sorted merging of genomic data is a common operation in many sequencing-based studies. It involves sorting and merging genomic data from different subjects by genomic location. In particular, merging a large number of variant call format (VCF) files is frequently required in large-scale whole-genome or whole-exome sequencing projects. Traditional single-machine methods become increasingly inefficient when processing large numbers of files because of excessive computation time and the input/output bottleneck. Distributed systems and more recent cloud-based systems offer an attractive solution, but carefully designed and optimized workflow patterns and execution plans (schemas) are required to take full advantage of the increased computing power while overcoming bottlenecks to achieve high performance. FINDINGS In this study, we custom-design optimized schemas for three Apache big data platforms, Hadoop (MapReduce), HBase, and Spark, to perform sorted merging of a large number of VCF files. These schemas all adopt the divide-and-conquer strategy to split the merging job into sequential phases/stages consisting of subtasks that are conquered in an ordered, parallel, and bottleneck-free way. In two illustrative examples, we test the performance of our schemas on merging multiple VCF files into either a single TPED or a single VCF file, benchmarked against traditional single/parallel multiway-merge methods, a message passing interface (MPI)-based high-performance computing (HPC) implementation, and the popular VCFTools. CONCLUSIONS Our experiments suggest all three schemas either deliver a significant improvement in efficiency or render much better strong and weak scalability than traditional methods. Our findings provide generalized scalable schemas for performing sorted merging of genetics and genomics data using these Apache distributed systems.
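The core operation the paper optimizes, k-way merging of coordinate-sorted variant records, can be sketched in a few lines on a single machine. Real VCF parsing, contig ordering from the header, and the distributed schemas are omitted; the contig ranking below is an assumption for the example.

```python
# Minimal sketch of sorted merging of per-sample variant records that are
# already sorted by genomic coordinate (the single-machine analogue of the
# paper's distributed schemas).
import heapq

CONTIG_ORDER = {"chr1": 0, "chr2": 1}  # assumed contig ranking

def coordinate(record):
    """Sort key: (contig rank, position)."""
    chrom, pos, _ = record
    return (CONTIG_ORDER[chrom], pos)

sample_a = [("chr1", 100, "A>G"), ("chr2", 50, "C>T")]
sample_b = [("chr1", 90, "G>A"), ("chr1", 100, "A>C")]

# heapq.merge streams the inputs, so memory stays proportional to the number
# of files, not the total number of records.
merged = list(heapq.merge(sample_a, sample_b, key=coordinate))
# merged is globally coordinate-sorted across both samples
```

The distributed schemas split exactly this job into coordinate ranges so that each range can be merged independently and in parallel.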
Affiliation(s)
- Xiaobo Sun, Department of Computer Sciences, Emory University, Atlanta, GA 30322, USA
- Jingjing Gao, Department of Medical Informatics, Emory University School of Medicine, Atlanta, GA 30322, USA
- Peng Jin, Department of Human Genetics, Emory University School of Medicine, Atlanta, GA 30322, USA
- Celeste Eng, Department of Medicine, University of California, San Francisco, San Francisco, CA 94143, USA
- Esteban G Burchard, Department of Medicine, University of California, San Francisco, San Francisco, CA 94143, USA
- Terri H Beaty, Department of Epidemiology, Bloomberg School of Public Health, JHU, Baltimore, MD 21205, USA
- Ingo Ruczinski, Department of Biostatistics, Bloomberg School of Public Health, JHU, Baltimore, MD 21205, USA
- Rasika A Mathias, Department of Medicine, Johns Hopkins University, Baltimore, MD 21224, USA
- Kathleen Barnes, Department of Medicine, University of Colorado, Denver, Aurora, CO 80045, USA
- Fusheng Wang, Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY 11794, USA
- Zhaohui S Qin, Department of Medical Informatics, Emory University School of Medicine, Atlanta, GA 30322, USA; Department of Biostatistics, Emory University, Atlanta, GA 30322, USA

36
Eslami T, Saeed F. Fast-GPU-PCC: A GPU-Based Technique to Compute Pairwise Pearson's Correlation Coefficients for Time Series Data-fMRI Study. High Throughput 2018; 7:E11. [PMID: 29677161 PMCID: PMC6023306 DOI: 10.3390/ht7020011] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 03/02/2018] [Revised: 04/04/2018] [Accepted: 04/17/2018] [Indexed: 11/16/2022] Open
Abstract
Functional magnetic resonance imaging (fMRI) is a non-invasive brain imaging technique that has been used regularly to study the brain's functional activity in the past few years. A widely used measure for capturing functional associations in the brain is Pearson's correlation coefficient, which is commonly used for constructing functional networks and studying dynamic functional connectivity of the brain. These are useful measures for understanding the effects of brain disorders on connectivity among brain regions. fMRI scanners produce a huge number of voxels, and traditional central processing unit (CPU)-based techniques for computing pairwise correlations are very time consuming, especially when a large number of subjects are studied. In this paper, we propose a graphics processing unit (GPU)-based algorithm called Fast-GPU-PCC for computing pairwise Pearson's correlation coefficients. Based on the symmetric property of Pearson's correlation, this approach returns the N(N-1)/2 correlation coefficients located in the strict upper triangle of the correlation matrix. Storing the correlations in a one-dimensional array in the order proposed in this paper is useful for further processing. Our experiments on real and synthetic fMRI data for different numbers of voxels and varying lengths of time series show that the proposed approach outperforms state-of-the-art GPU-based techniques as well as sequential CPU-based versions. We show that Fast-GPU-PCC runs 62 times faster than the CPU-based version and about 2 to 3 times faster than two other state-of-the-art GPU-based methods.
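What the GPU kernel computes can be reproduced on the CPU in a few lines: all pairwise Pearson correlations of N time series, stored as the strict upper triangle flattened into a one-dimensional array. The linear-index formula below is the standard row-major one for a strict upper triangle; the GPU-specific parts of Fast-GPU-PCC are not shown.

```python
# CPU sketch of pairwise Pearson correlation with flattened upper-triangle
# storage, the quantity Fast-GPU-PCC computes on the GPU.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def upper_index(i, j, n):
    """Linear position of pair (i, j), i < j, in the flattened strict upper triangle."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)

series = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]  # three toy "voxel" time series
n = len(series)
flat = [pearson(series[i], series[j]) for i in range(n) for j in range(i + 1, n)]
assert len(flat) == n * (n - 1) // 2
# series 0 and 1 are perfectly correlated; series 2 is anti-correlated with both
```

Keeping only the upper triangle halves the memory footprint, which matters when N is the number of voxels in a whole-brain scan.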
Affiliation(s)
- Taban Eslami, Department of Computer Science, Western Michigan University, Kalamazoo, MI 49008, USA
- Fahad Saeed, Department of Computer Science, Western Michigan University, Kalamazoo, MI 49008, USA

37
Abstract
Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.
Affiliation(s)
- Ben Langmead, Department of Computer Science, Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Abhinav Nellore, Department of Biomedical Engineering, Department of Surgery, Computational Biology Program, Oregon Health and Science University, Portland, OR, USA

38
Lin HN, Hsu WL. Kart: a divide-and-conquer algorithm for NGS read alignment. Bioinformatics 2018; 33:2281-2287. [PMID: 28379292 PMCID: PMC5860120 DOI: 10.1093/bioinformatics/btx189] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 11/08/2016] [Accepted: 04/05/2017] [Indexed: 02/02/2023] Open
Abstract
Motivation Next-generation sequencing (NGS) provides a great opportunity to investigate genome-wide variation at nucleotide resolution. Because of the huge amount of data, NGS applications require very fast and accurate alignment algorithms. Most existing read-mapping algorithms adopt a seed-and-extend strategy, which is sequential in nature and takes much longer on longer reads. Results We develop a divide-and-conquer algorithm, called Kart, which can process long reads as fast as short reads by dividing a read into small fragments that can be aligned independently. Our experiments indicate that the average size of fragments requiring the more time-consuming gapped alignment is around 20 bp regardless of the original read length. Furthermore, Kart can tolerate much higher error rates. The experiments show that Kart spends much less time on longer reads than other aligners and still produces reliable alignments even when the error rate is as high as 15%. Availability and Implementation Kart is available at https://github.com/hsinnan75/Kart/ . Contact hsu@iis.sinica.edu.tw. Supplementary information Supplementary data are available at Bioinformatics online.
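The divide-and-conquer idea can be illustrated by partitioning a read into fragments that match the reference exactly (cheap to place) plus the small leftovers that alone need expensive gapped alignment. We use `difflib` to find the exact blocks purely for illustration; Kart itself locates them with BWT-based search, and the helper below is not part of its code.

```python
# Sketch of splitting a read into exactly matching fragments and short
# "gapped" leftovers, mimicking the divide step of a divide-and-conquer
# aligner (illustrative only).
from difflib import SequenceMatcher

def partition_read(read, ref):
    """Return ('exact'|'gapped', fragment) pieces covering the whole read."""
    pieces, prev_end = [], 0
    for block in SequenceMatcher(None, read, ref, autojunk=False).get_matching_blocks():
        if block.a > prev_end:  # unmatched stretch: needs gapped alignment
            pieces.append(("gapped", read[prev_end:block.a]))
        if block.size:
            pieces.append(("exact", read[block.a:block.a + block.size]))
        prev_end = max(prev_end, block.a + block.size)
    if prev_end < len(read):
        pieces.append(("gapped", read[prev_end:]))
    return pieces

ref = "ACGTACGTTTTACGTACGT"
read = "ACGTACGTGGACGTACGT"  # differs from ref only in the middle
pieces = partition_read(read, ref)
# only the short middle fragment is flagged for gapped alignment
```

The key observation of the paper is that these leftover fragments stay small (about 20 bp on average) no matter how long the read is, so the expensive step no longer grows with read length.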
Affiliation(s)
- Hsin-Nan Lin, Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Wen-Lian Hsu, Institute of Information Science, Academia Sinica, Taipei, Taiwan

39
Samaddar S, Sinha R, De RK. A model for distributed processing and analyses of NGS data under Map-Reduce paradigm. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 16:827-840. [PMID: 29993814 DOI: 10.1109/tcbb.2018.2816022] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 06/08/2023]
Abstract
The massively parallel sequencing introduced by NGS technology has resulted in an exponential growth of sequencing data, with greatly reduced cost and increased throughput. This explosion of data has introduced new challenges regarding its storage, integration, processing and analyses. In this paper, we propose a novel distributed model under the Map-Reduce paradigm to address the NGS big-data problem. The architecture of the model is a Map-Reduce-based modularized approach comprising three phases that support various analytical pipelines. The first phase generates detailed base-level information on individual genomes by granulating the alignment data. The other two phases independently process this base-level information in parallel: one provides an integrated DNA profile of multiple individuals, whereas the other generates contigs with similar features within an individual. Each of these two phases builds a repository of genomic information that facilitates other analytical pipelines. Simulated and real experimental prototypes are provided as results to show the effectiveness of the model and its superiority over a few existing popular models and tools. A detailed description of the scope of applications of this model is also included in this article.
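The phased design above rests on the standard map/reduce contract: a map step emits key-value records, and a reduce step merges all records sharing a key. A minimal single-process sketch of the first two stages — granulating alignments into base-level records, then integrating a multi-sample profile — might look like this; the record layout and function names are assumptions for illustration, not the paper's actual schema.

```python
from collections import defaultdict

def map_phase(sample_id, alignments):
    """Map step: granulate each aligned read (ref_start, sequence) into
    per-base records keyed by reference position."""
    for ref_start, seq in alignments:
        for offset, base in enumerate(seq):
            yield ref_start + offset, (sample_id, base)

def reduce_phase(records):
    """Reduce step: merge per-base records from all samples into an
    integrated base-level profile (position -> base counts)."""
    profile = defaultdict(lambda: defaultdict(int))
    for pos, (sample_id, base) in records:
        profile[pos][base] += 1
    return {pos: dict(counts) for pos, counts in profile.items()}
```

In a real Hadoop deployment the framework shuffles the mapper output by key before the reducers run; here the two functions are simply chained in one process to show the data flow.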
Collapse
|
40
|
Chang P, Gohain M, Yen MR, Chen PY. Computational Methods for Assessing Chromatin Hierarchy. Comput Struct Biotechnol J 2018; 16:43-53. [PMID: 29686798 PMCID: PMC5910504 DOI: 10.1016/j.csbj.2018.02.003] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2017] [Revised: 01/29/2018] [Accepted: 02/11/2018] [Indexed: 12/27/2022] Open
Abstract
The hierarchical organization of chromatin is known to associate with diverse cellular functions; however, the precise mechanisms and the 3D structure remain to be determined. With recent advances in high-throughput next-generation sequencing (NGS) techniques, genome-wide profiling of chromatin structures has become possible. Here, we provide a comprehensive overview of NGS-based methods for profiling "higher-order" and "primary-order" chromatin structures from both experimental and computational aspects. Experimental requirements and considerations specific to each method are highlighted. For computational analysis, we summarize a common analysis strategy for both levels of chromatin assessment, focusing on the characteristic computing steps and tools. The recently developed single-cell-level techniques based on Hi-C and ATAC-seq present great potential to reveal cell-to-cell variability in chromosome architecture. A brief discussion of these methods in terms of experimental and data-analysis features is included. We also touch upon the biological relevance of chromatin organization and how combination with other techniques uncovers the underlying mechanisms. We conclude with a summary and our prospects on necessary improvements of currently available methods in order to advance understanding of chromatin hierarchy. Our review brings together the analyses of both higher- and primary-order chromatin structures, and serves as a roadmap for choosing appropriate experimental and computational methods for assessing chromatin hierarchy.
Collapse
Affiliation(s)
- Pearl Chang
- Institute of Plant and Microbial Biology, Academia Sinica, Taipei, Taiwan
| | - Moloya Gohain
- Institute of Plant and Microbial Biology, Academia Sinica, Taipei, Taiwan
| | - Ming-Ren Yen
- Institute of Plant and Microbial Biology, Academia Sinica, Taipei, Taiwan
| | - Pao-Yang Chen
- Institute of Plant and Microbial Biology, Academia Sinica, Taipei, Taiwan
| |
Collapse
|
41
|
Assay for Transposase Accessible Chromatin (ATAC-Seq) to Chart the Open Chromatin Landscape of Human Pancreatic Islets. Methods Mol Biol 2018; 1766:197-208. [PMID: 29605854 DOI: 10.1007/978-1-4939-7768-0_11] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
The regulatory mechanisms that ensure accurate control of gene transcription are central to cellular function, development and disease. Such mechanisms rely largely on noncoding regulatory sequences that allow the establishment and maintenance of cell identity and tissue-specific cellular functions. The study of chromatin structure and nucleosome positioning has revealed transcription-factor-accessible genomic sites with regulatory potential, facilitating the comprehension of tissue-specific cis-regulatory networks. Recently, a new technique coupled with high-throughput sequencing, named Assay for Transposase Accessible Chromatin (ATAC-seq), emerged as an efficient method to chart open chromatin genome-wide. The application of this technique to different cell types has unmasked tissue-specific regulatory elements and characterized cis-regulatory networks. Herein we describe the implementation of the ATAC-seq method in human pancreatic islets, a tissue playing a central role in the control of glucose metabolism.
Collapse
|
42
|
Abstract
Gene splicing is the process of assembling a large number of unordered short sequence fragments to the original genome sequence as accurately as possible. Several popular splicing algorithms based on reads are reviewed in this article, including reference genome algorithms and de novo splicing algorithms (Greedy-extension, Overlap-Layout-Consensus graph, De Bruijn graph). We also discuss a new splicing method based on the MapReduce strategy and Hadoop. By comparing these algorithms, some conclusions are drawn and some suggestions on gene splicing research are made.
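Of the de novo strategies listed above, the De Bruijn graph approach is the easiest to sketch: each unique k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and a walk through the graph spells out the contig. The toy below assumes error-free reads and a genome with no repeated (k-1)-mers, so the graph is a single simple path — real assemblers must additionally handle branches, repeats and sequencing errors.

```python
from collections import defaultdict

def de_bruijn_assemble(reads, k=4):
    """Toy De Bruijn assembly: unique k-mers become edges between their
    (k-1)-mer prefix and suffix; walking the path spells the contig.
    Assumes error-free reads and no repeated (k-1)-mers in the genome."""
    kmers = sorted({read[i:i + k] for read in reads
                    for i in range(len(read) - k + 1)})
    edges, indeg = defaultdict(list), defaultdict(int)
    for kmer in kmers:
        edges[kmer[:-1]].append(kmer[1:])
        indeg[kmer[1:]] += 1
    # the simple path starts at the unique node with no incoming edge
    start = next(n for n in list(edges) if indeg[n] == 0)
    contig, node = start, start
    while edges[node]:
        node = edges[node].pop(0)
        contig += node[-1]  # each step appends one new base
    return contig
```

Two overlapping reads are stitched back into the original sequence because their shared k-mers collapse into the same edges of the graph.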
Collapse
Affiliation(s)
- Xiuhua Si
- Department of Computer Science & Technology, Heilongjiang University, Harbin, China
| | - Qian Wang
- Shandong Aerospace Institute of Electronic Technology, Yantai, China
| | - Lei Zhang
- Department of Computer Science & Technology, Heilongjiang University, Harbin, China
| | - Ruo Wu
- Department of Computer Science & Technology, Heilongjiang University, Harbin, China
| | - Jiquan Ma
- Department of Computer Science & Technology, Heilongjiang University, Harbin, China
| |
Collapse
|
43
|
Al Kawam A, Khatri S, Datta A. A Survey of Software and Hardware Approaches to Performing Read Alignment in Next Generation Sequencing. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:1202-1213. [PMID: 27362989 DOI: 10.1109/tcbb.2016.2586070] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Computational genomics is an emerging field that is enabling us to reveal the origins of life and the genetic basis of diseases such as cancer. Next Generation Sequencing (NGS) technologies have unleashed a wealth of genomic information by producing immense amounts of raw data. Before any functional analysis can be applied to this data, read alignment is applied to find the genomic coordinates of the produced sequences. Alignment algorithms have evolved rapidly with the advancement in sequencing technology, striving to achieve biological accuracy at the expense of increasing space and time complexities. Hardware approaches have been proposed to accelerate the computational bottlenecks created by the alignment process. Although several hardware approaches have achieved remarkable speedups, most have overlooked important biological features, which have hampered their widespread adoption by the genomics community. In this paper, we provide a brief biological introduction to genomics and NGS. We discuss the most popular next generation read alignment tools and algorithms. Furthermore, we provide a comprehensive survey of the hardware implementations used to accelerate these algorithms.
Collapse
|
44
|
PPCAS: Implementation of a Probabilistic Pairwise Model for Consistency-Based Multiple Alignment in Apache Spark. ACTA ACUST UNITED AC 2017. [DOI: 10.1007/978-3-319-65482-9_45] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/13/2023]
|
45
|
Mashl RJ, Scott AD, Huang KL, Wyczalkowski MA, Yoon CJ, Niu B, DeNardo E, Yellapantula VD, Handsaker RE, Chen K, Koboldt DC, Ye K, Fenyö D, Raphael BJ, Wendl MC, Ding L. GenomeVIP: a cloud platform for genomic variant discovery and interpretation. Genome Res 2017; 27:1450-1459. [PMID: 28522612 PMCID: PMC5538560 DOI: 10.1101/gr.211656.116] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2016] [Accepted: 05/03/2017] [Indexed: 12/12/2022]
Abstract
Identifying genomic variants is a fundamental first step toward the understanding of the role of inherited and acquired variation in disease. The accelerating growth in the corpus of sequencing data that underpins such analysis is making the data-download bottleneck more evident, placing substantial burdens on the research community to keep pace. As a result, the search for alternative approaches to the traditional “download and analyze” paradigm on local computing resources has led to a rapidly growing demand for cloud-computing solutions for genomics analysis. Here, we introduce the Genome Variant Investigation Platform (GenomeVIP), an open-source framework for performing genomics variant discovery and annotation using cloud- or local high-performance computing infrastructure. GenomeVIP orchestrates the analysis of whole-genome and exome sequence data using a set of robust and popular task-specific tools, including VarScan, GATK, Pindel, BreakDancer, Strelka, and Genome STRiP, through a web interface. GenomeVIP has been used for genomic analysis in large-data projects such as the TCGA PanCanAtlas and in other projects, such as the ICGC Pilots, CPTAC, ICGC-TCGA DREAM Challenges, and the 1000 Genomes SV Project. Here, we demonstrate GenomeVIP's ability to provide high-confidence annotated somatic, germline, and de novo variants of potential biological significance using publicly available data sets.
Collapse
Affiliation(s)
- R Jay Mashl
- McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA.,Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
| | - Adam D Scott
- McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA.,Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
| | - Kuan-Lin Huang
- McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA.,Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
| | | | - Christopher J Yoon
- McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA.,Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
| | - Beifang Niu
- McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
| | - Erin DeNardo
- McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
| | - Venkata D Yellapantula
- McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA.,Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
| | - Robert E Handsaker
- Stanley Center for Psychiatric Research, Broad Institute, Cambridge, Massachusetts 02142, USA.,Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - Ken Chen
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA
| | - Daniel C Koboldt
- McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
| | - Kai Ye
- McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA.,Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
| | - David Fenyö
- Langone Medical Center, New York University, New York, New York 10016, USA
| | - Benjamin J Raphael
- Department of Computer Science and Center for Computational Molecular Biology, Brown University, Providence, Rhode Island 02912, USA
| | - Michael C Wendl
- McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA.,Department of Genetics, Washington University, St. Louis, Missouri 63108, USA.,Department of Mathematics, Washington University, St. Louis, Missouri 63108, USA
| | - Li Ding
- McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA.,Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA.,Department of Genetics, Washington University, St. Louis, Missouri 63108, USA.,Siteman Cancer Center, Washington University, St. Louis, Missouri 63108, USA
| |
Collapse
|
46
|
Decap D, Reumers J, Herzeel C, Costanza P, Fostier J. Halvade-RNA: Parallel variant calling from transcriptomic data using MapReduce. PLoS One 2017; 12:e0174575. [PMID: 28358893 PMCID: PMC5373595 DOI: 10.1371/journal.pone.0174575] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2016] [Accepted: 03/10/2017] [Indexed: 12/30/2022] Open
Abstract
Given the current cost-effectiveness of next-generation sequencing, the amount of DNA-seq and RNA-seq data generated is ever increasing. One of the primary objectives of NGS experiments is calling genetic variants. While highly accurate, most variant calling pipelines are not optimized to run efficiently on large data sets. However, as variant calling in genomic data has become common practice, several methods have been proposed to reduce runtime for DNA-seq analysis through the use of parallel computing. Determining the effectively expressed variants from transcriptomics (RNA-seq) data has only recently become possible, and as such does not yet benefit from efficiently parallelized workflows. We introduce Halvade-RNA, a parallel, multi-node RNA-seq variant calling pipeline based on the GATK Best Practices recommendations. Halvade-RNA makes use of the MapReduce programming model to create and manage parallel data streams on which multiple instances of existing tools such as STAR and GATK operate concurrently. Whereas the single-threaded processing of a typical RNA-seq sample requires ∼28h, Halvade-RNA reduces this runtime to ∼2h using a small cluster with two 20-core machines. Even on a single, multi-core workstation, Halvade-RNA can significantly reduce runtime compared to using multi-threading, thus providing for a more cost-effective processing of RNA-seq data. Halvade-RNA is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR.
Collapse
Affiliation(s)
- Dries Decap
- Department of Information Technology, IDLab, Ghent University - imec, Ghent, Belgium
- ExaScience Life Lab, Leuven, Belgium
| | - Joke Reumers
- Janssen Research & Development, a division of Janssen Pharmaceutica N.V., Beerse, Belgium
- ExaScience Life Lab, Leuven, Belgium
| | | | - Pascal Costanza
- Intel Corporation Belgium, Leuven, Belgium
- ExaScience Life Lab, Leuven, Belgium
| | - Jan Fostier
- Department of Information Technology, IDLab, Ghent University - imec, Ghent, Belgium
- ExaScience Life Lab, Leuven, Belgium
| |
Collapse
|
47
|
MapReduce Algorithms for Inferring Gene Regulatory Networks from Time-Series Microarray Data Using an Information-Theoretic Approach. BIOMED RESEARCH INTERNATIONAL 2017; 2017:6261802. [PMID: 28243601 PMCID: PMC5294223 DOI: 10.1155/2017/6261802] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/07/2016] [Revised: 11/14/2016] [Accepted: 12/13/2016] [Indexed: 12/15/2022]
Abstract
Gene regulation is a series of processes that control gene expression and its extent. The connections among genes and their regulatory molecules, usually transcription factors, and a descriptive model of such connections are known as gene regulatory networks (GRNs). Elucidating GRNs is crucial to understand the inner workings of the cell and the complexity of gene interactions. To date, numerous algorithms have been developed to infer gene regulatory networks. However, as the number of identified genes increases and the complexity of their interactions is uncovered, networks and their regulatory mechanisms become cumbersome to test. Furthermore, prodding through experimental results requires an enormous amount of computation, resulting in slow data processing. Therefore, new approaches are needed to expeditiously analyze copious amounts of experimental data resulting from cellular GRNs. To meet this need, cloud computing is promising, as reported in the literature. Here, we propose new MapReduce algorithms for inferring gene regulatory networks on a Hadoop cluster in a cloud environment. These algorithms employ an information-theoretic approach to infer GRNs using time-series microarray data. Experimental results show that our MapReduce program is much faster than an existing tool while achieving slightly better prediction accuracy.
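An information-theoretic approach of the kind described above typically scores each candidate gene pair by the mutual information of their discretized expression profiles, keeping high-scoring pairs as candidate regulatory edges. A minimal stdlib-only sketch follows; the `infer_edges` wrapper and its threshold are illustrative assumptions, not the paper's algorithm.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discretized
    expression profiles of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def infer_edges(expr, threshold=0.5):
    """Keep unordered gene pairs whose profiles share high mutual
    information as candidate edges (threshold is an arbitrary choice)."""
    genes = sorted(expr)
    return [(a, b) for i, a in enumerate(genes) for b in genes[i + 1:]
            if mutual_information(expr[a], expr[b]) > threshold]
```

Two perfectly co-varying binary profiles score ln 2 ≈ 0.69 nats, while independent profiles score 0, so a threshold between the two separates candidate edges from noise in this toy setting.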
Collapse
|
48
|
Canzar S, Salzberg SL. Short Read Mapping: An Algorithmic Tour. PROCEEDINGS OF THE IEEE. INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS 2017; 105:436-458. [PMID: 28502990 PMCID: PMC5425171 DOI: 10.1109/jproc.2015.2455551] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Ultra-high-throughput next-generation sequencing (NGS) technology allows us to determine the sequence of nucleotides of many millions of DNA molecules in parallel. Accompanied by a dramatic reduction in cost since its introduction in 2004, NGS technology has provided a new way of addressing a wide range of biological and biomedical questions, from the study of human genetic disease to the analysis of gene expression, protein-DNA interactions, and patterns of DNA methylation. The data generated by NGS instruments comprise huge numbers of very short DNA sequences, or 'reads', that carry little information by themselves. These reads therefore have to be pieced together by well-engineered algorithms to reconstruct biologically meaningful measurements, such as the level of expression of a gene. To solve this complex, high-dimensional puzzle, reads must be mapped back to a reference genome to determine their origin. Due to sequencing errors and to genuine differences between the reference genome and the individual being sequenced, this mapping process must be tolerant of mismatches, insertions, and deletions. Although optimal alignment algorithms to solve this problem have long been available, the practical requirements of aligning hundreds of millions of short reads to the 3-billion-base-pair human genome have stimulated the development of new, more efficient methods, which today are used routinely throughout the world for the analysis of NGS data.
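One classical way to make mapping mismatch-tolerant, as the tour above describes, is the pigeonhole principle: if a read matches with at most e mismatches, then splitting it into e+1 equal pieces guarantees at least one piece matches the reference exactly, so exact piece hits seed candidate positions that are then verified. The sketch below handles substitutions only (no indels) and illustrates the principle rather than any particular tool.

```python
def map_read(read, ref, max_mismatches=2):
    """Pigeonhole mapping: with at most e mismatches, at least one of
    e+1 equal pieces of the read must match the reference exactly, so
    exact piece hits seed candidate positions, each then verified by
    Hamming distance (substitutions only, no indels)."""
    e = max_mismatches
    piece = len(read) // (e + 1)
    candidates = set()
    for i in range(e + 1):
        seed = read[i * piece:(i + 1) * piece]
        start = ref.find(seed)
        while start != -1:
            pos = start - i * piece    # implied read start on the reference
            if 0 <= pos <= len(ref) - len(read):
                candidates.add(pos)
            start = ref.find(seed, start + 1)
    hits = []
    for pos in sorted(candidates):     # verify every candidate position
        mm = sum(a != b for a, b in zip(read, ref[pos:pos + len(read)]))
        if mm <= e:
            hits.append((pos, mm))
    return hits
```

Production aligners replace the linear `str.find` scans with an FM-index or hash-based lookup and add banded dynamic programming for indels, but the seed-then-verify structure is the same.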
Collapse
|
49
|
He KY, Ge D, He MM. Big Data Analytics for Genomic Medicine. Int J Mol Sci 2017; 18:ijms18020412. [PMID: 28212287 PMCID: PMC5343946 DOI: 10.3390/ijms18020412] [Citation(s) in RCA: 104] [Impact Index Per Article: 14.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2016] [Revised: 02/08/2017] [Accepted: 02/09/2017] [Indexed: 12/25/2022] Open
Abstract
Genomic medicine attempts to build individualized strategies for diagnostic or therapeutic decision-making by utilizing patients’ genomic information. Big Data analytics uncovers hidden patterns, unknown correlations, and other insights through examining large-scale various data sets. While integration and manipulation of diverse genomic data and comprehensive electronic health records (EHRs) on a Big Data infrastructure exhibit challenges, they also provide a feasible opportunity to develop an efficient and effective approach to identify clinically actionable genetic variants for individualized diagnosis and therapy. In this paper, we review the challenges of manipulating large-scale next-generation sequencing (NGS) data and diverse clinical data derived from the EHRs for genomic medicine. We introduce possible solutions for different challenges in manipulating, managing, and analyzing genomic and clinical data to implement genomic medicine. Additionally, we also present a practical Big Data toolset for identifying clinically actionable genetic variants using high-throughput NGS data and EHRs.
Collapse
Affiliation(s)
- Karen Y He
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH 44106, USA.
| | | | - Max M He
- BioSciKin Co., Ltd., Nanjing 210042, China.
- Computation and Informatics in Biology and Medicine, University of Wisconsin-Madison, Madison, WI 53706, USA.
| |
Collapse
|
50
|
Cattaneo G, Giancarlo R, Piotto S, Ferraro Petrillo U, Roscigno G, Di Biasi L. MapReduce in Computational Biology - A Synopsis. ADVANCES IN ARTIFICIAL LIFE, EVOLUTIONARY COMPUTATION, AND SYSTEMS CHEMISTRY 2017. [DOI: 10.1007/978-3-319-57711-1_5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/23/2023]
|