1
Expósito RR, Martínez-Sánchez M, Touriño J. SparkEC: speeding up alignment-based DNA error correction tools. BMC Bioinformatics 2022; 23:464. [PMID: 36344928 PMCID: PMC9639292 DOI: 10.1186/s12859-022-05013-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2022] [Accepted: 10/26/2022] [Indexed: 11/09/2022]
Abstract
Background In recent years, huge improvements have been made in the sequencing of genomic data under what is called Next Generation Sequencing (NGS). However, the DNA reads generated by current NGS platforms are not free of errors, which can affect the quality of downstream analysis. Although error correction can be performed as a preprocessing step to overcome this issue, it usually requires long computational times to analyze the large datasets generated nowadays through NGS. Therefore, new software capable of scaling out on a cluster of nodes with high performance is of great importance. Results In this paper, we present SparkEC, a parallel tool capable of fixing the errors produced during the sequencing process. For this purpose, the algorithms proposed by the CloudEC tool, which has already been proven to perform accurate corrections, have been analyzed and optimized to improve their performance by relying on the Apache Spark framework together with other enhancements such as the use of memory-efficient data structures and the avoidance of any input preprocessing. The experimental results show significant improvements in the computational times of SparkEC when compared to CloudEC for all the representative datasets and scenarios under evaluation, providing average and maximum speedups of 4.9× and 11.9×, respectively, over its counterpart. Conclusion As error correction can take excessive computational time, SparkEC provides a scalable solution for correcting large datasets. Due to its distributed implementation, SparkEC speed can increase with respect to the number of nodes in a cluster. Furthermore, the software is freely available under the GPLv3 license and is compatible with different operating systems (Linux, Windows and macOS). Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-05013-1.
Affiliation(s)
- Roberto R. Expósito
  - Universidade da Coruña, CITIC, Computer Architecture Group, Campus de Elviña, 15071 A Coruña, Spain
- Marco Martínez-Sánchez
  - Universidade da Coruña, CITIC, Computer Architecture Group, Campus de Elviña, 15071 A Coruña, Spain
- Juan Touriño
  - Universidade da Coruña, CITIC, Computer Architecture Group, Campus de Elviña, 15071 A Coruña, Spain
2
Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022; 4:727216. [PMID: 35118375 PMCID: PMC8805145 DOI: 10.3389/fdata.2021.727216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Accepted: 12/13/2021] [Indexed: 11/22/2022]
Abstract
Background Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, the sequenced genome often misses several highly repetitive regions. Moreover, many non-model species lack a complete genome. With recent advances in next-generation sequencing (NGS) techniques, large-scale sequence reads for any species can be rapidly generated. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. As the most commonly used NGS platforms on the market (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. Merging short read pairs and identifying SSRs from large-scale data therefore poses a big data analysis challenge for traditional stand-alone tools. Results In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problems of merging short read pairs and mining SSRs, respectively, in a big data manner. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of read pair merging and SSR mining from very large-scale DNA sequence data. Conclusions The excellent performance of BigFiRSt stems mainly from its use of Hadoop big data technology to merge read pairs and mine SSRs through parallel and distributed computing on clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era.
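To make the notion of mining SSRs from a read concrete, here is a minimal Python sketch that finds tandem repeats with a regular expression. It is only an illustration under stated assumptions and is not the method used by PERF or BigPERF; the function name `find_ssrs` and its parameters `max_motif` and `min_repeats` are hypothetical and do not come from the abstract.

```python
import re

def find_ssrs(seq, max_motif=6, min_repeats=3):
    """Return (start, motif, repeat_count) for each tandem repeat found in seq.

    Hypothetical helper for illustration only; not the PERF/BigPERF method.
    """
    # Lazy motif group (shortest motif is tried first), followed by at least
    # min_repeats - 1 further back-to-back copies of that same motif.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        # Whole match length divided by motif length gives the copy number.
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits

print(find_ssrs("TTACGACGACGACGTT"))
```

With these defaults, mononucleotide runs of three or more bases are also reported; raising `min_repeats` makes the sketch stricter. A distributed tool would presumably apply a per-read step of this kind across many reads in parallel, but that part is not shown here.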
Affiliation(s)
- Jinxiang Chen
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Fuyi Li
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
  - Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
  - Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
- Miao Wang
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Junlong Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Tatiana T. Marquez-Lago
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- André Leier
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Jerico Revote
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Shuqin Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Quanzhong Liu
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Jiangning Song
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
  - Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
  - *Correspondence: Jiangning Song
3
Zhu F, Zhang F, Hu L, Liu H, Li Y. Integrated Genome and Transcriptome Sequencing to Solve a Neuromuscular Puzzle: Miyoshi Muscular Dystrophy and Early Onset Primary Dystonia in Siblings of the Same Family. Front Genet 2021; 12:672906. [PMID: 34276779 PMCID: PMC8283672 DOI: 10.3389/fgene.2021.672906] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/26/2021] [Accepted: 04/23/2021] [Indexed: 12/17/2022]
Abstract
BACKGROUND Neuromuscular disorders (NMDs), many of which are hereditary, affect muscular function. Due to advances in high-throughput sequencing technologies, the diagnosis of hereditary NMDs has dramatically improved in recent years. METHODS AND RESULTS In this study, we report a family with two siblings exhibiting two different NMDs, Miyoshi muscular dystrophy (MMD) and early onset primary dystonia (EOPD). Whole exome sequencing (WES) identified a novel monoallelic frameshift deletion mutation (dysferlin: c.4404delC/p.I1469Sfs*17) in the Dysferlin gene in the index patient who suffered from MMD. This deletion was inherited from his unaffected father and was carried by his younger sister with EOPD. However, immunostaining revealed an absence of dysferlin expression in the proband's muscle tissue and thus suggested the presence of a second underlying mutant allele in dysferlin. Using integrated RNA sequencing (RNA-seq) and whole genome sequencing (WGS) of muscle tissue, a novel deep intronic mutation in dysferlin (dysferlin: c.5341-415A>G) was discovered in the index patient. This mutation caused aberrant mRNA splicing and inclusion of an additional pseudoexon (PE), which we termed PE48.1. This PE was inherited from his unaffected mother. PE48.1 inclusion altered the Dysferlin sequence, causing premature termination of translation. CONCLUSION Using integrated genome and transcriptome sequencing, we discovered hereditary MMD and EOPD affecting two siblings of the same family. Our results add further weight to the combined use of RNA-seq and WGS as an important method for detection of deep intronic gene mutations, and suggest that integrated sequencing assays are an effective strategy for the diagnosis of hereditary NMDs.
Affiliation(s)
- Feng Zhu
  - Department of Cardiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
  - Clinic Center of Human Gene Research, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Fengxiao Zhang
  - Department of Cardiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
  - Clinic Center of Human Gene Research, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Lizhi Hu
  - Department of Cardiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
  - Clinic Center of Human Gene Research, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Haowen Liu
  - Department of Neurology, The Third Hospital of Hebei Medical University, Shijiazhuang, China
- Yahua Li
  - Department of Respiratory Medicine, The Third Hospital of Hebei Medical University, Shijiazhuang, China
5
Yang A, Kishore A, Phipps B, Ho JWK. Cloud accelerated alignment and assembly of full-length single-cell RNA-seq data using Falco. BMC Genomics 2019; 20:927. [PMID: 31888474 PMCID: PMC6936136 DOI: 10.1186/s12864-019-6341-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 11/17/2019] [Accepted: 11/26/2019] [Indexed: 12/18/2022]
Abstract
BACKGROUND Read alignment and transcript assembly are the core of RNA-seq analysis for transcript isoform discovery. Nonetheless, current tools are not designed to be scalable for analysis of full-length bulk or single-cell RNA-seq (scRNA-seq) data. The previous version of our cloud-based tool Falco focuses only on RNA-seq read counting and does not allow for more flexible steps such as alignment and read assembly. RESULTS The Falco framework can harness the parallel and distributed computing environment in modern cloud platforms to accelerate read alignment and transcript assembly of full-length bulk RNA-seq and scRNA-seq data. There are two new modes in Falco: alignment-only and transcript assembly. In the alignment-only mode, Falco can speed up the alignment process by 2.5-16.4x based on two public scRNA-seq datasets when compared to alignment on a highly optimised standalone computer. Furthermore, it also provides a 10x average speed-up compared to alignment using Rail-RNA, a published cloud-enabled tool for read alignment. In the transcript assembly mode, Falco can speed up the transcript assembly process by 1.7-16.5x compared to performing transcript assembly on a highly optimised computer. CONCLUSION Falco is a significantly updated open-source big data processing framework that enables scalable and accelerated alignment and assembly of full-length scRNA-seq data on the cloud. The source code can be found at https://github.com/VCCRI/Falco.
Affiliation(s)
- Andrian Yang
  - Victor Chang Cardiac Research Institute, 405 Liverpool St, Darlinghurst, 2010, New South Wales, Australia
  - St. Vincent's Clinical School, University of New South Wales, Darlinghurst, 2010, New South Wales, Australia
- Abhinav Kishore
  - Victor Chang Cardiac Research Institute, 405 Liverpool St, Darlinghurst, 2010, New South Wales, Australia
- Benjamin Phipps
  - Victor Chang Cardiac Research Institute, 405 Liverpool St, Darlinghurst, 2010, New South Wales, Australia
- Joshua W K Ho
  - Victor Chang Cardiac Research Institute, 405 Liverpool St, Darlinghurst, 2010, New South Wales, Australia
  - St. Vincent's Clinical School, University of New South Wales, Darlinghurst, 2010, New South Wales, Australia
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong, China