1
|
Luo J, Guan T, Chen G, Yu Z, Zhai H, Yan C, Luo H. SLHSD: hybrid scaffolding method based on short and long reads. Brief Bioinform 2023; 24:7152317. [PMID: 37141142 DOI: 10.1093/bib/bbad169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2022] [Revised: 01/08/2023] [Accepted: 04/12/2023] [Indexed: 05/05/2023] Open
Abstract
In genome assembly, scaffolding can obtain more complete and continuous scaffolds. Current scaffolding methods usually adopt one type of read to construct a scaffold graph and then orient and order contigs. However, scaffolding with the strengths of two or more types of reads seems to be a better solution to some tricky problems. Combining the advantages of different types of data is significant for scaffolding. Here, a hybrid scaffolding method (SLHSD) is present that simultaneously leverages the precision of short reads and the length advantage of long reads. Building an optimal scaffold graph is an important foundation for getting scaffolds. SLHSD uses a new algorithm that combines long and short read alignment information to determine whether to add an edge and how to calculate the edge weight in a scaffold graph. In addition, SLHSD develops a strategy to ensure that edges with high confidence can be added to the graph with priority. Then, a linear programming model is used to detect and remove remaining false edges in the graph. We compared SLHSD with other scaffolding methods on five datasets. Experimental results show that SLHSD outperforms other methods. The open-source code of SLHSD is available at https://github.com/luojunwei/SLHSD.
Collapse
Affiliation(s)
- Junwei Luo
- School of Software, Henan Polytechnic University, Jiaozuo 454003, China
| | - Ting Guan
- School of Software, Henan Polytechnic University, Jiaozuo 454003, China
| | - Guolin Chen
- School of Software, Henan Polytechnic University, Jiaozuo 454003, China
| | - Zhonghua Yu
- School of Software, Henan Polytechnic University, Jiaozuo 454003, China
| | - Haixia Zhai
- School of Software, Henan Polytechnic University, Jiaozuo 454003, China
| | - Chaokun Yan
- School of Computer and Information Engineering, Henan University, Kaifeng 475001, China
| | - Huimin Luo
- School of Computer and Information Engineering, Henan University, Kaifeng 475001, China
| |
Collapse
|
2
|
Luo J, Wei Y, Lyu M, Wu Z, Liu X, Luo H, Yan C. A comprehensive review of scaffolding methods in genome assembly. Brief Bioinform 2021; 22:6149347. [PMID: 33634311 DOI: 10.1093/bib/bbab033] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2020] [Revised: 01/21/2021] [Accepted: 01/22/2021] [Indexed: 12/20/2022] Open
Abstract
In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.
Collapse
Affiliation(s)
- Junwei Luo
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
| | - Yawei Wei
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
| | - Mengna Lyu
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
| | - Zhengjiang Wu
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
| | - Xiaoyan Liu
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
| | - Huimin Luo
- School of Computer and Information Engineering, Henan University, Kaifeng, China
| | - Chaokun Yan
- School of Computer and Information Engineering, Henan University, Kaifeng, China
| |
Collapse
|
3
|
Luo J, Lyu M, Chen R, Zhang X, Luo H, Yan C. SLR: a scaffolding algorithm based on long reads and contig classification. BMC Bioinformatics 2019; 20:539. [PMID: 31666010 PMCID: PMC6820941 DOI: 10.1186/s12859-019-3114-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2018] [Accepted: 09/23/2019] [Indexed: 11/10/2022] Open
Abstract
Background Scaffolding is an important step in genome assembly that orders and orients the contigs produced by assemblers. However, repetitive regions in contigs usually prevent scaffolding from producing accurate results. How to solve the problem of repetitive regions has received a great deal of attention. In the past few years, long reads sequenced by third-generation sequencing technologies (Pacific Biosciences and Oxford Nanopore) have been demonstrated to be useful for sequencing repetitive regions in genomes. Although some stand-alone scaffolding algorithms based on long reads have been presented, scaffolding still requires a new strategy to take full advantage of the characteristics of long reads. Results Here, we present a new scaffolding algorithm based on long reads and contig classification (SLR). Through the alignment information of long reads and contigs, SLR classifies the contigs into unique contigs and ambiguous contigs for addressing the problem of repetitive regions. Next, SLR uses only unique contigs to produce draft scaffolds. Then, SLR inserts the ambiguous contigs into the draft scaffolds and produces the final scaffolds. We compare SLR to three popular scaffolding tools by using long read datasets sequenced with Pacific Biosciences and Oxford Nanopore technologies. The experimental results show that SLR can produce better results in terms of accuracy and completeness. The open-source code of SLR is available at https://github.com/luojunwei/SLR. Conclusion In this paper, we describes SLR, which is designed to scaffold contigs using long reads. We conclude that SLR can improve the completeness of genome assembly.
Collapse
Affiliation(s)
- Junwei Luo
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, 454000, China.
| | - Mengna Lyu
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, 454000, China
| | - Ranran Chen
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, 454000, China
| | - Xiaohong Zhang
- College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, 454000, China
| | - Huimin Luo
- School of Computer and Information Engineering, Henan University, Kaifeng, 475001, China
| | - Chaokun Yan
- School of Computer and Information Engineering, Henan University, Kaifeng, 475001, China
| |
Collapse
|
4
|
Ghurye J, Treangen T, Fedarko M, Hervey WJ, Pop M. MetaCarvel: linking assembly graph motifs to biological variants. Genome Biol 2019; 20:174. [PMID: 31451112 PMCID: PMC6710874 DOI: 10.1186/s13059-019-1791-3] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2019] [Accepted: 08/13/2019] [Indexed: 01/01/2023] Open
Abstract
Reconstructing genomic segments from metagenomics data is a highly complex task. In addition to general challenges, such as repeats and sequencing errors, metagenomic assembly needs to tolerate the uneven depth of coverage among organisms in a community and differences between nearly identical strains. Previous methods have addressed these issues by smoothing genomic variants. We present a variant-aware metagenomic scaffolder called MetaCarvel, which combines new strategies for repeat detection with graph analytics for the discovery of variants. We show that MetaCarvel can accurately reconstruct genomic segments from complex microbial mixtures and correctly identify and characterize several classes of common genomic variants.
Collapse
Affiliation(s)
- Jay Ghurye
- Department of Computer Science, University of Maryland, College Park, MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
| | - Todd Treangen
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Marcus Fedarko
- Department of Computer Science, University of Maryland, College Park, MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
| | - W Judson Hervey
- Center for Bio/Molecular Science & Engineering, United States Naval Research Laboratory, Washington, DC, USA
| | - Mihai Pop
- Department of Computer Science, University of Maryland, College Park, MD, USA.
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA.
| |
Collapse
|
5
|
Mangul S, Martin LS, Hill BL, Lam AKM, Distler MG, Zelikovsky A, Eskin E, Flint J. Systematic benchmarking of omics computational tools. Nat Commun 2019; 10:1393. [PMID: 30918265 PMCID: PMC6437167 DOI: 10.1038/s41467-019-09406-4] [Citation(s) in RCA: 86] [Impact Index Per Article: 17.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2018] [Accepted: 03/06/2019] [Indexed: 01/11/2023] Open
Abstract
Computational omics methods packaged as software have become essential to modern biological research. The increasing dependence of scientists on these powerful software tools creates a need for systematic assessment of these methods, known as benchmarking. Adopting a standardized benchmarking practice could help researchers who use omics data to better leverage recent technological innovations. Our review summarizes benchmarking practices from 25 recent studies and discusses the challenges, advantages, and limitations of benchmarking across various domains of biology. We also propose principles that can make computational biology benchmarking studies more sustainable and reproducible, ultimately increasing the transparency of biomedical data and results. Benchmarking studies are important for comprehensively understanding and evaluating different computational omics methods. Here, the authors review practices from 25 recent studies and propose principles to improve the quality of benchmarking studies.
Collapse
Affiliation(s)
- Serghei Mangul
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA. .,Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E Young Drive East, Los Angeles, CA, 90095, USA.
| | - Lana S Martin
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E Young Drive East, Los Angeles, CA, 90095, USA
| | - Brian L Hill
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
| | - Angela Ka-Mei Lam
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
| | - Margaret G Distler
- Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, GA, 30303, USA.,The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA.,Department of Human Genetics, University of California Los Angeles, 695 Charles E. Young, Los Angeles, CA, USA
| | - Jonathan Flint
- Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
| |
Collapse
|