1
Cao S, Li M, Li LM. RegCloser: a robust regression approach to closing genome gaps. BMC Bioinformatics 2023; 24:249. [PMID: 37312038 DOI: 10.1186/s12859-023-05367-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2023] [Accepted: 05/27/2023] [Indexed: 06/15/2023] Open
Abstract
BACKGROUND: Closing gaps in draft genomes leads to more complete and continuous genome assemblies. Ubiquitous genomic repeats challenge existing gap-closing methods, which are based on either the k-mer representation by the de Bruijn graph or the overlap-layout-consensus paradigm. In addition, chimeric reads cause erroneous k-mers in the former and false overlaps of reads in the latter. RESULTS: We propose a novel local assembly approach to gap closing, called RegCloser. It represents read coordinates as parameters and their overlaps as observations in a linear regression model. The optimal overlap is searched only in the restricted range consistent with insert sizes. Under this linear regression framework, the local DNA assembly becomes a robust parameter estimation problem. We solved the problem by a customized robust regression procedure that resists the influence of false overlaps by optimizing a convex global Huber loss function. The global optimum is obtained by iteratively solving the sparse system of linear equations. On both simulated and real datasets, RegCloser outperformed other popular methods in accurately resolving the copy number of tandem repeats, and achieved superior completeness and contiguity. Applying RegCloser to a plateau zokor draft genome that had been improved by long reads further increased the contig N50 3-fold. We also tested the robust regression approach on layout generation of long reads. CONCLUSIONS: RegCloser is a competitive gap-closing tool. The software is available at https://github.com/csh3/RegCloser. The robust regression approach has the potential to be incorporated into the layout module of long read assemblers.
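The regression idea in this abstract can be illustrated with a toy example. The sketch below is not the RegCloser implementation: the function names, the Gauss-Seidel solver, the threshold `delta`, and the toy overlaps are all assumptions made here for illustration. It treats read start coordinates as parameters, treats each overlap as an observation of a coordinate difference, and minimizes a Huber loss by iteratively reweighted least squares.

```python
def huber_weight(residual, delta):
    """Reweighting factor implied by the Huber loss (1 inside +/- delta)."""
    size = abs(residual)
    return 1.0 if size <= delta else delta / size


def estimate_positions(n_reads, overlaps, delta=5.0, outer_iters=30, sweeps=200):
    """Estimate read start coordinates from pairwise offsets.

    overlaps: list of (i, j, d), each an observation that x[j] - x[i] is about d.
    Read 0 is pinned at coordinate 0 to fix the origin.
    """
    x = [0.0] * n_reads
    weights = [1.0] * len(overlaps)
    for _ in range(outer_iters):
        # Gauss-Seidel sweeps on the weighted least-squares normal equations
        # (an assumed solver choice; any sparse linear solver would do).
        for _ in range(sweeps):
            for k in range(1, n_reads):
                num = den = 0.0
                for w, (i, j, d) in zip(weights, overlaps):
                    if j == k:
                        num += w * (x[i] + d)
                        den += w
                    elif i == k:
                        num += w * (x[j] - d)
                        den += w
                if den > 0.0:
                    x[k] = num / den
        # Down-weight observations with large residuals (suspected false overlaps).
        weights = [huber_weight(x[j] - x[i] - d, delta) for i, j, d in overlaps]
    return x


# Hypothetical toy data: four reads truly at 0, 100, 200, 300,
# plus one false overlap claiming read 3 starts 100 bases after read 0.
overlaps = [(0, 1, 100), (1, 2, 100), (2, 3, 100),
            (0, 2, 200), (1, 3, 200), (0, 3, 100)]
least_squares = estimate_positions(4, overlaps, delta=float("inf"))
robust = estimate_positions(4, overlaps, delta=5.0)
print("least squares:", [round(v, 1) for v in least_squares])
print("Huber:        ", [round(v, 1) for v in robust])
```

With `delta` set to infinity every weight stays at 1, so the same routine doubles as an ordinary least-squares baseline. On this toy input the single false overlap drags the least-squares layout far from the true coordinates, while the Huber estimate stays close to them.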
Affiliation(s)
- Shenghao Cao
  - National Center of Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Mengtian Li
  - National Center of Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Lei M Li
  - National Center of Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
2
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
3
Song W, Zhang S, Thomas T. MarkerMAG: linking metagenome-assembled genomes (MAGs) with 16S rRNA marker genes using paired-end short reads. Bioinformatics 2022; 38:3684-3688. [PMID: 35713513 DOI: 10.1093/bioinformatics/btac398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2022] [Revised: 06/08/2022] [Accepted: 06/15/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Metagenome-assembled genomes (MAGs) have substantially extended our understanding of microbial functionality. However, 16S rRNA genes, which are commonly used in phylogenetic analysis and environmental surveys, are often missing from MAGs. Here, we developed MarkerMAG, a pipeline that links 16S rRNA genes to MAGs using paired-end sequencing reads. RESULTS: Assessment of MarkerMAG on three benchmarking metagenomic datasets with various degrees of complexity shows substantial increases in the number of MAGs with 16S rRNA genes and a 100% assignment accuracy. MarkerMAG also estimates the copy number of 16S rRNA genes in MAGs with high accuracy. Assessments on three real metagenomic datasets demonstrate 1.1- to 14.2-fold increases in the number of MAGs with 16S rRNA genes. We also show that MarkerMAG-improved MAGs increase the accuracy of functional prediction from 16S rRNA gene amplicon data. MarkerMAG helps connect information in MAG databases with that in 16S rRNA databases and surveys, and hence contributes to our increasing understanding of microbial diversity, function, and phylogeny. AVAILABILITY: MarkerMAG is implemented in Python3 and freely available at https://github.com/songweizhi/MarkerMAG. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Weizhi Song
  - Centre for Marine Science & Innovation, University of New South Wales, Sydney, 2052, Australia
  - School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney, 2052, Australia
- Shan Zhang
  - Centre for Marine Science & Innovation, University of New South Wales, Sydney, 2052, Australia
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, 2052, Australia
- Torsten Thomas
  - Centre for Marine Science & Innovation, University of New South Wales, Sydney, 2052, Australia
  - School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney, 2052, Australia
4
Li M, Li LM. RegScaf: a regression approach to scaffolding. Bioinformatics 2022; 38:2675-2682. [PMID: 35561180 PMCID: PMC9326850 DOI: 10.1093/bioinformatics/btac174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2021] [Revised: 02/19/2022] [Accepted: 03/23/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Crucial to the correctness of a genome assembly is the accuracy of the underlying scaffolds that specify the orders and orientations of contigs together with the gap distances between contigs. The current methods construct scaffolds based on the alignments of 'linking' reads against contigs. We found that some 'optimal' alignments are mistaken due to factors such as the contig boundary effect, particularly in the presence of repeats. Occasionally, the incorrect alignments can even overwhelm the correct ones. Detecting the incorrect linking information is challenging for existing methods. RESULTS: In this study, we present a novel scaffolding method, RegScaf. It first examines the distribution of distances between contigs from read alignment by kernel density estimation. When a density shows multiple modes, orientation-supported links are grouped into clusters, each of which defines a linking distance corresponding to a mode. The linear model parameterizes contigs by their positions on the genome; each linking distance between a pair of contigs is then taken as an observation on the difference of their positions. The parameters are estimated by minimizing a global loss function, which is a version of the trimmed sum of squares. The least trimmed squares estimate has such a high breakdown value that it can automatically remove the mistaken linking distances. The results on both synthetic and real datasets demonstrate that RegScaf outperforms some popular scaffolders, especially in the accuracy of gap estimates, by substantially reducing extremely abnormal errors. Its strength in resolving repeat regions is exemplified by a real case. Its adaptability to large genomes and TGS long reads is validated as well. AVAILABILITY AND IMPLEMENTATION: RegScaf is publicly available at https://github.com/lemontealala/RegScaf.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mengtian Li
  - National Center of Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Lei M Li
  - To whom correspondence should be addressed.
5
Park SY, Jeon J, Kim JA, Jeon MJ, Yu NH, Kim S, Park AR, Kim JC, Lee Y, Kim Y, Choi ED, Jeong MH, Lee YH, Kim S. Draft Genome Sequence of Xylaria grammica EL000614, a Strain Producing Grammicin, a Potent Nematicidal Compound. MYCOBIOLOGY 2021; 49:294-296. [PMID: 34290553 PMCID: PMC8259839 DOI: 10.1080/12298093.2021.1914360] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2021] [Revised: 04/02/2021] [Accepted: 04/05/2021] [Indexed: 06/13/2023]
Abstract
An endolichenic fungus, Xylaria grammica strain EL000614, showed strong nematicidal effects against the plant-pathogenic nematode Meloidogyne incognita by producing grammicin. We report the genome assembly of X. grammica EL000614, comprising 25 scaffolds with a total length of 54.73 Mb, an N50 of 4.60 Mb, and 99.8% BUSCO completeness. The GC content of this genome was 44.02%. Gene families associated with the biosynthesis of secondary metabolites or regulatory proteins were identified among the 13,730 predicted gene models.
Affiliation(s)
- Sook-Young Park
  - Department of Plant Medicine, Sunchon National University, Suncheon, Korea
- Jongbum Jeon
  - Department of Agricultural Biotechnology, Interdisciplinary Program in Agricultural Genomics, Center for Fungal Genetic Resources, and Center for Fungal Pathogenesis, Seoul National University, Seoul, Korea
- Jung A Kim
  - Animal Resources Division, National Institute of Biological Resources, Incheon, Korea
- Mi Jin Jeon
  - Microorganism Resources Division, National Institute of Biological Resources, Incheon, Korea
- Nan Hee Yu
  - Department of Agricultural Chemistry, Institute of Environmentally Friendly Agriculture, Chonnam National University, Gwangju, Korea
- Seulbi Kim
  - Department of Agricultural Chemistry, Institute of Environmentally Friendly Agriculture, Chonnam National University, Gwangju, Korea
- Ae Ran Park
  - Department of Agricultural Chemistry, Institute of Environmentally Friendly Agriculture, Chonnam National University, Gwangju, Korea
- Jin-Cheol Kim
  - Department of Agricultural Chemistry, Institute of Environmentally Friendly Agriculture, Chonnam National University, Gwangju, Korea
- Yerim Lee
  - Department of Plant Medicine, Sunchon National University, Suncheon, Korea
- Youngmin Kim
  - Department of Plant Medicine, Sunchon National University, Suncheon, Korea
- Eu Ddeum Choi
  - Department of Plant Medicine, Sunchon National University, Suncheon, Korea
- Min-Hye Jeong
  - Department of Plant Medicine, Sunchon National University, Suncheon, Korea
- Yong-Hwan Lee
  - Department of Agricultural Biotechnology, Interdisciplinary Program in Agricultural Genomics, Center for Fungal Genetic Resources, and Center for Fungal Pathogenesis, Seoul National University, Seoul, Korea
- Soonok Kim
  - Microorganism Resources Division, National Institute of Biological Resources, Incheon, Korea
6
Using genetic markers to identify the origin of illegally traded agarwood-producing Aquilaria sinensis trees. Glob Ecol Conserv 2020. [DOI: 10.1016/j.gecco.2020.e00958] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022] Open
7
Draft Genome Sequence of Daldinia childiae JS-1345, an Endophytic Fungus Isolated from Stem Tissue of Korean Fir. Microbiol Resour Announc 2020; 9:9/14/e01284-19. [PMID: 32241861 PMCID: PMC7118187 DOI: 10.1128/mra.01284-19] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 02/06/2023] Open
Abstract
The fungus Daldinia childiae strain JS-1345, isolated from stem tissue of Abies koreana (Korean fir), has shown strong anti-inflammatory activity. Here, we report the genome sequence of D. childiae JS-1345. The final assembly consisted of 133 scaffolds totaling 38,652,569 bp (G+C content, 44.07%).
8
Wang A, Au KF. Performance difference of graph-based and alignment-based hybrid error correction methods for error-prone long reads. Genome Biol 2020; 21:14. [PMID: 31952552 PMCID: PMC6966875 DOI: 10.1186/s13059-019-1885-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/28/2019] [Accepted: 11/10/2019] [Indexed: 11/10/2022] Open
Abstract
The error-prone third-generation sequencing (TGS) long reads can be corrected by the high-quality second-generation sequencing (SGS) short reads, which is referred to as hybrid error correction. We here investigate the influences of the principal algorithmic factors of two major types of hybrid error correction methods by mathematical modeling and analysis on both simulated and real data. Our study reveals the distribution of accuracy gain with respect to the original long read error rate. We also demonstrate that the original error rate of 19% is the limit for perfect correction, beyond which long reads are too error-prone to be corrected by these methods.
Affiliation(s)
- Anqi Wang
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA
  - Department of Internal Medicine, University of Iowa, Iowa City, IA, 52242, USA
- Kin Fai Au
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA
  - Department of Internal Medicine, University of Iowa, Iowa City, IA, 52242, USA
  - Department of Biostatistics, University of Iowa, Iowa City, IA, 52242, USA
9
Draft Genome Sequence of Aspergillus oryzae BP2-1, Isolated from Traditional Malted Rice in South Korea. Microbiol Resour Announc 2020; 9:9/1/e01405-19. [PMID: 31896653 PMCID: PMC6940305 DOI: 10.1128/mra.01405-19] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022] Open
Abstract
The fungus Aspergillus oryzae strain BP2-1 was isolated from the traditional malted starter culture nuruk. We report here the draft whole-genome sequence of A. oryzae BP2-1, which is comprised of 14 scaffolds with a total length of 39,455,382 bp and a GC content of 47.13%.
10
Draft Genome Sequence of Amphirosellinia nigrospora JS-1675, an Endophytic Fungus from Pteris cretica. Microbiol Resour Announc 2019; 8:8/20/e00069-19. [PMID: 31097494 PMCID: PMC6522779 DOI: 10.1128/mra.00069-19] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 01/22/2023] Open
Abstract
The fungus Amphirosellinia nigrospora strain JS-1675 has been reported to exert antimicrobial effects against various plant-pathogenic bacteria and fungi. Here, we report the draft genome sequence of A. nigrospora for the first time. The assembly comprises 48,177,783 bp with 18 scaffolds.
11
Kyriakidou M, Tai HH, Anglin NL, Ellis D, Strömvik MV. Current Strategies of Polyploid Plant Genome Sequence Assembly. FRONTIERS IN PLANT SCIENCE 2018; 9:1660. [PMID: 30519250 PMCID: PMC6258962 DOI: 10.3389/fpls.2018.01660] [Citation(s) in RCA: 101] [Impact Index Per Article: 16.8] [Received: 04/13/2018] [Accepted: 10/25/2018] [Indexed: 05/14/2023]
Abstract
Polyploidy or duplication of an entire genome occurs in the majority of angiosperms. The understanding of polyploid genomes is important for the improvement of those crops, which humans rely on for sustenance and basic nutrition. As climate change continues to pose a potential threat to agricultural production, there will increasingly be a demand for plant cultivars that can resist biotic and abiotic stresses and also provide needed and improved nutrition. In the past decade, Next Generation Sequencing (NGS) has fundamentally changed the genomics landscape by providing tools for the exploration of polyploid genomes. Here, we review the challenges of the assembly of polyploid plant genomes, and also present recent advances in genomic resources and functional tools in molecular genetics and breeding. As genomes of diploid and less heterozygous progenitor species are increasingly available, we discuss the lack of complexity of these currently available reference genomes as they relate to polyploid crops. Finally, we review recent approaches of haplotyping by phasing and the impact of third generation technologies on polyploid plant genome assembly.
Affiliation(s)
- Maria Kyriakidou
  - Department of Plant Science, McGill University, Montreal, QC, Canada
- Helen H. Tai
  - Fredericton Research and Development Centre, Agriculture and Agri-Food Canada, Fredericton, NB, Canada
- Martina V. Strömvik
  - Department of Plant Science, McGill University, Montreal, QC, Canada
  - *Correspondence: Martina V. Strömvik