1
|
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
Collapse
|
2
|
Onyshchenko A, Roberts WR, Ruck EC, Lewis JA, Alverson AJ. The genome of a nonphotosynthetic diatom provides insights into the metabolic shift to heterotrophy and constraints on the loss of photosynthesis. THE NEW PHYTOLOGIST 2021; 232:1750-1764. [PMID: 34379807 PMCID: PMC9292941 DOI: 10.1111/nph.17673] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/08/2021] [Accepted: 08/03/2021] [Indexed: 05/05/2023]
Abstract
Although most of the tens of thousands of diatom species are photoautotrophs, a small number of heterotrophic species no longer photosynthesize. We sequenced the genome of a nonphotosynthetic diatom, Nitzschia Nitz4, to determine how carbon metabolism was altered in the wake of this trophic shift. Nitzschia Nitz4 has retained its plastid and plastid genome, but changes associated with the transition to heterotrophy were cellular-wide and included losses of photosynthesis-related genes from the nuclear and plastid genomes, elimination of isoprenoid biosynthesis in the plastid, and remodeling of mitochondrial glycolysis to maximize adenosine triphosphte (ATP) yield. The genome contains a β-ketoadipate pathway that may allow Nitzschia Nitz4 to metabolize lignin-derived compounds. Diatom plastids lack an oxidative pentose phosphate pathway (oPPP), leaving photosynthesis as the primary source of NADPH to support essential biosynthetic pathways in the plastid and, by extension, limiting available sources of NADPH in nonphotosynthetic plastids. The genome revealed similarities between nonphotosynthetic diatoms and apicomplexan parasites for provisioning NADPH in their plastids and highlighted the ancestral absence of a plastid oPPP as a potentially important constraint on loss of photosynthesis, a hypothesis supported by the higher frequency of transitions to parasitism or heterotrophy in lineages that have a plastid oPPP.
Collapse
Affiliation(s)
- Anastasiia Onyshchenko
- Department of Biological SciencesUniversity of Arkansas1 University of ArkansasFayettevilleAR72701USA
| | - Wade R. Roberts
- Department of Biological SciencesUniversity of Arkansas1 University of ArkansasFayettevilleAR72701USA
| | - Elizabeth C. Ruck
- Department of Biological SciencesUniversity of Arkansas1 University of ArkansasFayettevilleAR72701USA
| | - Jeffrey A. Lewis
- Department of Biological SciencesUniversity of Arkansas1 University of ArkansasFayettevilleAR72701USA
| | - Andrew J. Alverson
- Department of Biological SciencesUniversity of Arkansas1 University of ArkansasFayettevilleAR72701USA
| |
Collapse
|
3
|
Zhang X, Ping P, Hutvagner G, Blumenstein M, Li J. Aberration-corrected ultrafine analysis of miRNA reads at single-base resolution: a k-mer lattice approach. Nucleic Acids Res 2021; 49:e106. [PMID: 34291293 PMCID: PMC8631080 DOI: 10.1093/nar/gkab610] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2020] [Revised: 07/01/2021] [Accepted: 07/06/2021] [Indexed: 12/21/2022] Open
Abstract
Raw sequencing reads of miRNAs contain machine-made substitution errors, or even insertions and deletions (indels). Although the error rate can be low at 0.1%, precise rectification of these errors is critically important because isoform variation analysis at single-base resolution such as novel isomiR discovery, editing events understanding, differential expression analysis, or tissue-specific isoform identification is very sensitive to base positions and copy counts of the reads. Existing error correction methods do not work for miRNA sequencing data attributed to miRNAs’ length and per-read-coverage properties distinct from DNA or mRNA sequencing reads. We present a novel lattice structure combining kmers, (k – 1)mers and (k + 1)mers to address this problem. The method is particularly effective for the correction of indel errors. Extensive tests on datasets having known ground truth of errors demonstrate that the method is able to remove almost all of the errors, without introducing any new error, to improve the data quality from every-50-reads containing one error to every-1300-reads containing one error. Studies on experimental miRNA sequencing datasets show that the errors are often rectified at the 5′ ends and the seed regions of the reads, and that there are remarkable changes after the correction in miRNA isoform abundance, volume of singleton reads, overall entropy, isomiR families, tissue-specific miRNAs, and rare-miRNA quantities.
Collapse
Affiliation(s)
- Xuan Zhang
- Data Science Institute, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
| | - Pengyao Ping
- Data Science Institute, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
| | - Gyorgy Hutvagner
- School of Biomedical Engineering, Faculty of Engineering and IT, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
| | - Michael Blumenstein
- Faculty of Engineering and IT, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
| | - Jinyan Li
- To whom correspondence should be addressed. Tel: +61 295149264; Fax: +61 295149264;
| |
Collapse
|
4
|
Zhang X, Liu Y, Yu Z, Blumenstein M, Hutvagner G, Li J. Instance-based error correction for short reads of disease-associated genes. BMC Bioinformatics 2021; 22:142. [PMID: 34078284 PMCID: PMC8170817 DOI: 10.1186/s12859-021-04058-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2021] [Accepted: 03/02/2021] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Genomic reads from sequencing platforms contain random errors. Global correction algorithms have been developed, aiming to rectify all possible errors in the reads using generic genome-wide patterns. However, the non-uniform sequencing depths hinder the global approach to conduct effective error removal. As some genes may get under-corrected or over-corrected by the global approach, we conduct instance-based error correction for short reads of disease-associated genes or pathways. The paramount requirement is to ensure the relevant reads, instead of the whole genome, are error-free to provide significant benefits for single-nucleotide polymorphism (SNP) or variant calling studies on the specific genes. RESULTS To rectify possible errors in the short reads of disease-associated genes, our novel idea is to exploit local sequence features and statistics directly related to these genes. Extensive experiments are conducted in comparison with state-of-the-art methods on both simulated and real datasets of lung cancer associated genes (including single-end and paired-end reads). The results demonstrated the superiority of our method with the best performance on precision, recall and gain rate, as well as on sequence assembly results (e.g., N50, the length of contig and contig quality). CONCLUSION Instance-based strategy makes it possible to explore fine-grained patterns focusing on specific genes, providing high precision error correction and convincing gene sequence assembly. SNP case studies show that errors occurring at some traditional SNP areas can be accurately corrected, providing high precision and sensitivity for investigations on disease-causing point mutations.
Collapse
Affiliation(s)
- Xuan Zhang
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Yuansheng Liu
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Zuguo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, 411105, China
| | - Michael Blumenstein
- Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Gyorgy Hutvagner
- Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia.
| |
Collapse
|
5
|
Hashemi S, Hashemi SE, Lien KM, Lamb JJ. Molecular Microbial Community Analysis as an Analysis Tool for Optimal Biogas Production. Microorganisms 2021; 9:microorganisms9061162. [PMID: 34071282 PMCID: PMC8226781 DOI: 10.3390/microorganisms9061162] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2021] [Revised: 05/17/2021] [Accepted: 05/26/2021] [Indexed: 11/16/2022] Open
Abstract
The microbial diversity in anaerobic digestion (AD) is important because it affects process robustness. High-throughput sequencing offers high-resolution data regarding the microbial diversity and robustness of biological systems including AD; however, to understand the dynamics of microbial processes, knowing the microbial diversity is not adequate alone. Advanced meta-omic techniques have been established to determine the activity and interactions among organisms in biological processes like AD. Results of these methods can be used to identify biomarkers for AD states. This can aid a better understanding of system dynamics and be applied to producing comprehensive models for AD. The paper provides valuable knowledge regarding the possibility of integration of molecular methods in AD. Although meta-genomic methods are not suitable for on-line use due to long operating time and high costs, they provide extensive insight into the microbial phylogeny in AD. Meta-proteomics can also be explored in the demonstration projects for failure prediction. However, for these methods to be fully realised in AD, a biomarker database needs to be developed.
Collapse
Affiliation(s)
- Seyedbehnam Hashemi
- Department of Energy and Process Engineering & Enersense, Norwegian University of Science and Technology (NTNU), 7034 Trondheim, Norway; (S.H.); (S.E.H.); (K.M.L.)
| | - Sayed Ebrahim Hashemi
- Department of Energy and Process Engineering & Enersense, Norwegian University of Science and Technology (NTNU), 7034 Trondheim, Norway; (S.H.); (S.E.H.); (K.M.L.)
| | - Kristian M. Lien
- Department of Energy and Process Engineering & Enersense, Norwegian University of Science and Technology (NTNU), 7034 Trondheim, Norway; (S.H.); (S.E.H.); (K.M.L.)
| | - Jacob J. Lamb
- Department of Energy and Process Engineering & Enersense, Norwegian University of Science and Technology (NTNU), 7034 Trondheim, Norway; (S.H.); (S.E.H.); (K.M.L.)
- Department of Electronic Systems, Norwegian University of Science and Technology (NTNU), 7034 Trondheim, Norway
- Correspondence:
| |
Collapse
|
6
|
Ge J, Meng J, Guo N, Wei Y, Balaji P, Feng S. Counting Kmers for Biological Sequences at Large Scale. Interdiscip Sci 2019; 12:99-108. [PMID: 31734873 DOI: 10.1007/s12539-019-00348-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2019] [Revised: 08/19/2019] [Accepted: 10/25/2019] [Indexed: 11/25/2022]
Abstract
Counting the abundance of all the distinct kmers in biological sequence data is a fundamental step in bioinformatics. These applications include de novo genome assembly, error correction, etc. With the development of sequencing technology, the sequence data in a single project can reach Petabyte-scale or Terabyte-scale nucleotides. Counting demand for the abundance of these sequencing data is beyond the memory and computing capacity of single computing node, and how to process it efficiently is a challenge on a high-performance computing cluster. As such, we propose SWAPCounter, a highly scalable distributed approach for kmer counting. This approach is embedded with an MPI streaming I/O module for loading huge data set at high speed, and a counting bloom filter module for both memory and communication efficiency. By overlapping all the counting steps, SWAPCounter achieves high scalability with high parallel efficiency. The experimental results indicate that SWAPCounter has competitive performance with two other tools on shared memory environment, KMC2, and MSPKmerCounter. Moreover, SWAPCounter also shows the highest scalability under strong scaling experiments. In our experiment on Cetus supercomputer, SWAPCounter scales to 32,768 cores with 79% parallel efficiency (using 2048 cores as baseline) when processing 4 TB sequence data of 1000 Genomes. The source code of SWAPCounter is publicly available at https://github.com/mengjintao/SWAPCounter.
Collapse
Affiliation(s)
- Jianqiu Ge
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China
| | - Jintao Meng
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China
| | - Ning Guo
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China
| | - Yanjie Wei
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China.
| | - Pavan Balaji
- Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL, 60439-4844, USA
| | - Shengzhong Feng
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China
| |
Collapse
|
7
|
Heydari M, Miclotte G, Van de Peer Y, Fostier J. Illumina error correction near highly repetitive DNA regions improves de novo genome assembly. BMC Bioinformatics 2019; 20:298. [PMID: 31159722 PMCID: PMC6545690 DOI: 10.1186/s12859-019-2906-2] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2019] [Accepted: 05/17/2019] [Indexed: 11/10/2022] Open
Abstract
Background Several standalone error correction tools have been proposed to correct sequencing errors in Illumina data in order to facilitate de novo genome assembly. However, in a recent survey, we showed that state-of-the-art assemblers often did not benefit from this pre-correction step. We found that many error correction tools introduce new errors in reads that overlap highly repetitive DNA regions such as low-complexity patterns or short homopolymers, ultimately leading to a more fragmented assembly. Results We propose BrownieCorrector, an error correction tool for Illumina sequencing data that focuses on the correction of only those reads that overlap short DNA patterns that are highly repetitive in the genome. BrownieCorrector extracts all reads that contain such a pattern and clusters them into different groups using a community detection algorithm that takes into account both the sequence similarity between overlapping reads and their respective paired-end reads. Each cluster holds reads that originate from the same genomic region and hence each cluster can be corrected individually, thus providing a consistent correction for all reads within that cluster. Conclusions BrownieCorrector is benchmarked using six real Illumina datasets for different eukaryotic genomes. The prior use of BrownieCorrector improves assembly results over the use of uncorrected reads in all cases. In comparison with other error correction tools, BrownieCorrector leads to the best assembly results in most cases even though less than 2% of the reads within a dataset are corrected. Additionally, we investigate the impact of error correction on hybrid assembly where the corrected Illumina reads are supplemented with PacBio data. Our results confirm that BrownieCorrector improves the quality of hybrid genome assembly as well. BrownieCorrector is written in standard C++11 and released under GPL license. BrownieCorrector relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at https://github.com/biointec/browniecorrector. Electronic supplementary material The online version of this article (10.1186/s12859-019-2906-2) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Mahdi Heydari
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium.,Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
| | - Giles Miclotte
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium.,Bioinformatics Institute Ghent, Ghent, B-9052, Belgium
| | - Yves Van de Peer
- Bioinformatics Institute Ghent, Ghent, B-9052, Belgium.,Center for Plant Systems Biology, VIB, Ghent, B-9052, Belgium.,Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, B-9052, Belgium.,Department of Genetics, Genome Research Institute, University of Pretoria, Pretoria, South Africa
| | - Jan Fostier
- Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium. .,Bioinformatics Institute Ghent, Ghent, B-9052, Belgium.
| |
Collapse
|
8
|
Evaluation of the impact of Illumina error correction tools on de novo genome assembly. BMC Bioinformatics 2017; 18:374. [PMID: 28821237 PMCID: PMC5563063 DOI: 10.1186/s12859-017-1784-8] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2017] [Accepted: 08/11/2017] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods. RESULTS For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy. CONCLUSIONS We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.
Collapse
|
9
|
Dlugosz M, Deorowicz S. RECKONER: read error corrector based on KMC. Bioinformatics 2017; 33:1086-1089. [PMID: 28062451 DOI: 10.1093/bioinformatics/btw746] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2016] [Accepted: 11/24/2016] [Indexed: 11/12/2022] Open
Abstract
Summary Presence of sequencing errors in data produced by next-generation sequencers affects quality of downstream analyzes. Accuracy of them can be improved by performing error correction of sequencing reads. We introduce a new correction algorithm capable of processing eukaryotic close to 500 Mbp-genome-size, high error-rated data using less than 4 GB of RAM in about 35 min on 16-core computer. Availability and Implementation Program is freely available at http://sun.aei.polsl.pl/REFRESH/reckoner . Contact sebastian.deorowicz@polsl.pl. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
|
10
|
Zhao L, Chen Q, Li W, Jiang P, Wong L, Li J. MapReduce for accurate error correction of next-generation sequencing data. Bioinformatics 2017; 33:3844-3851. [PMID: 28205674 DOI: 10.1093/bioinformatics/btx089] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2016] [Accepted: 02/14/2017] [Indexed: 11/14/2022] Open
Affiliation(s)
- Liang Zhao
- School of Computing and Electronic Information, Guangxi University, Nanning, China
- Taihe Hospital, Hubei University of Medicine, Hubei, China
| | - Qingfeng Chen
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Wencui Li
- Taihe Hospital, Hubei University of Medicine, Hubei, China
| | - Peng Jiang
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Limsoon Wong
- School of Computing, National University of Singapore, Singapore, Singapore
| | - Jinyan Li
- Advanced Analytics Institute and Centre for Health Technologies, University of Technology Sydney, Broadway, NSW, Australia
| |
Collapse
|
11
|
Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novostand-alone error correction methods for NGS data. WILEY INTERDISCIPLINARY REVIEWS: COMPUTATIONAL MOLECULAR SCIENCE 2016. [DOI: 10.1002/wcms.1239] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Affiliation(s)
- Andy S. Alic
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
| | - David Ruzafa
- Departamento de Quìmica Fìsica e Instituto de Biotecnologìa, Facultad de Ciencias; Universidad de Granada; Granada Spain
| | - Joaquin Dopazo
- Department of Computational Genomics; Príncipe Felipe Research Centre (CIPF); Valencia Spain
- CIBER de Enfermedades Raras (CIBERER); Valencia Spain
- Functional Genomics Node (INB) at CIPF; Valencia Spain
| | - Ignacio Blanquer
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
- Biomedical Imaging Research Group GIBI 2; Polytechnic University Hospital La Fe; Valencia Spain
| |
Collapse
|