1
Selwyn JD, Vollmer SV. Whole genome assembly and annotation of the endangered Caribbean coral Acropora cervicornis. G3 (Bethesda, Md.) 2023; 13:jkad232. [PMID: 37804092 PMCID: PMC10700113 DOI: 10.1093/g3journal/jkad232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 09/25/2023] [Accepted: 09/27/2023] [Indexed: 10/08/2023]
Abstract
Coral species in the genus Acropora are key ecological components of coral reefs worldwide and represent the most diverse genus of scleractinian corals. While key species of Indo-Pacific Acropora have annotated genomes, no annotated genome has been published for either of the two species of Caribbean Acropora. Here we present the first fully annotated genome of the endangered Caribbean staghorn coral, Acropora cervicornis. We assembled and annotated this genome using high-fidelity nanopore long-read sequencing with gene annotations validated with mRNA sequencing. The assembled genome size is 318 Mb, with 28,059 validated genes. Comparative genomic analyses with other Acropora revealed unique features in A. cervicornis, including contractions in immune pathways and expansions in signaling pathways. Phylogenetic analysis confirms previous findings showing that A. cervicornis diverged from Indo-Pacific relatives around 41 million years ago, with the closure of the western Tethys Sea, prior to the primary radiation of Indo-Pacific Acropora. This new A. cervicornis genome enriches our understanding of the speciose Acropora and addresses evolutionary inquiries concerning speciation and hybridization in this diverse clade.
Affiliation(s)
- Jason D Selwyn
  - Department of Marine and Environmental Sciences, Northeastern University, Nahant, MA 01908, USA
- Steven V Vollmer
  - Department of Marine and Environmental Sciences, Northeastern University, Nahant, MA 01908, USA
2
Vollmer SV, Selwyn JD, Despard BA, Roesel CL. Genomic signatures of disease resistance in endangered staghorn corals. Science 2023; 381:1451-1454. [PMID: 37769073 DOI: 10.1126/science.adi3601] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/21/2023] [Accepted: 08/09/2023] [Indexed: 09/30/2023]
Abstract
White band disease (WBD) has caused unprecedented declines in the Caribbean Acropora corals, which are now listed as critically endangered species. Highly disease-resistant Acropora cervicornis genotypes exist, but the genetic underpinnings of disease resistance are not understood. Using transmission experiments, a newly assembled genome, and whole-genome resequencing of 76 A. cervicornis genotypes from Florida and Panama, we identified 10 genomic regions and 73 single-nucleotide polymorphisms that are associated with disease resistance and that include functional protein-coding changes in four genes involved in coral immunity and pathogen detection. Polygenic scores calculated from 10 genomic loci indicate that genetic screens can detect disease resistance in wild and nursery stocks of A. cervicornis across the Caribbean.
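The polygenic scores mentioned in this abstract can be illustrated with a generic additive score, in which each locus contributes its effect-allele count multiplied by a per-allele weight. The abstract does not say how the authors weighted their loci, so this is only a minimal sketch: the function name, locus names, weights, and genotype below are all hypothetical.

```python
def polygenic_score(dosages, weights):
    """Additive polygenic score: sum of weight * allele dosage over loci.

    dosages: dict mapping locus -> copies of the effect allele (0, 1, or 2)
    weights: dict mapping locus -> per-allele effect size (hypothetical values)
    Loci missing from `dosages` are skipped rather than imputed.
    """
    return sum(weights[locus] * dosages[locus]
               for locus in weights if locus in dosages)


# Hypothetical weights for three illustrative loci (not taken from the paper).
example_weights = {"locus_1": 0.5, "locus_2": -0.25, "locus_3": 1.0}
# Hypothetical genotype: effect-allele counts for one coral colony.
example_genotype = {"locus_1": 2, "locus_2": 1, "locus_3": 0}

example_score = polygenic_score(example_genotype, example_weights)  # 0.75
```

In a screening setting, a score like this would be computed per colony and compared against scores from colonies with known disease outcomes.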
Affiliation(s)
- Steven V Vollmer
  - Department of Marine and Environmental Sciences, Northeastern University, 430 Nahant Road, Nahant, MA 01908, USA
- Jason D Selwyn
  - Department of Marine and Environmental Sciences, Northeastern University, 430 Nahant Road, Nahant, MA 01908, USA
- Brecia A Despard
  - Department of Marine and Environmental Sciences, Northeastern University, 430 Nahant Road, Nahant, MA 01908, USA
- Charles L Roesel
  - Department of Marine and Environmental Sciences, Northeastern University, 430 Nahant Road, Nahant, MA 01908, USA
3
Guinet B, Lepetit D, Charlat S, Buhl PN, Notton DG, Cruaud A, Rasplus JY, Stigenberg J, de Vienne DM, Boussau B, Varaldi J. Endoparasitoid lifestyle promotes endogenization and domestication of dsDNA viruses. eLife 2023; 12:85993. [PMID: 37278068 DOI: 10.7554/elife.85993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2023] [Accepted: 05/12/2023] [Indexed: 06/07/2023]
Abstract
The accidental endogenization of viral elements within eukaryotic genomes can occasionally provide significant evolutionary benefits, giving rise to their long-term retention, that is, to viral domestication. For instance, in some endoparasitoid wasps (whose immature stages develop inside their hosts), the membrane-fusion property of double-stranded DNA viruses has been repeatedly domesticated following ancestral endogenizations. The endogenized genes provide female wasps with a delivery tool to inject virulence factors that are essential to the developmental success of their offspring. Because all known cases of viral domestication involve endoparasitic wasps, we hypothesized that this lifestyle, relying on a close interaction between individuals, may have promoted the endogenization and domestication of viruses. By analyzing the composition of 124 Hymenoptera genomes, spread over the diversity of this clade and including free-living, ecto, and endoparasitoid species, we tested this hypothesis. Our analysis first revealed that double-stranded DNA viruses, in comparison with other viral genomic structures (ssDNA, dsRNA, ssRNA), are more often endogenized and domesticated (that is, retained by selection) than expected from their estimated abundance in insect viral communities. Second, our analysis indicates that the rate at which dsDNA viruses are endogenized is higher in endoparasitoids than in ectoparasitoids or free-living hymenopterans, which also translates into more frequent events of domestication. Hence, these results are consistent with the hypothesis that the endoparasitoid lifestyle has facilitated the endogenization of dsDNA viruses, in turn increasing the opportunities for domestication events that now play a central role in the biology of many endoparasitoid lineages.
Affiliation(s)
- Benjamin Guinet
  - Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR 5558, F-69622, Villeurbanne, France
- David Lepetit
  - Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR 5558, F-69622, Villeurbanne, France
- Sylvain Charlat
  - Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR 5558, F-69622, Villeurbanne, France
- Peter N Buhl
  - Zoological Museum, Department of Entomology, University of Copenhagen, Universitetsparken, Copenhagen, Denmark
- David G Notton
  - Natural Sciences Department, National Museums Collection Centre, Edinburgh, United Kingdom
- Astrid Cruaud
  - INRAE, UMR 1062 CBGP, 755 avenue 11 du campus Agropolis CS 30016, 34988, Montferrier-sur-Lez, France
- Jean-Yves Rasplus
  - INRAE, UMR 1062 CBGP, 755 avenue 11 du campus Agropolis CS 30016, 34988, Montferrier-sur-Lez, France
- Julia Stigenberg
  - Department of Zoology, Swedish Museum of Natural History, Stockholm, Sweden
- Damien M de Vienne
  - Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR 5558, F-69622, Villeurbanne, France
- Bastien Boussau
  - Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR 5558, F-69622, Villeurbanne, France
- Julien Varaldi
  - Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR 5558, F-69622, Villeurbanne, France
4
Lai S, Pan S, Sun C, Coelho LP, Chen WH, Zhao XM. metaMIC: reference-free misassembly identification and correction of de novo metagenomic assemblies. Genome Biol 2022; 23:242. [PMID: 36376928 PMCID: PMC9661791 DOI: 10.1186/s13059-022-02810-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/10/2021] [Accepted: 11/01/2022] [Indexed: 11/16/2022]
Abstract
Evaluating the quality of metagenomic assemblies is important for constructing reliable metagenome-assembled genomes and downstream analyses. Here, we present metaMIC (https://github.com/ZhaoXM-Lab/metaMIC), a machine learning-based tool for identifying and correcting misassemblies in metagenomic assemblies. Benchmarking results on both simulated and real datasets demonstrate that metaMIC outperforms existing tools when identifying misassembled contigs. Furthermore, metaMIC is able to localize the misassembly breakpoints, and the correction of misassemblies by splitting at misassembly breakpoints can improve downstream scaffolding and binning results.
Affiliation(s)
- Senying Lai
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Shaojun Pan
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Chuqing Sun
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Luis Pedro Coelho
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
- Wei-Hua Chen
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
  - College of Life Science, Henan Normal University, Xinxiang, Henan, China
- Xing-Ming Zhao
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
  - State Key Laboratory of Medical Neurobiology, Institutes of Brain Science, Fudan University, Shanghai, China
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China
  - International Human Phenome Institutes (Shanghai), Shanghai, China
  - Zhangjiang Fudan International Innovation Center, Shanghai, China
5
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
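Of the strategies named in this abstract, the de Bruijn graph (DBG) approach is the easiest to show in a few lines. The following is a minimal sketch under simplifying assumptions: error-free reads, a single strand, and a walk that checks only out-degree. The function names and toy reads are hypothetical and are not taken from the reviewed paper or from any particular assembler.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguous(graph, start):
    """Spell a contig by walking from `start` while each node has one successor.

    Simplification: only out-degree is checked, and the walk stops when it
    would revisit a node (a cycle, as produced by a repeat).
    """
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads drawn from the sequence ATGGCGTGCA.
toy_reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
toy_graph = build_de_bruijn(toy_reads, k=4)
toy_contig = extend_unambiguous(toy_graph, "ATG")  # "ATGGCGTGCA"
```

Real assemblers add error correction, reverse-complement handling, and repeat resolution on top of this structure, which is where the misassemblies discussed in the abstract tend to arise.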
6
Liao X, Li M, Luo J, Zou Y, Wu FX, Luo F, Wang J. EPGA-SC: A Framework for de novo Assembly of Single-Cell Sequencing Reads. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1492-1503. [PMID: 31603794 DOI: 10.1109/tcbb.2019.2945761] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/10/2023]
Abstract
Assembling genomes from single-cell sequencing data is essential for single-cell studies. However, single-cell assemblies are challenging due to (i) the highly non-uniform read coverage and (ii) the elevated levels of sequencing errors and chimeric reads. Although several assemblers for single-cell data have been proposed in recent years, most of them fail to construct correct long contigs. In this study, we present a new framework called EPGA-SC for de novo assembly of single-cell sequencing reads. The EPGA assembler includes strategies designed to solve the problems caused by sequencing errors, sequencing biases, and repetitive regions. However, the extremely unbalanced coverage and richer error types prevent EPGA from achieving high performance on single-cell sequencing data. In this study, we designed EPGA-SC based on EPGA. The main innovations of EPGA-SC are as follows: (i) classifying reads to reduce the proportion of false reads; (ii) using multiple sets of high-precision paired-end reads generated from the high-precision assemblies produced by other assemblers such as SPAdes to overcome the impact of sequencing biases and repetitive regions; and (iii) developing novel algorithms for removing chimeric errors and extending contigs. We test EPGA-SC with seven datasets. The experimental results show that EPGA-SC can generate better assemblies than most current tools in most cases in terms of MAX contig, N50, NG50, NA50, and NGA50.
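Two of the contiguity metrics listed at the end of this abstract, N50 and NG50, can be computed directly from contig lengths. The sketch below uses the standard definitions; the function name and toy lengths are hypothetical. NA50 and NGA50 additionally require breaking contigs at misassemblies after aligning them to a reference, which is not shown here.

```python
def n50(contig_lengths, total=None):
    """Return N50, or NG50 when `total` is an estimated genome size.

    N50 is the length L such that contigs of length >= L together cover at
    least half of the total assembly length. With `total` given, half of that
    value is used as the target instead. Returns 0 if the target is never
    reached.
    """
    lengths = sorted(contig_lengths, reverse=True)
    target = (total if total is not None else sum(lengths)) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    return 0


# Toy assembly of seven contigs totalling 300 bp.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)              # 70: 80 + 70 reaches half of 300
toy_ng50 = n50(toy_lengths, total=400)  # 50: 80 + 70 + 50 reaches half of 400
```

Because NG50 is normalized by genome size rather than assembly size, it allows assemblies of different total lengths to be compared on the same scale.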
7
Zhang Z, Luo J, Shang J, Li M, Wu FX, Pan Y, Wang J. Deletion Detection Method Using the Distribution of Insert Size and a Precise Alignment Strategy. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1070-1081. [PMID: 31403441 DOI: 10.1109/tcbb.2019.2934407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Homozygous and heterozygous deletions commonly exist in the human genome. For current structural variation detection tools, it is important to determine whether a deletion is homozygous or heterozygous. However, the problems of sequencing errors, micro-homologies, and micro-insertions prevent common alignment tools from identifying accurate breakpoint locations, and often result in false structural variation calls. In this study, we present a novel deletion detection tool called Sprites2. Compared with Sprites, Sprites2 makes the following modifications: (1) The distribution of insert size is used in Sprites2, which can identify the type of deletions and improve the accuracy of deletion calls. (2) A precise alignment method based on AGE (an algorithm that simultaneously aligns the 5' and 3' ends of two sequences) is adopted in Sprites2 to identify breakpoints, which helps to resolve the problems introduced by sequencing errors, micro-homologies, and micro-insertions. To test and verify the performance of Sprites2, simulated and real datasets are used in our experiments, and Sprites2 is compared with five popular tools. The experimental results show that Sprites2 can improve the performance of deletion detection. Sprites2 can be downloaded from https://github.com/zhangzhen/sprites2.
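The general idea of using the insert-size distribution to type a deletion can be sketched as follows. This is not the Sprites2 algorithm, whose details the abstract does not give; it is a generic illustration that assumes read pairs spanning a deletion map with an inflated insert size, and that the share of such discordant pairs hints at zygosity. The function names, the three-standard-deviation cutoff, and the 0.8 threshold are all hypothetical.

```python
from statistics import mean, stdev


def discordant_pairs(insert_sizes, library_mean, library_sd, n_sd=3.0):
    """Return insert sizes larger than library_mean + n_sd * library_sd."""
    cutoff = library_mean + n_sd * library_sd
    return [size for size in insert_sizes if size > cutoff]


def classify_deletion(spanning_insert_sizes, library_mean, library_sd,
                      hom_fraction=0.8):
    """Crude zygosity guess from the fraction of discordant spanning pairs.

    hom_fraction is a hypothetical threshold: if nearly all spanning pairs
    are discordant, no intact allele seems to be present.
    """
    if not spanning_insert_sizes:
        return "no evidence"
    n_discordant = len(
        discordant_pairs(spanning_insert_sizes, library_mean, library_sd))
    fraction = n_discordant / len(spanning_insert_sizes)
    if fraction == 0:
        return "no deletion"
    return "homozygous" if fraction >= hom_fraction else "heterozygous"


# Toy background library used to estimate the insert-size distribution.
background = [480, 490, 495, 500, 500, 505, 510, 520]
lib_mean, lib_sd = mean(background), stdev(background)

het_call = classify_deletion([500, 495, 1500, 1490], lib_mean, lib_sd)
hom_call = classify_deletion([1500, 1490, 1510, 1505], lib_mean, lib_sd)
```

Here half of the spanning pairs in the first region look normal, suggesting one intact allele, while every pair in the second region is stretched by roughly 1 kb.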
8
Luo J, Wei Y, Lyu M, Wu Z, Liu X, Luo H, Yan C. A comprehensive review of scaffolding methods in genome assembly. Brief Bioinform 2021; 22:6149347. [PMID: 33634311 DOI: 10.1093/bib/bbab033] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 09/14/2020] [Revised: 01/21/2021] [Accepted: 01/22/2021] [Indexed: 12/20/2022]
Abstract
In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.
Affiliation(s)
- Junwei Luo
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Yawei Wei
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Mengna Lyu
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Zhengjiang Wu
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Xiaoyan Liu
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
9
Liao X, Gao X, Zhang X, Wu FX, Wang J. RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads. BMC Bioinformatics 2020; 21:463. [PMID: 33076827 PMCID: PMC7574428 DOI: 10.1186/s12859-020-03779-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/10/2019] [Accepted: 09/24/2020] [Indexed: 11/16/2022]
Abstract
Background: Repetitive sequences account for a large proportion of eukaryotic genomes. Identification of repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines or tools make use of assembly of the high-frequency k-mers to obtain repeats. However, a certain degree of sequence coverage is required for assemblers to get the desired assemblies. On the other hand, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For the above reasons, it is difficult to obtain complete and accurate repetitive regions in the genome by using existing tools. Results: In this study, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. Firstly, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Secondly, RepAHR filters the high-frequency reads from whole NGS reads according to certain rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled to generate repeats by using SPAdes, which is considered an outstanding genome assembler for NGS sequences. Conclusions: We test RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo for detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase and some other metrics.
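The first two steps described in the Results, finding high-frequency k-mers and then keeping the reads built from them, can be sketched as below. The abstract only says that reads are filtered "according to certain rules", so the rule used here (keep a read when at least half of its k-mers are high-frequency) is an assumption, as are the function names, thresholds, and toy reads. The final assembly step with SPAdes is not shown.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def high_frequency_reads(reads, k, min_count, min_fraction=0.5):
    """Keep reads in which at least `min_fraction` of k-mers are high-frequency.

    A k-mer is high-frequency when it occurs at least `min_count` times in
    the whole read set. Both thresholds are hypothetical filtering rules.
    """
    counts = count_kmers(reads, k)
    frequent = {kmer for kmer, c in counts.items() if c >= min_count}
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if kmers and sum(km in frequent for km in kmers) / len(kmers) >= min_fraction:
            kept.append(read)
    return kept


# Three reads from a tandem ACGT repeat plus two unrelated unique reads.
toy_reads = ["ACGTACGT", "CGTACGTA", "GTACGTAC", "TTGACCAG", "GGATCCAA"]
toy_kept = high_frequency_reads(toy_reads, k=4, min_count=3)
```

Only the three repeat-derived reads survive the filter, and because whole reads are retained rather than isolated k-mers, the local structure of the repeat is preserved for the assembler.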
Affiliation(s)
- Xingyu Liao
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Rd, ChangSha, 410083, China
- Xin Gao
  - Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Xiankai Zhang
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Rd, ChangSha, 410083, China
- Fang-Xiang Wu
  - Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Rd, ChangSha, 410083, China
10
Liao X, Li M, Zou Y, Wu FX, Pan Y, Wang J. An Efficient Trimming Algorithm based on Multi-Feature Fusion Scoring Model for NGS Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:728-738. [PMID: 30736001 DOI: 10.1109/tcbb.2019.2897558] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 06/09/2023]
Abstract
Next-generation sequencing (NGS) has enabled an exponential growth rate of sequencing data. However, several sequence artifacts, including error reads (base calling errors and small insertions or deletions) and poor quality reads, can significantly affect downstream sequence processing and analysis. Here, we present PE-Trimmer, a sensitive and specialized trimming algorithm for NGS sequences. First, PE-Trimmer removes technical sequences in paired-end reads based on the characteristics of low quality reads in NGS data. Second, PE-Trimmer determines the range of reads that need to be trimmed according to the quality score statistics histogram of reads in the library. To improve the accuracy of this algorithm, we design a light-weight and easy-to-explain scoring model to evaluate candidates in the trimming step. Finally, PE-Trimmer selects the appropriate trimming strategy to process the low quality reads based on the location determined by the scoring model. PE-Trimmer is able to locate and remove adapter residues from the paired-end reads. It is easily configurable and offers superior throughput in the multi-threaded mode. We test PE-Trimmer on five datasets and compare it with five recent methods. The experimental results demonstrate that PE-Trimmer produces better results than the other trimmers.
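For readers unfamiliar with quality trimming, the sketch below shows a plain sliding-window trim on Phred+33 quality strings. It is deliberately a simpler, swapped-in technique: it does not implement PE-Trimmer's histogram analysis or scoring model, which the abstract does not specify. The function names, window size, and quality threshold are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_mean_q=20, window=4):
    """Cut the read at the start of the first window whose mean quality is low.

    Windows slide one base at a time from the 5' end. Reads shorter than the
    window, or with no failing window, are returned unchanged.
    """
    scores = phred_scores(quality_string)
    for start in range(max(len(scores) - window + 1, 0)):
        win = scores[start:start + window]
        if sum(win) / window < min_mean_q:
            return sequence[:start], quality_string[:start]
    return sequence, quality_string


# Toy read: six high-quality bases (Q40, 'I') then four poor ones (Q2, '#').
toy_seq = "ACGTACGTAC"
toy_qual = "IIIIII####"
trimmed_seq, trimmed_qual = trim_3prime(toy_seq, toy_qual)
```

Cutting at the window start is conservative and can discard a good base next to the low-quality tail, which is one reason more refined position-selection schemes are proposed.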
11
Luo Y, Liao X, Wu FX, Wang J. Computational Approaches for Transcriptome Assembly Based on Sequencing Technologies. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190410155603] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/22/2022]
Abstract
Transcriptome assembly plays a critical role in studying biological properties and examining the expression levels of genomes in specific cells. It is also the basis of many downstream analyses. With the increase of speed and the decrease in cost, massive sequencing data continues to accumulate. A large number of assembly strategies based on different computational methods and experiments have been developed. How to efficiently perform transcriptome assembly with high sensitivity and accuracy becomes a key issue. In this work, the issues with transcriptome assembly are explored based on different sequencing technologies. Specifically, transcriptome assemblies with next-generation sequencing reads are divided into reference-based assemblies and de novo assemblies. The examples of different species are used to illustrate that long reads produced by the third-generation sequencing technologies can cover full-length transcripts without assemblies. In addition, different transcriptome assemblies using the Hybrid-seq methods and other tools are also summarized. Finally, we discuss the future directions of transcriptome assemblies.
Affiliation(s)
- Yuwen Luo
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Xingyu Liao
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Fang-Xiang Wu
  - Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan, Canada
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha, China
12
Tang L, Li M, Wu FX, Pan Y, Wang J. MAC: Merging Assemblies by Using Adjacency Algebraic Model and Classification. Front Genet 2020; 10:1396. [PMID: 32082361 PMCID: PMC7005248 DOI: 10.3389/fgene.2019.01396] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/23/2019] [Accepted: 12/20/2019] [Indexed: 12/13/2022]
Abstract
With the generation of a large amount of sequencing data, different assemblers have emerged to perform de novo genome assembly. Because a single strategy can hardly fit the various biases of datasets, none of these tools outperforms the others on all species. The process of assembly reconciliation is to merge multiple assemblies and generate a high-quality consensus assembly. Several assembly reconciliation tools have been proposed. However, the existing reconciliation tools cannot produce a merged assembly that simultaneously has better contiguity and contains fewer errors, and the results of these tools usually depend on the ranking of input assemblies. In this study, we propose a novel assembly reconciliation tool, MAC, which merges assemblies by using the adjacency algebraic model and classification. To solve the problem of uneven sequencing depth and sequencing errors, MAC identifies consensus blocks between contig sets to construct an adjacency graph. To solve the problem of repetitive regions, MAC employs classification to optimize the adjacency algebraic model. Moreover, MAC designs an overall scoring function to solve the problem of unknown ranking of input assembly sets. The experimental results from four species of GAGE-B demonstrate that MAC outperforms other assembly reconciliation tools.
Affiliation(s)
- Li Tang
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Fang-Xiang Wu
  - School of Computer Science and Engineering, Central South University, Changsha, China
  - Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK, Canada
- Yi Pan
  - School of Computer Science and Engineering, Central South University, Changsha, China
  - Department of Computer Science, Georgia State University, Atlanta, GA, United States
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha, China
14
A Sequence-Based Novel Approach for Quality Evaluation of Third-Generation Sequencing Reads. Genes (Basel) 2019; 10:genes10010044. [PMID: 30646604 PMCID: PMC6356754 DOI: 10.3390/genes10010044] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 11/29/2018] [Revised: 01/07/2019] [Accepted: 01/08/2019] [Indexed: 11/19/2022]
Abstract
The advent of third-generation sequencing (TGS) technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, provides new possibilities for contig assembly, scaffolding, and high-performance computing in bioinformatics due to their long reads. However, the high error rate and poor quality of TGS reads pose new challenges for accurate genome assembly and long-read alignment. Efficient processing methods are needed to prioritize high-quality reads for improving the results of error correction and assembly. In this study, we proposed a novel Read Quality Evaluation and Selection Tool (REQUEST) for evaluating the quality of third-generation long reads. REQUEST generates training data of high-quality and low-quality reads which are characterized by their nucleotide combinations. A linear regression model was built to score the quality of reads. The method was tested on three datasets of different species. The results showed that the top-scored reads prioritized by REQUEST achieved higher alignment accuracies. The contig assembly results based on the top-scored reads also outperformed conventional approaches that use all reads. REQUEST is able to distinguish high-quality reads from low-quality ones without using reference genomes, making it a promising alternative sequence-quality evaluation method to alignment-based algorithms.
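The idea of scoring a read from its nucleotide combinations with a linear model can be sketched as follows. The abstract does not say which combinations REQUEST uses or what its fitted coefficients are, so this sketch assumes dinucleotide frequencies as features and uses made-up coefficients that simply penalize homopolymer pairs. All names and values are hypothetical; in practice the coefficients would come from fitting a regression on labelled high-quality and low-quality reads.

```python
from itertools import product

# The 16 dinucleotides in a fixed order, used as feature positions.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_features(read):
    """Return the frequencies of the 16 overlapping dinucleotides in `read`."""
    total = max(len(read) - 1, 1)
    counts = {d: 0 for d in DINUCLEOTIDES}
    for i in range(len(read) - 1):
        pair = read[i:i + 2]
        if pair in counts:  # skip pairs containing N or other symbols
            counts[pair] += 1
    return [counts[d] / total for d in DINUCLEOTIDES]


def linear_score(features, coefficients, intercept=0.0):
    """Score = intercept + sum of coefficient * feature (a linear model)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Hypothetical coefficients: penalize homopolymer dinucleotides (AA, CC, GG, TT).
toy_coefficients = [-1.0 if d[0] == d[1] else 0.0 for d in DINUCLEOTIDES]

score_mixed = linear_score(
    dinucleotide_features("ACGTACGT"), toy_coefficients, intercept=1.0)
score_homopolymer = linear_score(
    dinucleotide_features("AAAAAAAA"), toy_coefficients, intercept=1.0)
```

Because the features depend only on the read sequence, a score of this kind needs no reference genome, which matches the reference-free property the abstract emphasizes.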