1
Nkurikiyimfura O, Waheed A, Fang H, Yuan X, Chen L, Wang YP, Lu G, Zhan J, Yang L. Fitness difference between two synonymous mutations of Phytophthora infestans ATP6 gene. BMC Ecol Evol 2024; 24:36. [PMID: 38494489 PMCID: PMC10946160 DOI: 10.1186/s12862-024-02223-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2023] [Accepted: 03/11/2024] [Indexed: 03/19/2024] Open
Abstract
BACKGROUND Sequence variation produced by mutation provides the ultimate source of natural selection for species adaptation. Unlike nonsynonymous mutation, synonymous mutations are generally considered to be selectively neutral but accumulating evidence suggests they also contribute to species adaptation by regulating the flow of genetic information and the development of functional traits. In this study, we analysed sequence characteristics of ATP6, a housekeeping gene from 139 Phytophthora infestans isolates, and compared the fitness components including metabolic rate, temperature sensitivity, aggressiveness, and fungicide tolerance among synonymous mutations. RESULTS We found that the housekeeping gene exhibited low genetic variation and was represented by two major synonymous mutants at similar frequency (0.496 and 0.468, respectively). The two synonymous mutants were generated by a single nucleotide substitution but differed significantly in fitness as well as temperature-mediated spatial distribution and expression. The synonymous mutant ending in AT was more common in cold regions and was more expressed at lower experimental temperature than the synonymous mutant ending in GC and vice versa. CONCLUSION Our results are consistent with the argument that synonymous mutations can modulate the adaptive evolution of species including pathogens and have important implications for sustainable disease management, especially under climate change.
Affiliation(s)
- Oswald Nkurikiyimfura
  - Institute of Plant Virology, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou, Fujian, 350002, China
- Abdul Waheed
  - Institute of Plant Virology, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou, Fujian, 350002, China
- Hanmei Fang
  - Institute of Plant Virology, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou, Fujian, 350002, China
- Xiaoxian Yuan
  - Institute of Plant Virology, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou, Fujian, 350002, China
- Lixia Chen
  - Fujian Key Laboratory on Conservation and Sustainable Utilization of Marine Biodiversity, Fuzhou Institute of Oceanography, Minjiang University, Fuzhou, 350108, China
  - College of Resources and Environment, Fujian Agriculture and Forestry University, Fuzhou, Fujian, 350002, China
- Yan-Ping Wang
  - College of Chemistry and Life Sciences, Sichuan Provincial Key Laboratory for Development and Utilization of Characteristic Horticultural Biological Resources, Chengdu Normal University, Chengdu, Sichuan, 611130, China
- Guodong Lu
  - Department of Plant Pathology, Fujian Agriculture and Forestry University, Fuzhou, Fujian, 350002, China
- Jiasui Zhan
  - Department of Forest Mycology and Plant Pathology, Swedish University of Agricultural Sciences, Uppsala, 75007, Sweden
- Lina Yang
  - Fujian Key Laboratory on Conservation and Sustainable Utilization of Marine Biodiversity, Fuzhou Institute of Oceanography, Minjiang University, Fuzhou, 350108, China
2
Sami A, El-Metwally S, Rashad MZ. MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads. BMC Bioinformatics 2024; 25:61. [PMID: 38321434 PMCID: PMC10848413 DOI: 10.1186/s12859-024-05681-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 01/29/2024] [Indexed: 02/08/2024] Open
Abstract
BACKGROUND The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality, as errors become more prevalent. This creates the need for different approaches to detect and filter errors, and moves data quality assurance from the hardware space to the software preprocessing stages. RESULTS We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms. These models are trained on features extracted through the computation of Term Frequency-Inverse Document Frequency (TF-IDF) values from various datasets such as E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1 and Metriaclima zebra. Notably, Naive Bayes demonstrated robust performance across various datasets, displaying high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, tools like Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed rates above 99%. BFC and RECKONER exceeded 98%, while Fiona had 95.78%. For the Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For the Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also producing reasonable numbers of mapped reads to the reference genome.
CONCLUSIONS This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.
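As a rough illustration of the feature pipeline this abstract describes (TF-IDF values computed over reads and fed to a Naive Bayes classifier), the following is a minimal pure-Python sketch. It is not the MAC-ErrorReads implementation: treating overlapping k-mers as the "terms", the k-mer size, the smoothing constant, the function names and the toy reads are all assumptions made for illustration.

```python
import math
from collections import Counter


def kmers(read, k=3):
    """Overlapping k-mers of a read; assumed here to play the role of 'terms'."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def fit_idf(reads, k=3):
    """Smoothed inverse document frequency of each k-mer in the training reads."""
    doc_freq = Counter()
    for read in reads:
        doc_freq.update(set(kmers(read, k)))
    n = len(reads)
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in doc_freq.items()}


def tfidf(read, idf, k=3):
    """TF-IDF weights of one read; k-mers unseen in training are dropped."""
    counts = Counter(kmers(read, k))
    total = sum(counts.values()) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def train_nb(reads, labels, idf, k=3, alpha=0.1):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""
    weights = {c: Counter() for c in set(labels)}
    for read, label in zip(reads, labels):
        weights[label].update(tfidf(read, idf, k))
    class_counts = Counter(labels)
    model = {}
    for c, w in weights.items():
        denom = sum(w.values()) + alpha * len(idf)
        model[c] = {
            "log_prior": math.log(class_counts[c] / len(labels)),
            "log_like": {t: math.log((w[t] + alpha) / denom) for t in idf},
        }
    return model


def predict(read, model, idf, k=3):
    """Return the class with the highest log-posterior score for this read."""
    vec = tfidf(read, idf, k)

    def score(c):
        like = model[c]["log_like"]
        return model[c]["log_prior"] + sum(v * like[t] for t, v in vec.items())

    return max(model, key=score)


# Toy data (hypothetical): error-free reads from a repeating ACGT reference,
# and copies that each carry a single substitution.
clean = ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT", "TACGTACGTA"]
noisy = ["ACGTTCGTAC", "CGTACCTACG", "GTAGGTACGT", "TACGTAAGTA"]
reads = clean + noisy
labels = ["ok"] * len(clean) + ["err"] * len(noisy)

idf = fit_idf(reads)
model = train_nb(reads, labels, idf)
print(predict("ACGTACGT", model, idf))
print(predict("ACGTTCGT", model, idf))
```

On real data the reads would be labelled by alignment to a reference or by a simulator, and a library vectorizer and classifier would normally replace these hand-written functions.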
Affiliation(s)
- Amira Sami
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
- Sara El-Metwally
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
  - Biomedical Informatics Department, Faculty of Computer Science and Engineering, New Mansoura University, Gamasa, 35712, Egypt
- M Z Rashad
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
3
Długosz M, Deorowicz S. Illumina reads correction: evaluation and improvements. Sci Rep 2024; 14:2232. [PMID: 38278837 DOI: 10.1038/s41598-024-52386-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Accepted: 01/18/2024] [Indexed: 01/28/2024] Open
Abstract
The paper focuses on the correction of Illumina WGS sequencing reads. We provide an extensive evaluation of the existing correctors. To this end, we measure the impact of correction on variant calling (VC) as well as de novo assembly. The evaluation shows that, in selected cases, read correction improves the quality of VC results. We also examine the algorithms' behaviour when processing Illumina NovaSeq reads, whose quality characteristics differ from those of older sequencers. We show that most of the algorithms are ready to cope with such reads. Finally, we introduce a new version of RECKONER, our read corrector, which we have optimized and equipped with a new correction strategy. RECKONER can now correct high-coverage human reads in less than 2.5 h, copes with two types of read errors (indels and substitutions), and uses a new correction verification technique based on two oligomer lengths.
Affiliation(s)
- Maciej Długosz
  - Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100, Gliwice, Poland
- Sebastian Deorowicz
  - Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100, Gliwice, Poland
4
Pourmohammadi R, Abouei J, Anpalagan A. Error analysis of the PacBio sequencing CCS reads. Int J Biostat 2023; 19:439-453. [PMID: 37155831 DOI: 10.1515/ijb-2021-0091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2021] [Accepted: 09/07/2022] [Indexed: 05/10/2023]
Abstract
Third-generation sequencing technologies such as Pacific Biosciences and Oxford Nanopore provide a faster, more cost-effective and simpler assembly process, generating longer reads than next-generation sequencing. However, the error rates of these long reads are higher than those of short reads, so an error-correction step is needed before assembly, such as the use of Circular Consensus Sequencing (CCS) reads in PacBio sequencing machines. In this paper, we propose a probabilistic model for the error occurrence along the CCS reads. We obtain the error probability of an arbitrary nucleotide, as well as the base-calling Phred quality score of the nucleotides along the CCS reads, in terms of the number of sub-reads. Furthermore, we derive the error rate distribution of the reads in relation to the pass number. It follows the binomial distribution, which can be approximated by the normal distribution for long reads. Finally, we evaluate our proposed model by comparing it with three real PacBio datasets: the Lambda and E. coli genomes and an Alzheimer's disease targeted experiment.
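The two quantities this abstract relies on, the Phred quality score and a binomial count of errors per read, follow standard definitions that can be sketched as below. This is not the authors' CCS model, which expresses the error probability in terms of the number of sub-reads; the sketch only assumes an independent, identical per-base error probability `p`, and the function names are illustrative.

```python
import math


def phred_from_error(p):
    """Phred quality score Q = -10 * log10(p) for a base-call error probability p."""
    return -10.0 * math.log10(p)


def error_from_phred(q):
    """Inverse mapping: error probability p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def errors_pmf(k, n, p):
    """Binomial probability of exactly k errors in a read of n bases,
    assuming each base errs independently with probability p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def normal_approx(n, p):
    """Mean and standard deviation of the normal approximation to the
    binomial error count, reasonable when n is large (long reads)."""
    return n * p, math.sqrt(n * p * (1.0 - p))


# Example: Q30 corresponds to one expected error per 1000 bases.
print(phred_from_error(0.001))
print(normal_approx(10_000, 0.01))
```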
Affiliation(s)
- Reza Pourmohammadi
  - WINEL Research Laboratory at the Department of Electrical Engineering, Yazd University, Yazd, Iran
- Jamshid Abouei
  - WINEL Research Laboratory at the Department of Electrical Engineering, Yazd University, Yazd, Iran
- Alagan Anpalagan
  - Department of Electrical, Computer and Biomedical Engineering, Ryerson University, Toronto, Canada
5
Darnet E, Teixeira B, Schaller H, Rogez H, Darnet S. Elucidating the Mesocarp Drupe Transcriptome of Açai (Euterpe oleracea Mart.): An Amazonian Tree Palm Producer of Bioactive Compounds. Int J Mol Sci 2023; 24:ijms24119315. [PMID: 37298279 DOI: 10.3390/ijms24119315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Revised: 05/13/2023] [Accepted: 05/16/2023] [Indexed: 06/12/2023] Open
Abstract
The Euterpe oleracea palm, endemic to the Amazon region, is well known for açai, a violet fruit beverage with nutritional and medicinal properties. During E. oleracea fruit ripening, anthocyanin accumulation is not related to sugar production, unlike in grape and blueberry. Ripened fruits have a high content of anthocyanins, isoprenoids, fibers, and proteins, and are poor in sugars. E. oleracea is proposed as a new genetic model for metabolism partitioning in the fruit. Approximately 255 million single-end-oriented reads were generated on an Ion Proton NGS platform, combining fruit cDNA libraries at four ripening stages. The de novo transcriptome assembly was tested using six assemblers and 46 different combinations of parameters, with a pre-processing and a post-processing step. The multiple k-mer approach with TransABySS as the assembler and Evidential Gene as the post-processor gave the best results, with an N50 of 959 bp, a mean read coverage of 70x, a BUSCO complete sequence recovery of 36% and an RBMT of 61%. The fruit transcriptome dataset included 22,486 transcripts representing 18 Mbp, of which 87% had significant homology with other plant sequences. Approximately 904 new EST-SSRs were described, and were common and transferable to Phoenix dactylifera and Elaeis guineensis, two other palm trees. The global GO classification of transcripts showed categories similar to those in P. dactylifera and E. guineensis fruit transcriptomes. For an accurate annotation and functional description of metabolism genes, a bioinformatic pipeline was developed to precisely identify orthologs, such as one-to-one orthologs between species, and to infer multigenic family evolution. The phylogenetic inference confirmed the occurrence of duplication events in the Arecaceae lineage and the presence of orphan genes in E. oleracea. Anthocyanin and tocopherol pathways were annotated entirely. Interestingly, the anthocyanin pathway showed a high number of paralogs, similar to grape, whereas the tocopherol pathway exhibited a low and conserved gene number and the prediction of several splicing forms. The release of this exhaustively annotated molecular dataset of E. oleracea constitutes a valuable tool for further studies in metabolism partitioning and opens new perspectives for studying fruit physiology with açai as a model.
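The N50 quoted in this abstract is a standard assembly statistic: the length of the shortest contig (or transcript) in the smallest set of longest sequences that together cover at least half of the total assembled length. A minimal sketch of that definition follows; the function name and the example lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    at which the running total first reaches half of the overall total.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50 requires at least one sequence length")


# Hypothetical contig lengths: total 32, half 16, reached after 8 + 8.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why completeness measures such as BUSCO are reported alongside it.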
Affiliation(s)
- Elaine Darnet
  - Centre for Valorization of Amazonian Bioactive Compounds (CVACBA) & Institute of Biological Sciences, Federal University of Pará (UFPA), Belém 66075-750, PA, Brazil
  - International Associated Laboratory PALMHEAT, French Scientific Research National Center (CNRS)/UFPA, 75016 Paris, France
- Bruno Teixeira
  - Centre for Valorization of Amazonian Bioactive Compounds (CVACBA) & Institute of Biological Sciences, Federal University of Pará (UFPA), Belém 66075-750, PA, Brazil
- Hubert Schaller
  - International Associated Laboratory PALMHEAT, French Scientific Research National Center (CNRS)/UFPA, 75016 Paris, France
  - Plant Isoprenoid Biology, Institute of Molecular Biology of Plants of the Scientific Research National Center, Strasbourg University, 67081 Strasbourg, France
- Hervé Rogez
  - Centre for Valorization of Amazonian Bioactive Compounds (CVACBA) & Institute of Biological Sciences, Federal University of Pará (UFPA), Belém 66075-750, PA, Brazil
- Sylvain Darnet
  - Centre for Valorization of Amazonian Bioactive Compounds (CVACBA) & Institute of Biological Sciences, Federal University of Pará (UFPA), Belém 66075-750, PA, Brazil
  - International Associated Laboratory PALMHEAT, French Scientific Research National Center (CNRS)/UFPA, 75016 Paris, France
  - Plant Isoprenoid Biology, Institute of Molecular Biology of Plants of the Scientific Research National Center, Strasbourg University, 67081 Strasbourg, France
6
Cheng C, Fei Z, Xiao P. Methods to improve the accuracy of next-generation sequencing. Front Bioeng Biotechnol 2023; 11:982111. [PMID: 36741756 PMCID: PMC9895957 DOI: 10.3389/fbioe.2023.982111] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 06/30/2022] [Accepted: 01/11/2023] [Indexed: 01/21/2023] Open
Abstract
Next-generation sequencing (NGS) is present in all fields of life science, where it has greatly promoted the development of basic research while gradually being applied in clinical diagnosis. However, the cost and throughput advantages of next-generation sequencing are offset by large tradeoffs with respect to read length and accuracy. Specifically, its high error rate makes it extremely difficult to detect SNPs or low-abundance mutations, limiting its clinical applications, such as pharmacogenomics studies primarily based on SNPs and early clinical diagnosis primarily based on low-abundance mutations. Currently, Sanger sequencing is still considered to be the gold standard due to its high accuracy, so the results of next-generation sequencing require verification by Sanger sequencing in clinical practice. In order to maintain high-quality next-generation sequencing data, a variety of improvements at the levels of template preparation, sequencing strategy and data processing have been developed. This study summarized the general procedures of next-generation sequencing platforms, highlighting the improvements involved in eliminating errors at each step. Furthermore, the challenges and future development of next-generation sequencing in clinical application were discussed.
7
Expósito RR, Martínez-Sánchez M, Touriño J. SparkEC: speeding up alignment-based DNA error correction tools. BMC Bioinformatics 2022; 23:464. [PMID: 36344928 PMCID: PMC9639292 DOI: 10.1186/s12859-022-05013-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2022] [Accepted: 10/26/2022] [Indexed: 11/09/2022] Open
Abstract
Background In recent years, huge improvements have been made in the context of sequencing genomic data under what is called Next Generation Sequencing (NGS). However, the DNA reads generated by current NGS platforms are not free of errors, which can affect the quality of downstream analysis. Although error correction can be performed as a preprocessing step to overcome this issue, it usually requires long computational times to analyze the large datasets generated nowadays through NGS. Therefore, new software capable of scaling out on a cluster of nodes with high performance is of great importance. Results In this paper, we present SparkEC, a parallel tool capable of fixing the errors produced during the sequencing process. For this purpose, the algorithms proposed by the CloudEC tool, which has already been shown to perform accurate corrections, have been analyzed and optimized to improve their performance by relying on the Apache Spark framework together with the introduction of other enhancements such as the usage of memory-efficient data structures and the avoidance of any input preprocessing. The experimental results have shown significant improvements in the computational times of SparkEC when compared to CloudEC for all the representative datasets and scenarios under evaluation, providing average and maximum speedups of 4.9× and 11.9×, respectively, over its counterpart. Conclusion As error correction can take excessive computational time, SparkEC provides a scalable solution for correcting large datasets. Due to its distributed implementation, SparkEC speed can increase with respect to the number of nodes in a cluster. Furthermore, the software is freely available under GPLv3 license and is compatible with different operating systems (Linux, Windows and macOS). Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-05013-1.
Affiliation(s)
- Roberto R. Expósito
  - Universidade da Coruña, CITIC, Computer Architecture Group, Campus de Elviña, 15071 A Coruña, Spain
- Marco Martínez-Sánchez
  - Universidade da Coruña, CITIC, Computer Architecture Group, Campus de Elviña, 15071 A Coruña, Spain
- Juan Touriño
  - Universidade da Coruña, CITIC, Computer Architecture Group, Campus de Elviña, 15071 A Coruña, Spain
8
Yang LN, Ouyang H, Nkurikiyimfura O, Fang H, Waheed A, Li W, Wang YP, Zhan J. Genetic variation along an altitudinal gradient in the Phytophthora infestans effector gene Pi02860. Front Microbiol 2022; 13:972928. [PMID: 36160230 PMCID: PMC9492930 DOI: 10.3389/fmicb.2022.972928] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/19/2022] [Accepted: 08/10/2022] [Indexed: 11/13/2022] Open
Abstract
Effector genes, together with climatic and other environmental factors, play multifaceted roles in the development of plant diseases. Understanding the role of environmental factors, particularly climate conditions affecting the evolution of effector genes, is important for predicting the long-term value of the genes in controlling agricultural diseases. Here, we collected Phytophthora infestans populations from five locations along a mountainous hill in China and sequenced the effector gene Pi02860 from >300 isolates. To minimize the influence of other ecological factors, isolates were sampled from the same potato cultivar on the same day. We also expressed the gene to visualise its cellular location, assayed its pathogenicity and evaluated its response to experimental temperatures. We found that Pi02860 exhibited moderate genetic variation at the nucleotide level which was mainly generated by point mutation. The mutations did not change the cellular location of the effector gene but significantly modified the fitness of P. infestans. Genetic variation and pathogenicity of the effector gene were positively associated with the altitude of sample sites, possibly due to increased mutation rate induced by the vertical distribution of environmental factors such as UV radiation and temperature. We further found that Pi02860 expression was regulated by experimental temperature with reduced expression as experimental temperature increased. Together, these results indicate that UV radiation and temperature are important environmental factors regulating the evolution of effector genes and provide us with considerable insight as to their future sustainable action under climate and other environmental change.
Affiliation(s)
- Li-Na Yang
  - Fujian Key Laboratory on Conservation and Sustainable Utilization of Marine Biodiversity, Fuzhou Institute of Oceanography, Minjiang University, Fuzhou, China
  - *Correspondence: Li-Na Yang
- Haibing Ouyang
  - Department of Plant Pathology, Nanjing Agricultural University, Nanjing, China
- Oswald Nkurikiyimfura
  - Institute of Plant Pathology, Fujian Agriculture and Forestry University, Fuzhou, China
- Hanmei Fang
  - Institute of Plant Pathology, Fujian Agriculture and Forestry University, Fuzhou, China
- Abdul Waheed
  - Institute of Plant Pathology, Fujian Agriculture and Forestry University, Fuzhou, China
- Wenyang Li
  - Institute of Plant Pathology, Fujian Agriculture and Forestry University, Fuzhou, China
- Yan-Ping Wang
  - College of Chemistry and Life Sciences, Sichuan Provincial Key Laboratory for Development and Utilization of Characteristic Horticultural Biological Resources, Chengdu Normal University, Chengdu, China
- Jiasui Zhan
  - Department of Forest Mycology and Plant Pathology, Swedish University of Agricultural Sciences, Uppsala, Sweden
  - *Correspondence: Jiasui Zhan
9
Wang YP, Yang LN, Feng YY, Liu S, Zhan J. Single Amino Acid Substitution in the DNA Repairing Gene Radiation-Sensitive 4 Contributes to Ultraviolet Tolerance of a Plant Pathogen. Front Microbiol 2022; 13:927139. [PMID: 35910660 PMCID: PMC9330021 DOI: 10.3389/fmicb.2022.927139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2022] [Accepted: 06/21/2022] [Indexed: 11/13/2022] Open
Abstract
To successfully survive and reproduce, all species constantly modify the structure and expression of their genomes to cope with changing environmental conditions, including ultraviolet (UV) radiation. Knowledge of species adaptation to environmental changes is therefore a central theme of evolutionary studies, with potentially important implications for disease management and social-ecological sustainability, but it is generally insufficient. Here, we investigated the evolution of UV adaptation in organisms by population genetic analysis of sequence structure, physiochemistry, transcription, and fitness variation in the radiation-sensitive 4 (RAD4) gene of the Irish potato famine pathogen Phytophthora infestans sampled from various altitudes. We found that RAD4 is a key gene determining the resistance of the pathogen to UV stress, as indicated by strong phenotype-genotype-geography associations and upregulated transcription after UV exposure. We also found conserved evolution in the RAD4 gene. Only five nucleotide haplotypes, corresponding to three protein isoforms generated by point mutations, were detected in the 140 sequences analyzed, and the mutations were constrained to the N-terminal domain of the protein. Physiochemical changes associated with non-synonymous mutations impose a severe fitness penalty on mutants, which are purged by natural selection, leading to the conserved evolution observed in the gene.
Affiliation(s)
- Yan-Ping Wang
  - Sichuan Provincial Key Laboratory for Development and Utilization of Characteristic Horticultural Biological Resources, College of Chemistry and Life Sciences, Chengdu Normal University, Chengdu, China
- Li-Na Yang
  - Institute of Oceanography, Minjiang University, Fuzhou, China
- Yuan-Yuan Feng
  - Sichuan Provincial Key Laboratory for Development and Utilization of Characteristic Horticultural Biological Resources, College of Chemistry and Life Sciences, Chengdu Normal University, Chengdu, China
- Songqing Liu
  - Sichuan Provincial Key Laboratory for Development and Utilization of Characteristic Horticultural Biological Resources, College of Chemistry and Life Sciences, Chengdu Normal University, Chengdu, China
  - *Correspondence: Songqing Liu
- Jiasui Zhan
  - Department of Forest Mycology and Plant Pathology, Swedish University of Agricultural Sciences, Uppsala, Sweden
  - *Correspondence: Jiasui Zhan
10
Tang T, Hutvagner G, Wang W, Li J. Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies. Brief Funct Genomics 2022; 21:387-398. [PMID: 35848773 DOI: 10.1093/bfgp/elac016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2022] [Revised: 06/10/2022] [Accepted: 06/14/2022] [Indexed: 11/14/2022] Open
Abstract
Next-Generation Sequencing has produced incredible amounts of short-read sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both de novo assembly and error correction methods utilize the overlaps between reads, a concern is whether the sequencing errors that negatively affect genome assemblies also affect the compression of NGS data. This work addresses two problems: how current error correction algorithms can enable compression algorithms to make the sequence data much more compact, and whether the reads modified by error-correction algorithms lead to quality improvements in de novo contig assembly. As multiple sets of short reads are often produced by a single biomedical project in practice, we propose a graph-based method to reorder the files in the collection of multiple sets and then compress them simultaneously for a further compression improvement after error correction. We use examples to illustrate that accurate error correction algorithms can significantly reduce the number of mismatched nucleotides in reference-free compression, and hence can greatly improve compression performance. Extensive tests on practical collections of multiple short-read sets confirm that the compression performance on the error-corrected data (with unchanged size) significantly outperforms that on the original data, and that the file reordering idea contributes further. Error correction of the original reads has also resulted in quality improvements of the genome assemblies, sometimes remarkably. However, how to combine appropriate error correction methods with an assembly algorithm so that assembly performance is always significantly improved remains an open question.
Affiliation(s)
- Tao Tang
  - Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
  - School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210003, Jiangsu, China
- Gyorgy Hutvagner
  - School of Biomedical Engineering, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
- Wenjian Wang
  - School of Computer and Information Technology, Shanxi University, Shanxi Road, 030006, Shanxi, China
- Jinyan Li
  - Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
11
Furneaux B, Bahram M, Rosling A, Yorou NS, Ryberg M. Long- and short-read metabarcoding technologies reveal similar spatiotemporal structures in fungal communities. Mol Ecol Resour 2021; 21:1833-1849. [PMID: 33811446 DOI: 10.1111/1755-0998.13387] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/22/2020] [Revised: 02/19/2021] [Accepted: 03/01/2021] [Indexed: 01/04/2023]
Abstract
Fungi form diverse communities and play essential roles in many terrestrial ecosystems, yet there are methodological challenges in taxonomic and phylogenetic placement of fungi from environmental sequences. To address such challenges, we investigated spatiotemporal structure of a fungal community using soil metabarcoding with four different sequencing strategies: short-amplicon sequencing of the ITS2 region (300-400 bp) with Illumina MiSeq, Ion Torrent Ion S5 and PacBio RS II, all from the same PCR library, as well as long-amplicon sequencing of the full ITS and partial LSU regions (1200-1600 bp) with PacBio RS II. Resulting community structure and diversity depended more on statistical method than sequencing technology. The use of long-amplicon sequencing enables construction of a phylogenetic tree from metabarcoding reads, which facilitates taxonomic identification of sequences. However, long reads present issues for denoising algorithms in diverse communities. We present a solution that splits the reads into shorter homologous regions prior to denoising, and then reconstructs the full denoised reads. In the choice between short and long amplicons, we suggest a hybrid approach using short amplicons for sampling breadth and depth, and long amplicons to characterize the local species pool for improved identification and phylogenetic analyses.
Affiliation(s)
- Brendan Furneaux
- Program in Systematic Biology, Department of Organismal Biology, Uppsala University, Uppsala, Sweden
| | - Mohammad Bahram
- Department of Ecology, Swedish University of Agricultural Sciences, Uppsala, Sweden.,Institute of Ecology and Earth Sciences, University of Tartu, Tartu, Estonia
| | - Anna Rosling
- Program in Evolutionary Biology, Department of Ecology and Genetics, Uppsala University, Uppsala, Sweden
| | - Nourou S Yorou
- Research Unit in Tropical Mycology and Plant-Fungi Interactions, LEB, University of Parakou, Parakou, Benin
| | - Martin Ryberg
- Program in Systematic Biology, Department of Organismal Biology, Uppsala University, Uppsala, Sweden
|
12
|
Wang YP, Wu EJ, Lurwanu Y, Ding JP, He DC, Waheed A, Nkurikiyimfura O, Liu ST, Li WY, Wang ZH, Yang L, Zhan J. Evidence for a synergistic effect of post-translational modifications and genomic composition of eEF-1α on the adaptation of Phytophthora infestans. Ecol Evol 2021; 11:5484-5496. [PMID: 34026022 PMCID: PMC8131795 DOI: 10.1002/ece3.7442] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/13/2021] [Revised: 02/19/2021] [Accepted: 02/21/2021] [Indexed: 12/18/2022] Open
Abstract
Genetic variation plays a fundamental role in pathogens' adaptation to environmental stresses. Pathogens with low genetic variation tend to survive and proliferate more poorly because they lack the genotypic/phenotypic polymorphisms needed to respond to fluctuating environments. Evolutionary theory hypothesizes that the adaptive disadvantage of genes with low genomic variation can be compensated for by structural diversity of proteins through post-translational modification (PTM), but this theory is rarely tested experimentally and its implication for sustainable disease management is hardly discussed. In this study, we analyzed nucleotide characteristics of the eukaryotic translation elongation factor-1α (eEF-1α) gene from 165 Phytophthora infestans isolates and the physical and chemical properties of its derived proteins. We found low sequence variation in the eEF-1α protein, possibly attributable to purifying selection and a lack of intra-genic recombination rather than reduced mutation. Of the only two isoforms detected by the study, the major one accounted for >95% of the pathogen collection and displayed a significantly higher fitness than the minor one. High lysine representation enhances the opportunity of the eEF-1α protein to be methylated, and the absence of disulfide bonds is consistent with the structural prediction showing that many disordered regions exist in the protein. Methylation, structural disordering, and possibly other PTMs ensure the ability of the protein to modify its functions during biological, cellular and biochemical processes, and compensate for its adaptive disadvantage caused by sequence conservation. Our results indicate that PTMs may function synergistically with nucleotide codes to regulate the adaptive landscape of eEF-1α, and possibly of other housekeeping genes, in P. infestans. Compensatory evolution between the pre- and post-translational phases in eEF-1α could enable pathogens to adapt quickly to disease management strategies while efficiently maintaining the critical roles the protein plays in biological, cellular, and biochemical activities. Implications of these results for sustainable plant disease management are discussed.
Affiliation(s)
- Yan-Ping Wang
- Key lab for Bio pesticide and Chemical Biology Ministry of Education Fujian Agriculture and Forestry University Fuzhou China
| | - E-Jiao Wu
- Key lab for Bio pesticide and Chemical Biology Ministry of Education Fujian Agriculture and Forestry University Fuzhou China
| | - Yahuza Lurwanu
- Key lab for Bio pesticide and Chemical Biology Ministry of Education Fujian Agriculture and Forestry University Fuzhou China
- Department of Crop Protection Bayero University Kano Kano Nigeria
| | - Ji-Peng Ding
- Key lab for Bio pesticide and Chemical Biology Ministry of Education Fujian Agriculture and Forestry University Fuzhou China
| | - Dun-Chun He
- School of Economics and Trade Fujian Jiangxia University Fuzhou China
| | - Abdul Waheed
- Key lab for Bio pesticide and Chemical Biology Ministry of Education Fujian Agriculture and Forestry University Fuzhou China
| | - Oswald Nkurikiyimfura
- Key lab for Bio pesticide and Chemical Biology Ministry of Education Fujian Agriculture and Forestry University Fuzhou China
| | - Shi-Ting Liu
- Key lab for Bio pesticide and Chemical Biology Ministry of Education Fujian Agriculture and Forestry University Fuzhou China
| | - Wen-Yang Li
- Key lab for Bio pesticide and Chemical Biology Ministry of Education Fujian Agriculture and Forestry University Fuzhou China
| | - Zong-Hua Wang
- Fujian University Key Laboratory for Plant-Microbe Interaction College of Life Sciences Fujian Agriculture and Forestry University Fuzhou China
- Institute of Oceanography Minjiang University Fuzhou China
| | - Lina Yang
- Key lab for Bio pesticide and Chemical Biology Ministry of Education Fujian Agriculture and Forestry University Fuzhou China
- Institute of Oceanography Minjiang University Fuzhou China
| | - Jiasui Zhan
- Department of Forest Mycology and Plant Pathology Swedish University of Agricultural Sciences Uppsala Sweden
|
13
|
Kuster RD, Yencho GC, Olukolu BA. ngsComposer: an automated pipeline for empirically based NGS data quality filtering. Brief Bioinform 2021; 22:6210066. [PMID: 33822850 PMCID: PMC8425578 DOI: 10.1093/bib/bbab092] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/30/2020] [Revised: 01/19/2021] [Accepted: 03/01/2021] [Indexed: 12/26/2022] Open
Abstract
Next-generation sequencing (NGS) enables massively parallel acquisition of large-scale omics data; however, objective data quality filtering parameters are lacking. Although a useful metric, evidence reveals that platform-generated Phred values overestimate per-base quality scores. We have developed novel and empirically based algorithms that streamline NGS data quality filtering. The pipeline leverages known sequence motifs to enable empirical estimation of error rates, detection of erroneous base calls and removal of contaminating adapter sequence. The performance of motif-based error detection and quality filtering was further validated with read compression rates as an unbiased metric. Elevated error rates at read ends, where known motifs lie, tracked with propagation of erroneous base calls. Barcode swapping, an inherent problem with pooled libraries, was also effectively mitigated. The ngsComposer pipeline is suitable for various NGS protocols and platforms due to the universal concepts on which the algorithms are based.
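The comparison between platform-reported Phred values and motif-based empirical error rates can be illustrated with a minimal sketch. It assumes the standard Phred definition (Q = -10 log10 p) and a hypothetical helper, `empirical_phred`, that counts mismatches against a motif whose true sequence is known; it is not the ngsComposer implementation.

```python
import math


def phred_to_error_prob(q: int) -> float:
    """Error probability claimed by a Phred score Q: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def empirical_phred(observed: list[str], expected_motif: str) -> float:
    """Empirical Phred score from mismatches against a known motif.

    `observed` holds read segments that should equal `expected_motif`
    (for example a barcode). Hypothetical helper for illustration only.
    """
    mismatches = 0
    total = 0
    for segment in observed:
        for got, want in zip(segment, expected_motif):
            total += 1
            mismatches += got != want
    if total == 0:
        raise ValueError("no bases to compare")
    if mismatches == 0:
        return float("inf")  # no observed errors, so no finite estimate
    return -10 * math.log10(mismatches / total)
```

If the empirical score for motif positions is well below the mean reported Phred score at the same positions, the reported values are overestimating quality there.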
Affiliation(s)
- Ryan D Kuster
- Department of Entomology and Plant Pathology, University of Tennessee, USA
| | - G Craig Yencho
- Department of Horticultural Science, NC State University, USA
| | - Bode A Olukolu
- Department of Entomology and Plant Pathology, University of Tennessee, USA
|
14
|
Garcia-Garcia S, Cortese MF, Rodríguez-Algarra F, Tabernero D, Rando-Segura A, Quer J, Buti M, Rodríguez-Frías F. Next-generation sequencing for the diagnosis of hepatitis B: current status and future prospects. Expert Rev Mol Diagn 2021; 21:381-396. [PMID: 33880971 DOI: 10.1080/14737159.2021.1913055] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 02/07/2023]
Abstract
INTRODUCTION Hepatitis B virus (HBV) causes a complex and persistent infection with a major impact on patients' health. Viral-genome sequencing can provide valuable information for characterizing virus genotype, infection dynamics and drug and vaccine resistance. AREAS COVERED This article reviews the current literature to describe the next-generation sequencing progress that facilitated a more comprehensive study of HBV quasispecies in diagnosis and clinical monitoring. EXPERT OPINION HBV variability plays a key role in liver disease progression and treatment efficacy. Second-generation sequencing improved the sensitivity for detecting and quantifying mutations, mixed genotypes and viral recombination. Third-generation sequencing enables the analysis of the entire HBV genome, although the high error rate limits its use in clinical practice.
Affiliation(s)
- Selene Garcia-Garcia
- Liver Pathology Unit, Departments of Biochemistry and Microbiology, Hospital Universitari Vall d'Hebron, Universitat Autònoma De Barcelona, Barcelona Spain
- Clinical Biochemistry Research Group, Vall d'Hebron Institut Recerca (VHIR), Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
| | - Maria Francesca Cortese
- Liver Pathology Unit, Departments of Biochemistry and Microbiology, Hospital Universitari Vall d'Hebron, Universitat Autònoma De Barcelona, Barcelona Spain
- Clinical Biochemistry Research Group, Vall d'Hebron Institut Recerca (VHIR), Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
| | - Francisco Rodríguez-Algarra
- Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, UK
| | - David Tabernero
- Centro De Investigación Biomédica En Red De Enfermedades Hepáticas Y Digestivas, Instituto De Salud Carlos III, Madrid Spain
| | - Ariadna Rando-Segura
- Liver Pathology Unit, Departments of Biochemistry and Microbiology, Hospital Universitari Vall d'Hebron, Universitat Autònoma De Barcelona, Barcelona Spain
| | - Josep Quer
- Centro De Investigación Biomédica En Red De Enfermedades Hepáticas Y Digestivas, Instituto De Salud Carlos III, Madrid Spain
- Liver Unit, Liver Disease Laboratory-Viral Hepatitis, Vall d'Hebron Institut Recerca-Hospital Universitari Vall d'Hebron, Universitat Autònoma De Barcelona, Barcelona Spain
| | - Maria Buti
- Centro De Investigación Biomédica En Red De Enfermedades Hepáticas Y Digestivas, Instituto De Salud Carlos III, Madrid Spain
- Liver Unit, Department of Internal Medicine, Hospital Universitari Vall d'Hebron, Universitat Autònoma De Barcelona, Barcelona Spain
| | - Francisco Rodríguez-Frías
- Liver Pathology Unit, Departments of Biochemistry and Microbiology, Hospital Universitari Vall d'Hebron, Universitat Autònoma De Barcelona, Barcelona Spain
- Clinical Biochemistry Research Group, Vall d'Hebron Institut Recerca (VHIR), Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
- Centro De Investigación Biomédica En Red De Enfermedades Hepáticas Y Digestivas, Instituto De Salud Carlos III, Madrid Spain
|
15
|
Abstract
Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST (Unitig-STitch) that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which, we show, can store a set of k-mers by using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which, we show, improves index size by 10%-44% compared with other state-of-the-art low-memory indices.
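The idea of a spectrum-preserving string set can be sketched as a greedy path cover over the k-mers. The sketch below is a simplified illustration under stated assumptions: it treats k-mers as plain strings, ignores reverse complements, walks individual k-mers rather than compacted unitigs, and is not the UST algorithm itself. The function names are hypothetical.

```python
def kmers_of(s: str, k: int) -> list[str]:
    """All k-mers of s, in order of occurrence."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def greedy_spss(kmer_set: set[str], k: int) -> list[str]:
    """Greedily stitch k-mers into strings so that every k-mer of the
    input appears exactly once across the output strings."""
    if k < 2:
        raise ValueError("k must be at least 2")
    visited: set[str] = set()
    strings: list[str] = []
    for start in sorted(kmer_set):
        if start in visited:
            continue
        visited.add(start)
        path = start
        # Extend to the right while an unvisited successor k-mer exists.
        while True:
            suffix = path[-(k - 1):]
            nxt = next((suffix + c for c in "ACGT"
                        if suffix + c in kmer_set and suffix + c not in visited), None)
            if nxt is None:
                break
            visited.add(nxt)
            path += nxt[-1]
        # Extend to the left while an unvisited predecessor k-mer exists.
        while True:
            prefix = path[:k - 1]
            prv = next((c + prefix for c in "ACGT"
                        if c + prefix in kmer_set and c + prefix not in visited), None)
            if prv is None:
                break
            visited.add(prv)
            path = prv[0] + path
        strings.append(path)
    return strings
```

Because each stitched string of n k-mers costs n + k - 1 characters instead of n * k, fewer and longer strings give a smaller representation of the same k-mer set.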
Affiliation(s)
- Amatur Rahman
- Department of Computer Science and Engineering, Penn State, University Park, State College, PA, USA
| | - Paul Medvedev
- Department of Computer Science and Engineering, Penn State, University Park, State College, PA, USA
- Department of Biochemistry and Molecular Biology, Penn State, University Park, State College, PA, USA
- Center for Computational Biology and Bioinformatics, Penn State, University Park, State College, PA, USA
|
16
|
Heo Y, Manikandan G, Ramachandran A, Chen D. Comprehensive Evaluation of Error-Correction Methodologies for Genome Sequencing Data. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
|
17
|
Zhang H, Jain C, Aluru S. A comprehensive evaluation of long read error correction methods. BMC Genomics 2020; 21:889. [PMID: 33349243 PMCID: PMC7751105 DOI: 10.1186/s12864-020-07227-0] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 11/08/2020] [Accepted: 11/12/2020] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND Third-generation single molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to assess these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. RESULTS In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research. CONCLUSIONS Despite the high error rate of long reads, the state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in terms of correction quality and computing resource usage. When choosing tools, practitioners are advised to be careful with the few correction tools that discard reads, and to check the effect of error correction tools on downstream analysis. Our evaluation code is available as open-source at https://github.com/haowenz/LRECE.
Affiliation(s)
- Haowen Zhang
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332, GA, USA
| | - Chirag Jain
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332, GA, USA
| | - Srinivas Aluru
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332, GA, USA; Institute for Data Engineering and Science, Georgia Institute of Technology, Atlanta, 30332, GA, USA
|
18
|
Abstract
Read alignment is the central step of many analytic pipelines that perform variant calling. To reduce error, it is common practice to pre-process raw sequencing reads to remove low-quality bases and residual adapter contamination, a procedure collectively known as ‘trimming’. Trimming is widely assumed to increase the accuracy of variant calling, although there are relatively few systematic evaluations of its effects and no clear consensus on its efficacy. As sequencing datasets increase both in number and size, it is worthwhile reappraising computational operations of ambiguous benefit, particularly when the scope of many analyses now routinely incorporates thousands of samples, increasing the time and cost required. Using a curated set of 17 Gram-negative bacterial genomes, this study initially evaluated the impact of four read-trimming utilities (Atropos, fastp, Trim Galore and Trimmomatic), each used with a range of stringencies, on the accuracy and completeness of three bacterial SNP-calling pipelines. It was found that read trimming made only small, and statistically insignificant, increases in SNP-calling accuracy even when using the highest-performing pre-processor in this study, fastp. To extend these findings, >6500 publicly archived sequencing datasets from Escherichia coli, Mycobacterium tuberculosis and Staphylococcus aureus were re-analysed using a common analytic pipeline. Of the approximately 125 million SNPs and 1.25 million indels called across all samples, the same bases were called in 98.8 and 91.9 % of cases, respectively, irrespective of whether raw reads or trimmed reads were used. Nevertheless, the proportion of mixed calls (i.e. calls where <100 % of the reads support the variant allele; considered a proxy of false positives) was significantly reduced after trimming, which suggests that while trimming rarely alters the set of variant bases, it can affect the proportion of reads supporting each call. 
It was concluded that read quality- and adapter-trimming add relatively little value to a SNP-calling pipeline and may only be necessary if small differences in the absolute number of SNP calls, or the false call rate, are critical. Broadly similar conclusions can be drawn about the utility of trimming to an indel-calling pipeline. Read trimming remains routinely performed prior to variant calling, likely out of concern that doing otherwise would typically have negative consequences. While historically this may have been the case, the data in this study suggest that read trimming is not always a practical necessity.
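For readers unfamiliar with what quality trimming does to a read, here is a minimal sketch of 3'-end trimming. It assumes Phred+33 encoded quality strings and a simple "strip trailing bases below a threshold" rule; it does not reproduce the algorithm of Atropos, fastp, Trim Galore or Trimmomatic, and the function names and the default threshold of 20 are illustrative assumptions.

```python
def decode_phred33(qual_string: str) -> list[int]:
    """Decode a FASTQ quality string that uses the Phred+33 offset."""
    return [ord(ch) - 33 for ch in qual_string]


def trim_3prime(seq: str, quals: list[int], min_q: int = 20) -> tuple[str, list[int]]:
    """Remove bases from the 3' end while their quality is below min_q.

    Illustrative rule only; real trimmers typically use sliding windows
    or running-sum criteria and also remove adapter sequence.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]
```

A read whose low-quality tail is removed this way still aligns to the same locus, which is consistent with the finding that trimming rarely changes which variant bases are called.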
Affiliation(s)
- Stephen J Bush
- Nuffield Department of Medicine, University of Oxford, Oxford, UK
|
19
|
Steyaert A, Audenaert P, Fostier J. Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields. BMC Bioinformatics 2020; 21:402. [PMID: 32928110 PMCID: PMC7491180 DOI: 10.1186/s12859-020-03740-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2020] [Accepted: 09/04/2020] [Indexed: 12/01/2022] Open
Abstract
Background De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence, also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. k+1-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data. Results To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. Conclusions We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.
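The per-node baseline that the abstract contrasts with the CRF model can be sketched as follows. This is an assumption-laden illustration: each k-mer's multiplicity is estimated independently by rounding its coverage divided by an assumed average coverage per genomic copy, with no use of neighbouring nodes or arcs. The names `count_kmers` and `naive_multiplicity` are hypothetical, and this is not the method implemented in the linked repository.

```python
from collections import Counter


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count every k-mer occurrence across all reads (coverage per node)."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def naive_multiplicity(count: int, coverage_per_copy: float) -> int:
    """Independent estimate: nearest integer to count / coverage_per_copy.

    A result of 0 flags a likely erroneous k-mer; 2 or more suggests a repeat.
    """
    return int(count / coverage_per_copy + 0.5)
```

With noisy coverage, a count halfway between two multiples of the per-copy coverage is ambiguous under this rule, which is the situation where combining evidence from adjacent nodes and arcs can help.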
|
20
|
Asalone KC, Ryan KM, Yamadi M, Cohen AL, Farmer WG, George DJ, Joppert C, Kim K, Mughal MF, Said R, Toksoz-Exley M, Bisk E, Bracht JR. Regional sequence expansion or collapse in heterozygous genome assemblies. PLoS Comput Biol 2020; 16:e1008104. [PMID: 32735589 PMCID: PMC7423139 DOI: 10.1371/journal.pcbi.1008104] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 11/01/2019] [Revised: 08/12/2020] [Accepted: 06/29/2020] [Indexed: 12/13/2022] Open
Abstract
High levels of heterozygosity present a unique genome assembly challenge and can adversely impact downstream analyses, yet are common in sequencing datasets obtained from non-model organisms. Here we show that by re-assembling a heterozygous dataset with variant parameters and different assembly algorithms, we are able to generate assemblies whose protein annotations are statistically enriched for specific gene ontology categories. While total assembly length was not significantly affected by the assembly methodologies tested, the assemblies generated varied widely in fragmentation level, and we show local assembly collapse or expansion underlying the enrichment or depletion of specific protein functional groups. We show that these statistically significant deviations in gene ontology groups can occur in seemingly high-quality assemblies, and result from difficult-to-detect local sequence expansions or contractions. Given the unpredictable interplay between assembly algorithm, parameter, and biological sequence data heterozygosity, we highlight the need for better measures of assembly quality than N50 value, including methods for assessing local expansion and collapse.
In the genomic era, genomes must be reconstructed from fragments using computational methods, or assemblers. How do we know that a new genome assembly is correct? This is important because errors in assembly can lead to downstream problems in gene predictions and these inaccurate results can contaminate databases, affecting later comparative studies. A particular challenge occurs when a diploid organism inherits two highly divergent genome copies from its parents. While it is widely appreciated that this type of data is difficult for assemblers to handle properly, here we show that the process is prone to more errors than previously appreciated. Specifically, we document examples of regional expansion and collapse, affecting downstream gene prediction accuracy, but without changing the overall genome assembly size or other metrics of accuracy. Our results suggest that assembly evaluation methods should be altered to identify whether regional expansions and collapses are present in the genome assembly.
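Since the abstract argues that N50 alone is an insufficient quality measure, a short sketch of how N50 is computed may help show why. It uses the common definition (the contig length at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half the total assembly length); the function name is an illustrative choice.

```python
def n50(contig_lengths: list[int]) -> int:
    """Length L such that contigs of length >= L cover at least half
    of the total assembly length."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty positive lengths")
```

The statistic depends only on contig lengths, so it cannot register a region that was collapsed or duplicated during assembly as long as the length distribution stays similar.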
Affiliation(s)
- Kathryn C. Asalone
- Biology Department, American University, Washington DC, United States of America
| | - Kara M. Ryan
- Biology Department, American University, Washington DC, United States of America
| | - Maryam Yamadi
- Biology Department, American University, Washington DC, United States of America
| | - Annastelle L. Cohen
- Biology Department, American University, Washington DC, United States of America
| | - William G. Farmer
- Biology Department, American University, Washington DC, United States of America
| | - Deborah J. George
- Biology Department, American University, Washington DC, United States of America
| | - Claudia Joppert
- Biology Department, American University, Washington DC, United States of America
| | - Kaitlyn Kim
- Biology Department, American University, Washington DC, United States of America
| | - Madeeha Froze Mughal
- Biology Department, American University, Washington DC, United States of America
| | - Rana Said
- Biology Department, American University, Washington DC, United States of America
| | - Metin Toksoz-Exley
- Mathematics and Statistics Department, American University, Washington DC, United States of America
| | - Evgeny Bisk
- Office of Information Technology, American University, Washington DC, United States of America
| | - John R. Bracht
- Biology Department, American University, Washington DC, United States of America
|
21
|
Yu Z, Du F, Ban R, Zhang Y. SimuSCoP: reliably simulate Illumina sequencing data based on position and context dependent profiles. BMC Bioinformatics 2020; 21:331. [PMID: 32703148 PMCID: PMC7379788 DOI: 10.1186/s12859-020-03665-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/08/2018] [Accepted: 07/16/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND A number of simulators have been developed for emulating next-generation sequencing data by incorporating known errors such as base substitutions and indels. However, their practicality may be degraded by functional and runtime limitations. In particular, positional and genomic contextual information is not effectively utilized for reliably characterizing base substitution patterns, and the positional and contextual differences in Phred quality scores have not been fully investigated. Thus, a more effective and efficient bioinformatics tool is sorely needed. RESULTS Here, we introduce a novel tool, SimuSCoP, to reliably emulate complex DNA sequencing data. The base substitution patterns and the statistical behavior of quality scores in Illumina sequencing data are fully explored and integrated into the simulation model for reliably emulating datasets for different applications. In addition, an integrated and easy-to-use pipeline is employed in SimuSCoP to facilitate end-to-end simulation of complex samples, and high runtime efficiency is achieved by implementing the tool to run in multithreading with low memory consumption. These features enable SimuSCoP to achieve substantial improvements in reliability, functionality, practicality and runtime efficiency. The tool is comprehensively evaluated in multiple aspects including consistency of profiles, simulation of genomic variations and complex tumor samples, and the results demonstrate the advantages of SimuSCoP over existing tools. CONCLUSIONS SimuSCoP, a new bioinformatics tool, is developed to learn informative profiles from real sequencing data and reliably mimic complex data by introducing various genomic variations. We believe that the presented work will catalyse new development of downstream bioinformatics methods for analyzing sequencing data.
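The notion of a position-dependent substitution profile can be illustrated with a toy read simulator. This is a heavily simplified sketch: it models only a per-position substitution probability, with uniform choice among the three alternative bases, and omits the sequence-context dependence, quality scores and indels that the abstract describes. The function name and the profile format are assumptions, not SimuSCoP's interface.

```python
import random


def simulate_read(template: str, sub_rate_by_pos: list[float], rng: random.Random) -> str:
    """Copy `template`, substituting each base with the probability given
    for its position in `sub_rate_by_pos` (same length as the template)."""
    if len(sub_rate_by_pos) != len(template):
        raise ValueError("profile length must match template length")
    out = []
    for pos, base in enumerate(template):
        if rng.random() < sub_rate_by_pos[pos]:
            # Substitute with one of the other three bases, chosen uniformly.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)
```

A profile whose rates rise toward the end of the read mimics the commonly observed drop in base-call accuracy at later sequencing cycles.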
Affiliation(s)
- Zhenhua Yu
- School of Information Engineering, Ningxia University, Yinchuan, 750021, China.
| | - Fang Du
- School of Information Engineering, Ningxia University, Yinchuan, 750021, China
| | - Rongjun Ban
- Hefei National Laboratory for Physical Sciences at Microscale, USTC-SJH Joint Center for Human Reproduction and Genetics, School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China
| | - Yuanwei Zhang
- Hefei National Laboratory for Physical Sciences at Microscale, USTC-SJH Joint Center for Human Reproduction and Genetics, School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China.
|
22
|
Yang LN, Liu H, Duan GH, Huang YM, Liu S, Fang ZG, Wu EJ, Shang L, Zhan J. The Phytophthora infestans AVR2 Effector Escapes R2 Recognition Through Effector Disordering. MOLECULAR PLANT-MICROBE INTERACTIONS : MPMI 2020; 33:921-931. [PMID: 32212906 DOI: 10.1094/mpmi-07-19-0179-r] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 05/27/2023]
Abstract
Intrinsic disorder is a common structural characteristic of proteins and a central player in the biochemical processes of species. However, the role of intrinsic disorder in the evolution of plant-pathogen interactions is rarely investigated. Here, we explored the role of intrinsic disorder in the development of the pathogenicity in the RXLR AVR2 effector of Phytophthora infestans. We found AVR2 exhibited high nucleotide diversity generated by point mutation, early-termination, altered start codon, deletion/insertion, and intragenic recombination and is predicted to be an intrinsically disordered protein. AVR2 amino acid sequences conferring a virulent phenotype had a higher disorder tendency in both the N- and C-terminal regions compared with sequences conferring an avirulent phenotype. In addition, we also found virulent AVR2 mutants gained one or two short linear interaction motifs, the critical components of disordered proteins required for protein-protein interactions. Furthermore, virulent AVR2 mutants were predicted to be unstable and have a short protein half-life. Taken together, these results support the notion that intrinsic disorder is important for the effector function of pathogens and demonstrate that SLiM-mediated protein-protein interaction in the C-terminal effector domain might contribute greatly to the evasion of resistance-protein detection in P. infestans.
Affiliation(s)
- Li-Na Yang
- Key Lab for Biopesticide and Chemical Biology, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou, China
- Fujian Key Laboratory of Plant Virology, Institute of Plant Virology, Fujian Agricultural and Forestry University, Fuzhou, Fujian 350002, China
| | - Hao Liu
- Key Lab for Biopesticide and Chemical Biology, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou, China
- Fujian Key Laboratory of Plant Virology, Institute of Plant Virology, Fujian Agricultural and Forestry University, Fuzhou, Fujian 350002, China
| | - Guo-Hua Duan
- Key Lab for Biopesticide and Chemical Biology, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou, China
- Fujian Key Laboratory of Plant Virology, Institute of Plant Virology, Fujian Agricultural and Forestry University, Fuzhou, Fujian 350002, China
| | - Yan-Mei Huang
- Key Lab for Biopesticide and Chemical Biology, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou, China
- Fujian Key Laboratory of Plant Virology, Institute of Plant Virology, Fujian Agricultural and Forestry University, Fuzhou, Fujian 350002, China
| | - Shiting Liu
- Key Lab for Biopesticide and Chemical Biology, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou, China
- Fujian Key Laboratory of Plant Virology, Institute of Plant Virology, Fujian Agricultural and Forestry University, Fuzhou, Fujian 350002, China
| | - Zhi-Guo Fang
- Xiangyang Academy of Agricultural Sciences, Xiangyang 441057, Hubei, China
| | - E-Jiao Wu
- Key Lab for Biopesticide and Chemical Biology, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou, China
- Fujian Key Laboratory of Plant Virology, Institute of Plant Virology, Fujian Agricultural and Forestry University, Fuzhou, Fujian 350002, China
| | - Liping Shang
- Key Lab for Biopesticide and Chemical Biology, Ministry of Education, Fujian Agriculture and Forestry University, Fuzhou, China
- Fujian Key Laboratory of Plant Virology, Institute of Plant Virology, Fujian Agricultural and Forestry University, Fuzhou, Fujian 350002, China
- Jiasui Zhan
- State Key Laboratory of Ecological Pest Control for Fujian and Taiwan Crops, Fujian Agriculture and Forestry University, Fuzhou, China
- Department of Forest Mycology and Plant Pathology, Swedish University of Agricultural Sciences, Uppsala, Sweden
23
Salmaninejad A, Motaee J, Farjami M, Alimardani M, Esmaeilie A, Pasdar A. Next-generation sequencing and its application in diagnosis of retinitis pigmentosa. Ophthalmic Genet 2020; 40:393-402. [PMID: 31755340 DOI: 10.1080/13816810.2019.1675178] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/17/2022]
Abstract
Retinitis pigmentosa (RP) is a major cause of heritable human blindness and shows high genetic heterogeneity. It is characterized by the initial degeneration of rod photoreceptors, followed by that of cone photoreceptors. RP is also a prominent cause of visual impairment, with a global prevalence of 1:4000. RP usually presents with nyctalopia in puberty, followed by concentric visual field loss, which reflects the primary impairment of rod photoreceptors; later in life, as the disease progresses, cone dysfunction also leads to loss of central vision. A precise molecular diagnosis is crucial for disease characterization and clinical prognosis. DNA sequencing is a powerful tool for deciphering the causes of human diseases. The arrival of next-generation sequencing (NGS) technologies has reduced sequencing cost and considerably increased throughput, making whole-genome sequencing (WGS) a feasible way of obtaining comprehensive genomic data and reaching more precise clinical decisions. Nevertheless, the advantages of NGS technologies come with a number of challenges that must be adequately addressed before the technique can move from a research tool to a useful method in routine clinical practice. This article provides an overview of NGS technology and its related platforms. The challenges in data analysis and in choosing an appropriate NGS method, as well as potential applications in clinical diagnosis, are also discussed. The merit of the technique is reflected in recent studies showing that NGS and molecular information can support clinical diagnosis, suggest potential treatment options or changes, and inform up-to-date family counseling and management.
Affiliation(s)
- Arash Salmaninejad
- Department of Medical Genetics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran; Student Research Committee, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Jamshid Motaee
- Department of Medical Genetics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran; Student Research Committee, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Mahsa Farjami
- Department of Medical Genetics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran; Student Research Committee, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Maliheh Alimardani
- Department of Medical Genetics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran; Student Research Committee, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Alireza Pasdar
- Department of Medical Genetics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran; Bioinformatics Research Group, Mashhad University of Medical Sciences, Mashhad, Iran; Division of Applied Medicine, Medical School, University of Aberdeen, Foresterhill, Aberdeen, UK
24
Branco GP, Valieris R, Povoa LV, Araújo LFD, Fernandes GR, Souza JESD, Amorim MGD, Ferreira ENE, Silva ITD, Nunes DN, Dias-Neto E. A comparison between SOLiD 5500XL and Ion Torrent PGM-derived miRNA expression profiles in two breast cell lines. Genet Mol Biol 2020; 43:e20180351. [PMID: 32352476 PMCID: PMC7201575 DOI: 10.1590/1678-4685-gmb-2018-0351] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/29/2018] [Accepted: 06/06/2019] [Indexed: 11/22/2022] Open
Abstract
Next-generation sequencing (NGS) platforms allow the analysis of hundreds of millions of molecules in a single sequencing run, revolutionizing many research areas. NGS-based microRNA studies enable expression quantification at an unprecedented scale without the limitations of closed platforms. Yet, although a massive amount of data produced by these platforms is available, comparisons of quantification/discovery capabilities between platforms are still lacking. Here we compare two NGS platforms, SOLiD and PGM, by evaluating their microRNA identification/quantification capabilities using two breast-derived cell lines. A high expression correlation (R² > 0.9) was achieved, encompassing 97% of the miRNAs, and the few discrepancies in miRNA counts were attributable to molecules that have very low expression. Quantification divergences indicative of artefactual representation were seen for 14 miRNAs (higher in SOLiD reads) and another 10 miRNAs more abundant in PGM data. An inspection of these revealed an increased and statistically significant count of uracils and uracil stretches for PGM-enriched miRNAs, compared to SOLiD and to miRBase. In parallel, adenines and adenine stretches were enriched in SOLiD-derived miRNA reads. We conclude that, whereas both platforms are overall consistent and can be used interchangeably for microRNA expression studies, particular sequence features appear to be indicative of specific platform bias, and their presence in microRNAs should be considered in database analyses.
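The two checks this abstract describes, cross-platform expression correlation and base-stretch counting, can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the read counts, the log2(count + 1) transform, and the function names `r_squared` and `longest_run` are all assumptions made for the example.

```python
import math

def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

def longest_run(seq, base):
    """Length of the longest uninterrupted stretch of `base` in `seq`."""
    best = current = 0
    for ch in seq:
        current = current + 1 if ch == base else 0
        best = max(best, current)
    return best

# Hypothetical read counts for the same five miRNAs on the two platforms.
solid_counts = [1200, 30, 455, 8, 9800]
pgm_counts = [1100, 41, 500, 5, 10400]

# Log-transform so that a few highly expressed miRNAs do not dominate the fit.
log_solid = [math.log2(c + 1) for c in solid_counts]
log_pgm = [math.log2(c + 1) for c in pgm_counts]

print("R^2 between platforms:", round(r_squared(log_solid, log_pgm), 3))
print("longest U stretch:", longest_run("UGAGGUAGUAGGUUGUAUAGUU", "U"))
```

Comparing `longest_run` values between the miRNAs over-represented on each platform is one simple way to look for the kind of sequence-dependent bias the abstract reports.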
Affiliation(s)
- Renan Valieris
- A.C.Camargo Cancer Center, Laboratório de Biologia Computacional, São Paulo, SP, Brazil
- Lucas Venezian Povoa
- A.C.Camargo Cancer Center, Laboratório de Biologia Computacional, São Paulo, SP, Brazil; Instituto Tecnológico de Aeronáutica, Divisão de Ciências Computacionais, Grupo de Inteligência Artificial e Robótica, São José dos Campos, SP, Brazil; Instituto Federal de Educação, Ciência e Tecnologia de São Paulo, Caraguatatuba, SP, Brazil
- Maria Galli de Amorim
- A.C.Camargo Cancer Center, Laboratório de Genômica Médica, CIPE, São Paulo, SP, Brazil
- Elisa Napolitano E Ferreira
- A.C.Camargo Cancer Center, Laboratório de Genômica e Biologia, CIPE, São Paulo, SP, Brazil; Grupo Fleury Pesquisa e Desenvolvimento, São Paulo, SP, Brazil
- Israel Tojal da Silva
- A.C.Camargo Cancer Center, Laboratório de Biologia Computacional, São Paulo, SP, Brazil
- Diana Noronha Nunes
- A.C.Camargo Cancer Center, Laboratório de Genômica Médica, CIPE, São Paulo, SP, Brazil
- Emmanuel Dias-Neto
- A.C.Camargo Cancer Center, Laboratório de Genômica Médica, CIPE, São Paulo, SP, Brazil; Universidade de São Paulo, Faculdade de Medicina, Departamento & Instituto de Psiquiatria, Laboratório de Neurociências Alzira Denise Hertzog Silva (LIM-27), São Paulo, SP, Brazil
25
Mitchell K, Brito JJ, Mandric I, Wu Q, Knyazev S, Chang S, Martin LS, Karlsberg A, Gerasimov E, Littman R, Hill BL, Wu NC, Yang HT, Hsieh K, Chen L, Littman E, Shabani T, Enik G, Yao D, Sun R, Schroeder J, Eskin E, Zelikovsky A, Skums P, Pop M, Mangul S. Benchmarking of computational error-correction methods for next-generation sequencing data. Genome Biol 2020; 21:71. [PMID: 32183840 PMCID: PMC7079412 DOI: 10.1186/s13059-020-01988-3] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 05/18/2019] [Accepted: 03/06/2020] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown. RESULTS In this paper, we evaluate the ability of error correction algorithms to fix errors across different types of datasets that contain various levels of heterogeneity. We highlight the advantages and limitations of computational error correction techniques across different domains of biology, including immunogenomics and virology. To demonstrate the efficacy of our technique, we apply the UMI-based high-fidelity sequencing protocol to eliminate sequencing errors from both simulated data and the raw reads. We then perform a realistic evaluation of error-correction methods. CONCLUSIONS In terms of accuracy, we find that method performance varies substantially across different types of datasets with no single method performing best on all types of examined data. Finally, we also identify the techniques that offer a good balance between precision and sensitivity.
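To make the precision/sensitivity trade-off mentioned above concrete, the sketch below classifies each base of a corrected read against a known true read. It is a simplified illustration that handles substitutions only (reads of equal length); the function name `correction_metrics` and the exact category definitions are assumptions for this example and may differ from those used in the benchmark itself.

```python
def correction_metrics(true_read, raw_read, corrected_read):
    """Base-level precision and sensitivity of an error-correction result.

    TP: an erroneous base was fixed.
    FP: a correct base was changed into an error.
    FN: an erroneous base is still wrong after correction.
    """
    if not (len(true_read) == len(raw_read) == len(corrected_read)):
        raise ValueError("this sketch handles substitutions only; lengths must match")
    tp = fp = fn = 0
    for t, r, c in zip(true_read, raw_read, corrected_read):
        if r != t and c == t:
            tp += 1
        elif r == t and c != t:
            fp += 1
        elif r != t and c != t:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "sensitivity": sensitivity}

# Hypothetical reads: two sequencing errors in the raw read, one of which is
# fixed, one of which is missed, plus one new error introduced by correction.
metrics = correction_metrics("ACGTACGT", "ACGAACGA", "ACGTTCGA")
print(metrics)
```

A tool that changes few bases tends to score high on precision and low on sensitivity, and an aggressive tool the reverse, which is why the two are reported together.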
Affiliation(s)
- Keith Mitchell
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Jaqueline J Brito
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
- Igor Mandric
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- Qiaozhen Wu
- Department of Mathematics, University of California Los Angeles, 520 Portola Plaza, Los Angeles, CA, 90095, USA
- Sergey Knyazev
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- Sei Chang
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Lana S Martin
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
- Aaron Karlsberg
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA
- Ekaterina Gerasimov
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- Russell Littman
- UCLA Bioinformatics, 621 Charles E Young Dr S, Los Angeles, CA, 90024, USA
- Brian L Hill
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Nicholas C Wu
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Harry Taegyun Yang
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Kevin Hsieh
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Linus Chen
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Eli Littman
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Taylor Shabani
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- German Enik
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Douglas Yao
- Department of Molecular, Cell, and Developmental Biology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
- Ren Sun
- Department of Molecular and Medical Pharmacology, University of California Los Angeles, 650 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
- Jan Schroeder
- Epigenetics & Reprogramming Laboratory, Monash University, 15 Innovation Walk, Melbourne, VIC, 3800, Australia
- Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, 404 Westwood Plaza, Los Angeles, CA, 90095, USA
- Alex Zelikovsky
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, Russia, 119991
- Pavel Skums
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA, 30303, USA
- Mihai Pop
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20742, USA
- Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA, 90089, USA.
26
Pérez-Losada M, Arenas M, Galán JC, Bracho MA, Hillung J, García-González N, González-Candelas F. High-throughput sequencing (HTS) for the analysis of viral populations. Infect Genet Evol 2020; 80:104208. [PMID: 32001386 DOI: 10.1016/j.meegid.2020.104208] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 09/30/2019] [Revised: 01/21/2020] [Accepted: 01/24/2020] [Indexed: 12/12/2022]
Abstract
The development of High-Throughput Sequencing (HTS) technologies is having a major impact on the genomic analysis of viral populations. Current HTS platforms can capture nucleic acid variation across millions of genes for both selected amplicons and full viral genomes. HTS has already facilitated the discovery of new viruses, hinted new taxonomic classifications and provided a deeper and broader understanding of their diversity, population and genetic structure. Hence, HTS has already replaced standard Sanger sequencing in basic and applied research fields, but the next step is its implementation as a routine technology for the analysis of viruses in clinical settings. The most likely application of this implementation will be the analysis of viral genomics, because the huge population sizes, high mutation rates and very fast replacement of viral populations have demonstrated the limited information obtained with Sanger technology. In this review, we describe new technologies and provide guidelines for the high-throughput sequencing and genetic and evolutionary analyses of viral populations and metaviromes, including software applications. With the development of new HTS technologies, new and refurbished molecular and bioinformatic tools are also constantly being developed to process and integrate HTS data. These allow assembling viral genomes and inferring viral population diversity and dynamics. Finally, we also present several applications of these approaches to the analysis of viral clinical samples including transmission clusters and outbreak characterization.
Affiliation(s)
- Marcos Pérez-Losada
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão 4485-661, Portugal
- Miguel Arenas
- Department of Biochemistry, Genetics and Immunology, University of Vigo, 36310 Vigo, Spain; Biomedical Research Center (CINBIO), University of Vigo, 36310 Vigo, Spain.
- Juan Carlos Galán
- Microbiology Service, Hospital Ramón y Cajal, Madrid, Spain; CIBER in Epidemiology and Public Health, Spain.
- Mª Alma Bracho
- CIBER in Epidemiology and Public Health, Spain; Joint Research Unit "Infection and Public Health" FISABIO-University of Valencia, Valencia, Spain.
- Julia Hillung
- Joint Research Unit "Infection and Public Health" FISABIO-University of Valencia, Valencia, Spain; Institute for Integrative Systems Biology (I2SysBio), CSIC-University of Valencia, Valencia, Spain.
- Neris García-González
- Joint Research Unit "Infection and Public Health" FISABIO-University of Valencia, Valencia, Spain; Institute for Integrative Systems Biology (I2SysBio), CSIC-University of Valencia, Valencia, Spain.
- Fernando González-Candelas
- CIBER in Epidemiology and Public Health, Spain; Joint Research Unit "Infection and Public Health" FISABIO-University of Valencia, Valencia, Spain; Institute for Integrative Systems Biology (I2SysBio), CSIC-University of Valencia, Valencia, Spain.
27
Das AK, Goswami S, Lee K, Park SJ. A hybrid and scalable error correction algorithm for indel and substitution errors of long reads. BMC Genomics 2019; 20:948. [PMID: 31856721 PMCID: PMC6923905 DOI: 10.1186/s12864-019-6286-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, computation on long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads. METHODS In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high-throughput Illumina short-read sequences to rectify the PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low- and high-coverage regions, followed by majority voting to rectify each substituted error base. RESULTS ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation results demonstrate that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% of the bases of an E. coli PacBio dataset with the reference genome, proving its accuracy. CONCLUSION ParLECH can scale to over terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.
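The "widest path (or maximum min-coverage path)" named in the METHODS can be illustrated with a small single-machine sketch: among all paths between two k-mers, pick the one whose weakest edge has the highest coverage. The code below uses a max-heap variant of Dijkstra's algorithm on a toy graph. It is not ParLECH's distributed implementation; the graph, the coverage values, and the function name `widest_path` are hypothetical.

```python
import heapq

def widest_path(graph, source, target):
    """Return (bottleneck, path) maximising the minimum edge coverage.

    `graph` maps a node to a list of (neighbour, coverage) pairs, with
    coverage assumed to be a positive number.
    """
    best = {source: float("inf")}   # best bottleneck found so far per node
    parent = {source: None}
    heap = [(-float("inf"), source)]  # negate widths to get a max-heap
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if node == target:
            break
        if width < best.get(node, 0):
            continue  # stale heap entry, a wider route was already found
        for neighbour, coverage in graph.get(node, []):
            candidate = min(width, coverage)
            if candidate > best.get(neighbour, 0):
                best[neighbour] = candidate
                parent[neighbour] = node
                heapq.heappush(heap, (-candidate, neighbour))
    if target not in best:
        return 0, []
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[target], path[::-1]

# Toy de Bruijn graph over 3-mers; edge weights stand in for short-read coverage.
toy_graph = {
    "ACG": [("CGT", 12), ("CGA", 3)],
    "CGT": [("GTA", 10)],
    "GTA": [("TAC", 11)],
    "CGA": [("GAT", 30)],
    "GAT": [("ATA", 25)],
    "ATA": [("TAC", 28)],
}
print(widest_path(toy_graph, "ACG", "TAC"))
```

In the toy graph the lower branch has higher coverage on most edges, but its first edge (coverage 3) is a bottleneck, so the upper branch with bottleneck 10 is preferred. This reflects the intuition that a single poorly supported k-mer transition makes a whole candidate replacement unreliable.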
Affiliation(s)
- Arghya Kusum Das
- Department of Computer Science and Software Engineering, University of Wisconsin at Platteville, Platteville, WI USA
- Sayan Goswami
- School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
- Kisung Lee
- School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
- Seung-Jong Park
- School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
28
Canedo-Téxon A, Ramón-Farias F, Monribot-Villanueva JL, Villafán E, Alonso-Sánchez A, Pérez-Torres CA, Ángeles G, Guerrero-Analco JA, Ibarra-Laclette E. Novel findings to the biosynthetic pathway of magnoflorine and taspine through transcriptomic and metabolomic analysis of Croton draco (Euphorbiaceae). BMC Plant Biol 2019; 19:560. [PMID: 31852435 PMCID: PMC6921603 DOI: 10.1186/s12870-019-2195-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 03/13/2019] [Accepted: 12/10/2019] [Indexed: 05/25/2023]
Abstract
BACKGROUND Croton draco is an arboreal species and its latex as well as some other parts of the plant, are traditionally used in the treatment of a wide range of ailments and diseases. Alkaloids, such as magnoflorine, prevent early atherosclerosis progression while taspine, an abundant constituent of latex, has been described as a wound-healer and antitumor-agent. Despite the great interest for these and other secondary metabolites, no omics resources existed for the species and the biosynthetic pathways of these alkaloids remain largely unknown. RESULTS To gain insights into the pathways involved in magnoflorine and taspine biosynthesis by C. draco and identify the key enzymes in these processes, we performed an integrated analysis of the transcriptome and metabolome in the major organs (roots, stem, leaves, inflorescences, and flowers) of this species. Transcript profiles were generated through high-throughput RNA-sequencing analysis while targeted and high resolution untargeted metabolomic profiling was also performed. The biosynthesis of these compounds appears to occur in the plant organs examined, but intermediaries may be translocated from the cells in which they are produced to other cells in which they accumulate. CONCLUSIONS Our results provide a framework to better understand magnoflorine and taspine biosynthesis in C. draco. In addition, we demonstrate the potential of multi-omics approaches to identify candidate genes involved in the biosynthetic pathways of interest.
Affiliation(s)
- Anahí Canedo-Téxon
- Instituto de Ecología A.C., Red de Estudios Moleculares Avanzados, 91070 Xalapa, Veracruz, México
- Feliza Ramón-Farias
- Universidad Veracruzana (Campus Peñuela-Córdoba), Amatlán de los Reyes, 94945 Veracruz, México
- Emanuel Villafán
- Instituto de Ecología A.C., Red de Estudios Moleculares Avanzados, 91070 Xalapa, Veracruz, México
- Alexandro Alonso-Sánchez
- Instituto de Ecología A.C., Red de Estudios Moleculares Avanzados, 91070 Xalapa, Veracruz, México
- Claudia Anahí Pérez-Torres
- Instituto de Ecología A.C., Red de Estudios Moleculares Avanzados, 91070 Xalapa, Veracruz, México
- Catedrático CONACyT en el Instituto de Ecología A.C, Veracruz, México
- Guillermo Ángeles
- Instituto de Ecología A.C., Red de Ecología Funcional, 91070 Xalapa, Veracruz, México
- Enrique Ibarra-Laclette
- Instituto de Ecología A.C., Red de Estudios Moleculares Avanzados, 91070 Xalapa, Veracruz, México
29
Mittal P, Jaiswal SK, Vijay N, Saxena R, Sharma VK. Comparative analysis of corrected tiger genome provides clues to its neuronal evolution. Sci Rep 2019; 9:18459. [PMID: 31804567 PMCID: PMC6895189 DOI: 10.1038/s41598-019-54838-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 03/14/2019] [Accepted: 11/14/2019] [Indexed: 01/01/2023] Open
Abstract
The availability of completed and draft genome assemblies of tiger, leopard, and other felids provides an opportunity to gain comparative insights on their unique evolutionary adaptations. However, genome-wide comparative analyses are susceptible to errors in genome sequences and thus require accurate genome assemblies for reliable evolutionary insights. In this study, while analyzing the tiger genome, we found almost one million erroneous substitutions in the coding and non-coding region of the genome affecting 4,472 genes, hence, biasing the current understanding of tiger evolution. Moreover, these errors produced several misleading observations in previous studies. Thus, to gain insights into the tiger evolution, we corrected the erroneous bases in the genome assembly and gene set of tiger using ‘SeqBug’ approach developed in this study. We sequenced the first Bengal tiger genome and transcriptome from India to validate these corrections. A comprehensive evolutionary analysis was performed using 10,920 orthologs from nine mammalian species including the corrected gene sets of tiger and leopard and using five different methods at three hierarchical levels, i.e. felids, Panthera, and tiger. The unique genetic changes in tiger revealed that the genes showing signatures of adaptation in tiger were enriched in development and neuronal functioning. Specifically, the genes belonging to the Notch signalling pathway, which is among the most conserved pathways involved in embryonic and neuronal development, were found to have significantly diverged in tiger in comparison to the other mammals. Our findings suggest the role of adaptive evolution in neuronal functions and development processes, which correlates well with the presence of exceptional traits such as sensory perception, strong neuro-muscular coordination, and hypercarnivorous behaviour in tiger.
Affiliation(s)
- Parul Mittal
- Metaomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Bhopal, India
- Shubham K Jaiswal
- Metaomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Bhopal, India
- Nagarjun Vijay
- Computational Evolutionary Genomics Lab, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Bhopal, India
- Rituja Saxena
- Metaomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Bhopal, India
- Vineet K Sharma
- Metaomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research Bhopal, Bhopal, India.
30
Marchet C, Morisse P, Lecompte L, Lefebvre A, Lecroq T, Peterlongo P, Limasset A. ELECTOR: evaluator for long reads correction methods. NAR Genom Bioinform 2019; 2:lqz015. [PMID: 33575566 PMCID: PMC7671326 DOI: 10.1093/nargab/lqz015] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 08/05/2019] [Revised: 09/24/2019] [Accepted: 10/16/2019] [Indexed: 12/19/2022] Open
Abstract
The error rates of third-generation sequencing data have remained above 5%, with errors consisting mainly of insertions and deletions. Consequently, an increasing number of diverse long-read correction methods have been proposed. The quality of the correction has a huge impact on downstream processes. Therefore, developing methods that evaluate error correction tools with precise and reliable statistics is a crucial need. These evaluation methods rely on costly alignments to evaluate the quality of the corrected reads. Thus, key features must allow the fast comparison of different tools and scale to the increasing length of long reads. Our tool, ELECTOR, evaluates long-read correction and is directly compatible with a wide range of error correction tools. As it is based on multiple sequence alignment, we introduce a new algorithmic strategy for alignment segmentation, which enables us to scale to large instances using reasonable resources. To our knowledge, we provide the only method that produces reproducible correction benchmarks on the latest ultra-long reads (>100 k bases). It is also faster than the current state-of-the-art on other datasets and provides a wider set of metrics to assess the read quality improvement after correction. ELECTOR is available on GitHub (https://github.com/kamimrcht/ELECTOR) and Bioconda.
Affiliation(s)
- Camille Marchet
- Univ Rennes, CNRS, Inria, IRISA-UMR 6074, F-35000 Rennes, France; Univ. Lille, CNRS, UMR 9189 - CRIStAL, 59655 Villeneuve-d'Ascq, France
- Pierre Morisse
- Normandie Université, UNIROUEN, INSA Rouen, LITIS, 76000 Rouen, France
- Lolita Lecompte
- Univ Rennes, CNRS, Inria, IRISA-UMR 6074, F-35000 Rennes, France
- Antoine Limasset
- Univ. Lille, CNRS, UMR 9189 - CRIStAL, 59655 Villeneuve-d'Ascq, France
31
Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models. Sci Rep 2019; 9:16157. [PMID: 31695060 PMCID: PMC6834855 DOI: 10.1038/s41598-019-52196-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/01/2019] [Accepted: 10/07/2019] [Indexed: 01/30/2023] Open
Abstract
The performance of most error-correction (EC) algorithms that operate on genomics reads depends on the proper choice of their configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction and consequently improve genome assembly. We perform this in an adaptive manner, adapted to different datasets and to EC tools, because different configuration parameters are optimal for different datasets, i.e., from different platforms and species, and vary with the EC algorithm being applied. We use language modeling techniques from the Natural Language Processing (NLP) domain in our algorithmic suite, Athena, to automatically tune the performance-sensitive configuration parameters. Through the use of N-Gram and Recurrent Neural Network (RNN) language modeling, we validate the intuition that the EC performance can be computed quantitatively and efficiently using the “perplexity” metric, repurposed from NLP. After training the language model, we show that the perplexity metric calculated from a sample of the test (or production) data has a strong negative correlation with the quality of error correction of erroneous NGS reads. Therefore, we use the perplexity metric to guide a hill climbing-based search, converging toward the best configuration parameter value. Our approach is suitable for both de novo and comparative sequencing (resequencing), eliminating the need for a reference genome to serve as the ground truth. We find that Athena can automatically find the optimal value of k with a very high accuracy for 7 real datasets and using 3 different k-mer based EC algorithms, Lighter, Blue, and Racer.
The inverse relation between the perplexity metric and alignment rate exists under all our tested conditions: for real and synthetic datasets, for all kinds of sequencing errors (insertion, deletion, and substitution), and for high and low error rates. The absolute value of that correlation is at least 73%. In our experiments, the best value of k found by Athena achieves an alignment rate within 0.53% of the oracle best value of k found through brute force searching (i.e., scanning through the entire range of k values). Athena’s selected value of k lies within the top-3 best k values using N-Gram models and the top-5 best k values using RNN models. With the best parameter selection by Athena, the assembly quality (NG50) is improved by a geometric mean of 4.72X across the 7 real datasets.
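The idea of scoring corrected reads by perplexity and hill climbing over k can be sketched as follows. This is a rough illustration under stated assumptions, not Athena's code: the add-one-smoothed character N-gram model, the function names, and the step size are choices made for the example, and the scoring function passed to `hill_climb_k` is a stand-in for "run an EC tool with this k, then compute the perplexity of its output".

```python
import math
from collections import Counter

def train_ngram(reads, n):
    """Count n-grams and their (n-1)-base contexts over a set of reads."""
    ngrams, contexts = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            ngrams[read[i:i + n]] += 1
            contexts[read[i:i + n - 1]] += 1
    return ngrams, contexts

def perplexity(reads, ngrams, contexts, n, alphabet_size=4):
    """Perplexity of `reads` under the add-one-smoothed n-gram model."""
    log_prob, count = 0.0, 0
    for read in reads:
        for i in range(len(read) - n + 1):
            gram = read[i:i + n]
            p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + alphabet_size)
            log_prob += math.log(p)
            count += 1
    return math.exp(-log_prob / count) if count else float("inf")

def hill_climb_k(score, k_start, k_min, k_max, step=2):
    """Move k toward lower `score(k)` until neither neighbour improves."""
    k, best = k_start, score(k_start)
    while True:
        neighbours = [c for c in (k - step, k + step) if k_min <= c <= k_max]
        if not neighbours:
            return k
        candidate = min(neighbours, key=score)
        candidate_score = score(candidate)
        if candidate_score >= best:
            return k
        k, best = candidate, candidate_score

# Toy model trained on one read; a read resembling the training data scores
# a lower perplexity than an unrelated one.
model = train_ngram(["ACGTACGT"], n=2)
print(perplexity(["ACGTACGT"], *model, n=2))
print(perplexity(["AAAAAAAA"], *model, n=2))

# Hypothetical stand-in for "perplexity of the reads corrected with this k".
toy_score = lambda k: (k - 21) ** 2 + 5
print(hill_climb_k(toy_score, k_start=15, k_min=11, k_max=31))
```

Because the search only needs a score per candidate k, no reference genome is involved, which matches the abstract's point about avoiding a ground truth. A real scoring function would be expensive, so caching `score(k)` results would be a sensible addition.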
32
|
Abstract
Immune repertoire is a collection of enormously diverse adaptive immune cells within an individual. As the repertoire shapes and represents immunological conditions, identification of clones and characterization of diversity are critical for understanding how to protect ourselves against various illness such as infectious diseases and cancers. Over the past several years, fast growing technologies for high throughput sequencing have facilitated rapid advancement of repertoire research, enabling us to observe the diversity of repertoire at an unprecedented level. Here, we focus on B cell receptor (BCR) repertoire and review approaches to B cell isolation and sequencing library construction. These experiments should be carefully designed according to BCR regions to be interrogated, such as heavy chain full length, complementarity determining regions, and isotypes. We also highlight preprocessing steps to remove sequencing and PCR errors with unique molecular index and bioinformatics techniques. Due to the nature of massive sequence variation in BCR, caution is warranted when interpreting repertoire diversity from error-prone sequencing data. Furthermore, we provide a summary of statistical frameworks and bioinformatics tools for clonal evolution and diversity. Finally, we discuss limitations of current BCR-seq technologies and future perspectives on advances in repertoire sequencing.
Affiliation(s)
- Daeun Kim
- Department of Biological Sciences, College of Natural Sciences, Ajou University, Suwon 16499, Korea
- Daechan Park
- Department of Biological Sciences, College of Natural Sciences, Ajou University, Suwon 16499, Korea
33
|
Misassembly of long reads undermines de novo-assembled ethnicity-specific genomes: validation in a Chinese Han population. Hum Genet 2019; 138:757-769. [DOI: 10.1007/s00439-019-02032-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 11/07/2018] [Accepted: 05/21/2019] [Indexed: 01/05/2023]
|
34
|
|
35
|
Rosani U, Young T, Bai CM, Alfaro AC, Venier P. Dual Analysis of Virus-Host Interactions: The Case of Ostreid herpesvirus 1 and the Cupped Oyster Crassostrea gigas. Evol Bioinform Online 2019; 15:1176934319831305. [PMID: 30828244 PMCID: PMC6388457 DOI: 10.1177/1176934319831305] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 11/22/2018] [Accepted: 01/14/2019] [Indexed: 12/20/2022]
Abstract
Dual analyses of the interactions between Ostreid herpesvirus 1 (OsHV-1) and the bivalve Crassostrea gigas during infection can unveil events critical to the onset and progression of this viral disease and can provide novel strategies for mitigating and preventing oyster mortality. Among the currently used “omics” technologies, dual transcriptomics (dual RNA-seq) coupled with the analysis of viral DNA in the host tissues has greatly advanced the knowledge of the genes and pathways that contribute most to host defense responses, the expression profiles of annotated and unknown OsHV-1 open reading frames (ORFs), and viral genome variability. In addition to dual RNA-seq, proteomics and metabolomics analyses have the potential to add the complementary information needed to understand how a malacoherpesvirus can redirect and exploit the vital processes of its host. This review explores our current knowledge of “omics” technologies in the study of host-pathogen interactions and highlights relevant applications of these fields of expertise to the complex case of C. gigas infections by OsHV-1, which currently threaten the mollusk production sector worldwide.
Affiliation(s)
- Umberto Rosani
- Department of Biology, University of Padova, Padova, Italy
- Tim Young
- Aquaculture Biotechnology Research Group, School of Science, Faculty of Health and Environmental Sciences, Auckland University of Technology, Auckland, New Zealand
- Chang-Ming Bai
- Key Laboratory of Maricultural Organism Disease Control, Ministry of Agriculture, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, China
- Andrea C Alfaro
- Aquaculture Biotechnology Research Group, School of Science, Faculty of Health and Environmental Sciences, Auckland University of Technology, Auckland, New Zealand
- Paola Venier
- Department of Biology, University of Padova, Padova, Italy
36
|
Limasset A, Flot JF, Peterlongo P. Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs. Bioinformatics 2019; 36:1374-1381. [DOI: 10.1093/bioinformatics/btz102] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 07/23/2018] [Revised: 01/07/2019] [Accepted: 02/18/2019] [Indexed: 12/25/2022]
Abstract
Motivation
Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large datasets or consider reads as mere suites of k-mers, without taking into account their full-length sequence information.
Results
We propose a new method to correct short reads using de Bruijn graphs and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance then of unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond.
Availability and implementation
The implementation is open source, available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package.
Supplementary information
Supplementary data are available at Bioinformatics online.
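The first filtering step described under Results, discarding low-abundance k-mers before building the graph, can be sketched as below. This is a toy illustration only: the threshold, the function names, and the omission of reverse complements, unitig compaction, unitig-abundance filtering, and read mapping are simplifying assumptions, not Bcool's actual code.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def solid_kmers(reads, k, min_abundance):
    """Keep k-mers seen at least min_abundance times; rare ones are likely errors."""
    return {km for km, c in count_kmers(reads, k).items() if c >= min_abundance}

def de_bruijn_edges(kmers):
    """Connect k-mers whose (k-1)-suffix equals another k-mer's (k-1)-prefix."""
    by_prefix = {}
    for km in kmers:
        by_prefix.setdefault(km[:-1], []).append(km)
    return sorted((a, b) for a in kmers for b in by_prefix.get(a[1:], []))

reads = ["ACGTAC", "ACGTAC", "ACGTTC"]  # third read carries one substitution
kept = solid_kmers(reads, k=4, min_abundance=2)
edges = de_bruijn_edges(kept)
```

Here the k-mers introduced by the erroneous read (`CGTT`, `GTTC`) occur only once and are dropped, so the remaining graph spells the error-free sequence.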
Affiliation(s)
- Antoine Limasset
- Evolutionary Biology & Ecology, Université Libre de Bruxelles (ULB), Bruxelles, Belgium
- Jean-François Flot
- Evolutionary Biology & Ecology, Université Libre de Bruxelles (ULB), Bruxelles, Belgium
- Interuniversity Institute of Bioinformatics in Brussels – (IB)², Brussels, Belgium
37
|
Fu S, Wang A, Au KF. A comparative evaluation of hybrid error correction methods for error-prone long reads. Genome Biol 2019; 20:26. [PMID: 30717772 PMCID: PMC6362602 DOI: 10.1186/s13059-018-1605-z] [Citation(s) in RCA: 73] [Impact Index Per Article: 14.6] [Received: 03/20/2018] [Accepted: 12/05/2018] [Indexed: 12/20/2022]
Abstract
Background: Third-generation sequencing technologies have advanced biological research by generating reads that are substantially longer than those of second-generation sequencing technologies. However, their notoriously high error rate impedes straightforward data analysis and limits their application. A handful of error correction methods for these error-prone long reads have been developed to date. The output data quality is very important for downstream analysis, whereas computing resources could limit the utility of some computing-intense tools. There is a lack of standardized assessments for these long-read error-correction methods. Results: Here, we present a comparative performance assessment of ten state-of-the-art error-correction methods for long reads. We established a common set of benchmarks for performance assessment, including sensitivity, accuracy, output rate, alignment rate, output read length, run time, and memory usage, as well as the effects of error correction on two downstream applications of long reads: de novo assembly and resolving haplotype sequences. Conclusions: Taking into account all of these metrics, we provide a suggestive guideline for method choice based on available data size, computing resources, and individual research goals. Electronic supplementary material: The online version of this article (10.1186/s13059-018-1605-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Shuhua Fu
- Department of Internal Medicine, University of Iowa, Iowa City, IA, 52242, USA
- Anqi Wang
- Department of Internal Medicine, University of Iowa, Iowa City, IA, 52242, USA
- Kin Fai Au
- Department of Internal Medicine, University of Iowa, Iowa City, IA, 52242, USA
- Department of Biostatistics, University of Iowa, Iowa City, IA, 52242, USA
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA
38
|
Wong KC. Big data challenges in genome informatics. Biophys Rev 2019; 11:51-54. [PMID: 30684131 DOI: 10.1007/s12551-018-0493-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/17/2018] [Accepted: 12/13/2018] [Indexed: 12/19/2022]
Abstract
In recent years, we have witnessed a big data explosion in genomics, thanks to the improvement in high-throughput technologies at drastically decreasing costs. We are entering the era of millions of available genomes. Notably, each genome can be composed of billions of nucleotides stored as plain text files in gigabytes (GBs). It is undeniable that those genome data impose unprecedented data challenges for us. In this article, we briefly discuss the big data challenges associated with genomics in recent years.
Affiliation(s)
- Ka-Chun Wong
- City University of Hong Kong, Kowloon, Hong Kong.
39
|
Ando T, Matsuda T, Goto K, Hara K, Ito A, Hirata J, Yatomi J, Kajitani R, Okuno M, Yamaguchi K, Kobayashi M, Takano T, Minakuchi Y, Seki M, Suzuki Y, Yano K, Itoh T, Shigenobu S, Toyoda A, Niimi T. Repeated inversions within a pannier intron drive diversification of intraspecific colour patterns of ladybird beetles. Nat Commun 2018; 9:3843. [PMID: 30242156 PMCID: PMC6155092 DOI: 10.1038/s41467-018-06116-1] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 08/25/2017] [Accepted: 08/15/2018] [Indexed: 11/16/2022]
Abstract
How genetic information is modified to generate phenotypic variation within a species is one of the central questions in evolutionary biology. Here we focus on the striking intraspecific diversity of >200 aposematic elytral (forewing) colour patterns of the multicoloured Asian ladybird beetle, Harmonia axyridis, which is regulated by a tightly linked genetic locus h. Our loss-of-function analyses, genetic association studies, de novo genome assemblies, and gene expression data reveal that the GATA transcription factor gene pannier is the major regulatory gene located at the h locus, and suggest that repeated inversions and cis-regulatory modifications at pannier led to the expansion of colour pattern variation in H. axyridis. Moreover, we show that the colour-patterning function of pannier is conserved in the seven-spotted ladybird beetle, Coccinella septempunctata, suggesting that H. axyridis’ extraordinary intraspecific variation may have arisen from ancient modifications in conserved elytral colour-patterning mechanisms in ladybird beetles. The harlequin ladybird beetle, Harmonia axyridis, has remarkable phenotypic diversity, with over 200 colour patterns. Here, Ando et al. show that this patterning is regulated by the transcription factor gene pannier and has diversified by repeated inversions and cis-regulatory modifications of pannier.
Affiliation(s)
- Toshiya Ando
- Division of Evolutionary Developmental Biology, National Institute for Basic Biology, Okazaki, Aichi, 444-8585, Japan
- Department of Basic Biology, School of Life Science, SOKENDAI (The Graduate University for Advanced Studies), Okazaki, Aichi, 444-8585, Japan
- Takeshi Matsuda
- Laboratory of Sericulture and Entomoresources, Graduate School of Bioagricultural Sciences, Nagoya University, Nagoya, Aichi, 464-8601, Japan
- Kumiko Goto
- Laboratory of Sericulture and Entomoresources, Graduate School of Bioagricultural Sciences, Nagoya University, Nagoya, Aichi, 464-8601, Japan
- Kimiko Hara
- Laboratory of Sericulture and Entomoresources, Graduate School of Bioagricultural Sciences, Nagoya University, Nagoya, Aichi, 464-8601, Japan
- Akinori Ito
- Laboratory of Sericulture and Entomoresources, Graduate School of Bioagricultural Sciences, Nagoya University, Nagoya, Aichi, 464-8601, Japan
- Junya Hirata
- Laboratory of Sericulture and Entomoresources, Graduate School of Bioagricultural Sciences, Nagoya University, Nagoya, Aichi, 464-8601, Japan
- Joichiro Yatomi
- Laboratory of Sericulture and Entomoresources, Graduate School of Bioagricultural Sciences, Nagoya University, Nagoya, Aichi, 464-8601, Japan
- Rei Kajitani
- Department of Biological Information, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8550, Japan
- Miki Okuno
- Department of Biological Information, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8550, Japan
- Katsushi Yamaguchi
- NIBB Core Research Facilities, National Institute for Basic Biology, Okazaki, Aichi, 444-8585, Japan
- Masaaki Kobayashi
- Bioinformatics Laboratory, Department of Life Sciences, School of Agriculture, Meiji University, Kawasaki, Kanagawa, 214-8571, Japan
- Tomoyuki Takano
- Bioinformatics Laboratory, Department of Life Sciences, School of Agriculture, Meiji University, Kawasaki, Kanagawa, 214-8571, Japan
- Yohei Minakuchi
- Comparative Genomics Laboratory, National Institute of Genetics, Mishima, Shizuoka, 411-8540, Japan
- Masahide Seki
- Laboratory of Systems Genomics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, 277-8562, Japan
- Yutaka Suzuki
- Laboratory of Systems Genomics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, 277-8562, Japan
- Kentaro Yano
- Bioinformatics Laboratory, Department of Life Sciences, School of Agriculture, Meiji University, Kawasaki, Kanagawa, 214-8571, Japan
- Takehiko Itoh
- Department of Biological Information, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8550, Japan
- Shuji Shigenobu
- Department of Basic Biology, School of Life Science, SOKENDAI (The Graduate University for Advanced Studies), Okazaki, Aichi, 444-8585, Japan
- NIBB Core Research Facilities, National Institute for Basic Biology, Okazaki, Aichi, 444-8585, Japan
- Atsushi Toyoda
- Comparative Genomics Laboratory, National Institute of Genetics, Mishima, Shizuoka, 411-8540, Japan
- Advanced Genomics Center, National Institute of Genetics, Mishima, Shizuoka, 411-8540, Japan
- Teruyuki Niimi
- Division of Evolutionary Developmental Biology, National Institute for Basic Biology, Okazaki, Aichi, 444-8585, Japan
- Department of Basic Biology, School of Life Science, SOKENDAI (The Graduate University for Advanced Studies), Okazaki, Aichi, 444-8585, Japan
- Laboratory of Sericulture and Entomoresources, Graduate School of Bioagricultural Sciences, Nagoya University, Nagoya, Aichi, 464-8601, Japan
40
|
Chen Q, Lan C, Zhao L, Wang J, Chen B, Chen YPP. Recent advances in sequence assembly: principles and applications. Brief Funct Genomics 2018; 16:361-378. [PMID: 28453648 DOI: 10.1093/bfgp/elx006] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 12/20/2022]
Abstract
The application of advanced sequencing technologies and the rapid growth of various sequence data have led to increasing interest in DNA sequence assembly. However, repeats and polymorphism occur frequently in genomes, and each of these has different impacts on assembly. Further, many new applications for sequencing, such as metagenomics regarding multiple species, have emerged in recent years. These not only give rise to higher complexity but also prevent short-read assembly in an efficient way. This article reviews the theoretical foundations that underlie current mapping-based assembly and de novo-based assembly, and highlights the key issues and feasible solutions that need to be considered. It focuses on how individual processes, such as optimal k-mer determination and error correction in assembly, rely on intelligent strategies or high-performance computation. We also survey primary algorithms/software and offer a discussion on the emerging challenges in assembly.
41
|
Lin J, Wei J, Adjeroh D, Jiang BH, Jiang Y. SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform. BMC Bioinformatics 2018; 19:165. [PMID: 29720081 PMCID: PMC5930706 DOI: 10.1186/s12859-018-2155-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 11/19/2017] [Accepted: 04/11/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND Alignment-free sequence similarity analysis methods often lead to significant savings in computational time over alignment-based counterparts. RESULTS A new alignment-free sequence similarity analysis method, called SSAW, is proposed. SSAW stands for Sequence Similarity Analysis using the Stationary Discrete Wavelet Transform (SDWT). It extracts k-mers from a sequence, then maps each k-mer to a complex number field. The resulting series of complex numbers is then transformed into feature vectors using the stationary discrete wavelet transform. After these steps, the original sequence is turned into a feature vector with numeric values, which can then be used for clustering and/or classification. CONCLUSIONS Using two different types of applications, namely clustering and classification, we compared SSAW against the state-of-the-art alignment-free sequence analysis methods. SSAW demonstrates competitive or superior performance in terms of standard indicators, such as accuracy, F-score, precision, and recall. The running time was significantly better in most cases. These results make SSAW a suitable method for sequence analysis, especially given the rapidly increasing volumes of sequence data required by most modern applications.
Affiliation(s)
- Jie Lin
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
- Jing Wei
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
- Donald Adjeroh
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, 26506, WV, USA
- Bing-Hua Jiang
- Department of Pathology, University of Iowa, Iowa City, 52242, Iowa, USA
- Yue Jiang
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
42
|
Choi Y, Chan AP, Kirkness E, Telenti A, Schork NJ. Comparison of phasing strategies for whole human genomes. PLoS Genet 2018; 14:e1007308. [PMID: 29621242 PMCID: PMC5903673 DOI: 10.1371/journal.pgen.1007308] [Citation(s) in RCA: 81] [Impact Index Per Article: 13.5] [Received: 06/07/2017] [Revised: 04/17/2018] [Accepted: 03/13/2018] [Indexed: 12/17/2022]
Abstract
Humans are a diploid species that inherit one set of chromosomes paternally and one homologous set of chromosomes maternally. Unfortunately, most human sequencing initiatives ignore this fact in that they do not directly delineate the nucleotide content of the maternal and paternal copies of the 23 chromosomes individuals possess (i.e., they do not 'phase' the genome) often because of the costs and complexities of doing so. We compared 11 different widely-used approaches to phasing human genomes using the publicly available 'Genome-In-A-Bottle' (GIAB) phased version of the NA12878 genome as a gold standard. The phasing strategies we compared included laboratory-based assays that prepare DNA in unique ways to facilitate phasing as well as purely computational approaches that seek to reconstruct phase information from general sequencing reads and constructs or population-level haplotype frequency information obtained through a reference panel of haplotypes. To assess the performance of the 11 approaches, we used metrics that included, among others, switch error rates, haplotype block lengths, the proportion of fully phase-resolved genes, phasing accuracy and yield between pairs of SNVs. Our comparisons suggest that a hybrid or combined approach that leverages: 1. population-based phasing using the SHAPEIT software suite, 2. either genome-wide sequencing read data or parental genotypes, and 3. a large reference panel of variant and haplotype frequencies, provides a fast and efficient way to produce highly accurate phase-resolved individual human genomes. We found that for population-based approaches, phasing performance is enhanced with the addition of genome-wide read data; e.g., whole genome shotgun and/or RNA sequencing reads. Further, we found that the inclusion of parental genotype data within a population-based phasing strategy can provide as much as a ten-fold reduction in phasing errors. 
We also considered a majority voting scheme for the construction of a consensus haplotype combining multiple predictions for enhanced performance and site coverage. Finally, we also identified DNA sequence signatures associated with the genomic regions harboring phasing switch errors, which included regions of low polymorphism or SNV density.
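One of the metrics named in this abstract, the switch error rate, can be sketched in a few lines. The encoding is an assumption for illustration: each haplotype is a string of 0/1 alleles at heterozygous sites only, both strings cover the same sites, and the function name is hypothetical; the study's own evaluation pipeline may differ.

```python
def switch_error_rate(truth, pred):
    """Fraction of adjacent heterozygous site pairs whose relative phase
    differs between the predicted and the true haplotype."""
    if len(truth) != len(pred):
        raise ValueError("haplotypes must cover the same sites")
    if len(truth) < 2:
        return 0.0
    # agree[i] is True when the predicted allele matches the truth at site i
    agree = [t == p for t, p in zip(truth, pred)]
    # a switch occurs wherever agreement flips between neighbouring sites
    switches = sum(a != b for a, b in zip(agree, agree[1:]))
    return switches / (len(truth) - 1)
```

A prediction that is the exact complement of the truth scores 0.0, because haplotype labels are arbitrary and only the relative phase between neighbouring sites is compared.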
Affiliation(s)
- Yongwook Choi
- J. Craig Venter Institute, Rockville, Maryland, United States of America
- Agnes P. Chan
- J. Craig Venter Institute, Rockville, Maryland, United States of America
- Ewen Kirkness
- Human Longevity, Inc., San Diego, California, United States of America
- Amalio Telenti
- J. Craig Venter Institute, La Jolla, California, United States of America
- Nicholas J. Schork
- J. Craig Venter Institute, La Jolla, California, United States of America
- University of California San Diego, La Jolla, California, United States of America
- The Translational Genomics Research Institute (TGen), Phoenix, Arizona, United States of America
43
|
Hathaway NJ, Parobek CM, Juliano JJ, Bailey JA. SeekDeep: single-base resolution de novo clustering for amplicon deep sequencing. Nucleic Acids Res 2018; 46:e21. [PMID: 29202193 PMCID: PMC5829576 DOI: 10.1093/nar/gkx1201] [Citation(s) in RCA: 84] [Impact Index Per Article: 14.0] [Received: 01/19/2017] [Revised: 11/16/2017] [Accepted: 11/20/2017] [Indexed: 01/08/2023]
Abstract
PCR amplicon deep sequencing continues to transform the investigation of genetic diversity in viral, bacterial, and eukaryotic populations. In eukaryotic populations such as Plasmodium falciparum infections, it is important to discriminate sequences differing by a single nucleotide polymorphism. In bacterial populations, single-base resolution can provide improved resolution towards species and strains. Here, we introduce the SeekDeep suite built around the qluster algorithm, which is capable of accurately building de novo clusters representing true, biological local haplotypes differing by just a single base. It outperforms current software, particularly at low frequencies and at low input read depths, whether resolving single-base differences or traditional OTUs. SeekDeep is open source and works with all major sequencing technologies, making it broadly useful in a wide variety of applications of amplicon deep sequencing to extract accurate and maximal biologic information.
Affiliation(s)
- Nicholas J Hathaway
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA
- Christian M Parobek
- Curriculum in Genetics and Molecular Biology, University of North Carolina School of Medicine, Chapel Hill, NC, USA
- Jonathan J Juliano
- Curriculum in Genetics and Molecular Biology, University of North Carolina School of Medicine, Chapel Hill, NC, USA
- Division of Infectious Diseases, Department of Medicine, University of North Carolina, Chapel Hill, NC, USA
- Jeffrey A Bailey
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA
- Division of Transfusion Medicine, Department of Medicine, University of Massachusetts Medical School, Worcester, MA, USA
44
|
Ivády G, Madar L, Dzsudzsák E, Koczok K, Kappelmayer J, Krulisova V, Macek M, Horváth A, Balogh I. Analytical parameters and validation of homopolymer detection in a pyrosequencing-based next generation sequencing system. BMC Genomics 2018; 19:158. [PMID: 29466940 PMCID: PMC5822529 DOI: 10.1186/s12864-018-4544-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 01/20/2017] [Accepted: 02/13/2018] [Indexed: 01/14/2023]
Abstract
Background Current technologies in next-generation sequencing offer high-throughput reads at low cost, but still suffer from various sequencing errors. Although pyrosequencing and ion semiconductor sequencing both have the advantage of delivering long and high-quality reads, problems can occur when sequencing homopolymer-containing regions, since the repeated identical bases are incorporated during the same synthesis cycle, which leads to uncertainty in base calling. The aim of this study was to evaluate the analytical performance of a pyrosequencing-based next-generation sequencing system in detecting homopolymer sequences using homopolymer-preintegrated plasmid constructs and human DNA samples originating from patients with cystic fibrosis. Results In the plasmid system, average correct genotyping was 95.8% in 4-mers, 87.4% in 5-mers and 72.1% in 6-mers. Despite the low genotyping accuracy observed in 5- and 6-mers, it was possible to generate amplicons with more than a 90% adequate detection rate in every homopolymer tract. When homopolymers in the CFTR gene were sequenced, average accuracy was 89.3%, but varied over a wide range (52.2 – 99.1%). In all but one case, an optimal amplicon-sequencing primer combination could be identified. In that single case (7A tract in exon 14 (c.2046_2052)), none of the tested primer sets produced the required analytical performance. Conclusions Our results show that pyrosequencing is most reliable for 4-mers, and that accuracy deteriorates as homopolymer length increases. With careful primer selection, the NGS system was able to correctly genotype all but one of the homopolymers in the CFTR gene. In conclusion, we configured a plasmid test system that can be used to assess the genotyping accuracy of NGS devices and developed an accurate NGS assay for the molecular diagnosis of CF using self-designed primers for amplification and sequencing. 
Electronic supplementary material The online version of this article (10.1186/s12864-018-4544-x) contains supplementary material, which is available to authorized users.
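Locating the homopolymer tracts that such an assay has to cover is straightforward to sketch. The helper below is a hypothetical illustration and not part of the study's workflow; it reports every run of identical bases at or above a chosen length, using 0-based start positions.

```python
import re

def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for each run of one base with length >= min_len."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # ([ACGT]) captures a base; \1{n,} requires at least n further copies of it
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]

runs = homopolymer_runs("GGAAAAACTTTTTTTG", min_len=4)
```

For the toy sequence above this finds a 5-base A tract and a 7-base T tract; raising `min_len` to 6 would leave only the T tract.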
Affiliation(s)
- Gergely Ivády
- Department of Laboratory Medicine, University of Debrecen, Nagyerdei krt. 98, Debrecen, H-4032, Hungary
- László Madar
- Department of Laboratory Medicine, University of Debrecen, Nagyerdei krt. 98, Debrecen, H-4032, Hungary
- Erika Dzsudzsák
- Department of Laboratory Medicine, University of Debrecen, Nagyerdei krt. 98, Debrecen, H-4032, Hungary
- Katalin Koczok
- Department of Laboratory Medicine, University of Debrecen, Nagyerdei krt. 98, Debrecen, H-4032, Hungary
- Division of Clinical Genetics, University of Debrecen, Nagyerdei krt. 98, Debrecen, H-4032, Hungary
- János Kappelmayer
- Department of Laboratory Medicine, University of Debrecen, Nagyerdei krt. 98, Debrecen, H-4032, Hungary
- Veronika Krulisova
- Department of Biology and Medical Genetics, Second Faculty of Medicine and University Hospital Motol, Charles University, Prague, Czech Republic
- Milan Macek
- Department of Biology and Medical Genetics, Second Faculty of Medicine and University Hospital Motol, Charles University, Prague, Czech Republic
- Attila Horváth
- Genomic Medicine and Bioinformatic Core Facility, University of Debrecen, Debrecen, Hungary
- István Balogh
- Department of Laboratory Medicine, University of Debrecen, Nagyerdei krt. 98, Debrecen, H-4032, Hungary
- Division of Clinical Genetics, University of Debrecen, Nagyerdei krt. 98, Debrecen, H-4032, Hungary
45
|
Lee B, Min H, Yoon S. MUGAN: multi-GPU accelerated AmpliconNoise server for rapid microbial diversity assessment. Bioinformatics 2018; 37:1562-1570. [PMID: 29474530 DOI: 10.1093/bioinformatics/bty096] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/19/2017] [Revised: 02/09/2018] [Accepted: 02/18/2018] [Indexed: 11/13/2022]
Abstract
Motivation
Metagenomic sequencing has become a crucial tool for obtaining a gene catalogue of operational taxonomic units (OTUs) in a microbial community. A typical metagenomic sequencing produces a large amount of data (often in the order of terabytes or more), and computational tools are indispensable for efficient processing. In particular, error correction in metagenomics is crucial for accurate and robust genetic cataloging of microbial communities. However, many existing error-correction tools take a prohibitively long time and often bottleneck the whole analysis pipeline.
Results
To overcome this computational hurdle, we analyzed and exploited the data-level parallelism that exists in the error-correction procedure and proposed a tool named MUGAN that uses both multi-core central processing units and multiple graphics processing units for co-processing. According to the experimental results, our approach reduced not only the time needed for denoising amplicons, from approximately 59 h to only 46 min, but also the overestimation of the number of OTUs, estimating 6.7 times fewer species-level OTUs than the baseline. In addition, our approach provides intuitive web-based visualization of results. Given its efficiency and convenience, we anticipate that our approach will greatly facilitate denoising efforts in metagenomics studies.
Availability and implementation
http://data.snu.ac.kr/pub/mugan
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Byunghan Lee
- Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Korea
- Hyeyoung Min
- College of Pharmacy, Chung-Ang University, Seoul 06974, Korea
- Sungroh Yoon
- Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Korea
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, Korea
46
|
Wang JR, Holt J, McMillan L, Jones CD. FMLRC: Hybrid long read error correction using an FM-index. BMC Bioinformatics 2018; 19:50. [PMID: 29426289 PMCID: PMC5807796 DOI: 10.1186/s12859-018-2051-3] [Citation(s) in RCA: 68] [Impact Index Per Article: 11.3] [Received: 02/08/2017] [Accepted: 02/01/2018] [Indexed: 11/16/2022] Open
Abstract
Background: Long read sequencing is changing the landscape of genomic research, especially de novo assembly. Despite the high error rate inherent to long read technologies, increased read lengths dramatically improve the continuity and accuracy of genome assemblies. However, the cost and throughput of these technologies limit their application to complex genomes. One solution is to decrease the cost and time to assemble novel genomes by leveraging “hybrid” assemblies that use long reads for scaffolding and short reads for accuracy. Results: We describe a novel method leveraging a multi-string Burrows-Wheeler Transform with auxiliary FM-index to correct errors in long read sequences using a set of complementary short reads. We demonstrate that our method efficiently produces significantly more high quality corrected sequence than existing hybrid error-correction methods. We also show that our method produces more contiguous assemblies, in many cases, than existing state-of-the-art hybrid and long-read only de novo assembly methods. Conclusion: Our method accurately corrects long read sequence data using complementary short reads. We demonstrate higher total throughput of corrected long reads and a corresponding increase in contiguity of the resulting de novo assemblies. Improved throughput and computational efficiency relative to existing methods will help make more economical use of emerging long read sequencing technologies.
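As a rough illustration of the FM-index idea named in the abstract above, the following Python sketch counts exact occurrences of a pattern using backward search over a Burrows-Wheeler Transform. It is a toy under stated assumptions, not the FMLRC implementation: the function names `bwt` and `count_occurrences` are hypothetical, a single string stands in for the multi-string BWT, and the rank queries are done by naive counting rather than with sampled occurrence tables.

```python
def bwt(text):
    """Burrows-Wheeler Transform of text, using '$' as the end sentinel."""
    text += "$"
    # Naive construction: sort all rotations (fine for a toy example only).
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact matches of pattern via FM-index backward search."""
    # first[c]: index of the first row starting with c in the sorted rotations.
    first = {}
    for i, c in enumerate(sorted(bwt_str)):
        first.setdefault(c, i)

    def occ(c, i):
        # Occurrences of c in bwt_str[:i]; a real index precomputes this.
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("ACGTACGT")
print(count_occurrences(index, "ACG"))
print(count_occurrences(index, "TT"))
```

In a hybrid corrector, counts of this kind taken over the short-read set could indicate which k-mers in a long read are well supported; how FMLRC actually uses them is not described in the abstract.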
Affiliation(s)
- Jeremy R Wang
  - Department of Genetics, University of North Carolina at Chapel Hill, CB 3280, 3144 Genome Sciences Building, 250 Bell Tower Dr, Chapel Hill, 27599, NC, USA
- James Holt
  - Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Leonard McMillan
  - Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Corbin D Jones
  - Department of Biology and Integrative Program for Biological and Genome Sciences, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
|
47
|
A Perfect Match Genomic Landscape Provides a Unified Framework for the Precise Detection of Variation in Natural and Synthetic Haploid Genomes. Genetics 2018; 208:1631-1641. [PMID: 29367403 PMCID: PMC5887153 DOI: 10.1534/genetics.117.300589] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 12/05/2017] [Accepted: 01/19/2018] [Indexed: 01/13/2023] Open
Abstract
We present a conceptually simple, sensitive, precise, and essentially nonstatistical solution for the analysis of genome variation in haploid organisms. The generation of a Perfect Match Genomic Landscape (PMGL), which computes intergenome identity with single nucleotide resolution, reveals signatures of variation wherever a query genome differs from a reference genome. Such signatures encode the precise location of different types of variants, including single nucleotide variants, deletions, insertions, and amplifications, effectively introducing the concept of a general signature of variation. The precise nature of variants is then resolved through the generation of targeted alignments between specific sets of sequence reads and known regions of the reference genome. Thus, the perfect match logic decouples the identification of the location of variants from the characterization of their nature, providing a unified framework for the detection of genome variation. We assessed the performance of the PMGL strategy via simulation experiments. We determined the variation profiles of natural genomes and of a synthetic chromosome, both in the context of haploid yeast strains. Our approach uncovered variants that have previously escaped detection. Moreover, our strategy is ideally suited for further refining high-quality reference genomes. The source codes for the automated PMGL pipeline have been deposited in a public repository.
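The perfect-match idea described above can be pictured with a small Python sketch. It is only an illustration under assumptions, not the authors' PMGL pipeline: the function name `perfect_match_landscape`, the use of fixed-length k-mers, and the toy sequences are all hypothetical. For each reference position it counts how many read k-mers match the reference k-mer starting there exactly; a run of zeros marks a candidate location of variation.

```python
from collections import Counter


def perfect_match_landscape(reference, reads, k):
    """Per-position count of read k-mers that exactly match the reference."""
    # Tally every k-mer observed in the query reads.
    kmer_counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    # Counter returns 0 for k-mers that never occur in the reads.
    return [
        kmer_counts[reference[i:i + k]]
        for i in range(len(reference) - k + 1)
    ]


reference = "ACGTACGGTA"
# Toy query: one read spanning a genome with a G->C change at position 6.
reads = ["ACGTACCGTA"]
print(perfect_match_landscape(reference, reads, k=4))
```

In this toy case the landscape drops to zero for every reference k-mer that overlaps the changed base, which locates the variant without saying what it is; resolving its nature would need a separate targeted alignment, as the abstract describes.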
|
48
|
Urbina H, Breed MF, Zhao W, Lakshmi Gurrala K, Andersson SGE, Ågren J, Baldauf S, Rosling A. Specificity in Arabidopsis thaliana recruitment of root fungal communities from soil and rhizosphere. Fungal Biol 2018; 122:231-240. [PMID: 29551197 DOI: 10.1016/j.funbio.2017.12.013] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 11/30/2017] [Accepted: 12/23/2017] [Indexed: 01/16/2023]
Abstract
Biotic and abiotic conditions in soil pose major constraints on growth and reproductive success of plants. Fungi are important agents in plant-soil interactions but the belowground mycobiota associated with plants remains poorly understood. We grew one genotype each from Sweden and Italy of the widely studied plant model Arabidopsis thaliana. Plants were grown under controlled conditions in organic topsoil local to the Swedish genotype, and harvested after ten weeks. Total DNA was extracted from three belowground compartments: endosphere (sonicated roots), rhizosphere and bulk soil, and fungal communities were characterized from each by amplification and sequencing of the fungal barcode region ITS2. Fungal species diversity was found to decrease from bulk soil to rhizosphere to endosphere. A significant effect of plant genotype on fungal community composition was detected only in the endosphere compartment. Despite A. thaliana being a non-mycorrhizal plant, it hosts a number of known mycorrhizal fungi in its endosphere compartment, which is also colonized by endophytic, pathogenic and saprotrophic fungi. Species in the Archaeorhizomycetes were most abundant in rhizosphere samples, suggesting an adaptation to environments with high nutrient turnover for some of these species. We conclude that A. thaliana endosphere fungal communities represent a selected subset of fungi recruited from soil and that plant genotype has small but significant quantitative and qualitative effects on these communities.
Affiliation(s)
- Hector Urbina
  - Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, SE-75236, Uppsala, Sweden; Department of Botany and Plant Pathology, Purdue University, 915 W State St, West Lafayette, IN, 47907, USA
- Martin F Breed
  - Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, SE-75236, Uppsala, Sweden; School of Biological Sciences and the Environment Institute, University of Adelaide, North Terrace, SA-5005, Australia
- Weizhou Zhao
  - Department of Molecular Evolution, Cell and Molecular Biology, Uppsala University, Husargatan 3, SE-75124, Uppsala, Sweden
- Kanaka Lakshmi Gurrala
  - Department of Molecular Evolution, Cell and Molecular Biology, Uppsala University, Husargatan 3, SE-75124, Uppsala, Sweden
- Siv G E Andersson
  - Department of Molecular Evolution, Cell and Molecular Biology, Uppsala University, Husargatan 3, SE-75124, Uppsala, Sweden
- Jon Ågren
  - Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, SE-75236, Uppsala, Sweden
- Sandra Baldauf
  - Department of Organismal Biology, Uppsala University, Norbyvägen 18D, SE-75236, Uppsala, Sweden
- Anna Rosling
  - Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, SE-75236, Uppsala, Sweden
|
49
|
Liu Y, Lan C, Blumenstein M, Li J. Bi-level error correction for PacBio long reads. IEEE/ACM Trans Comput Biol Bioinform 2017; 17:899-905. [PMID: 29990239 DOI: 10.1109/tcbb.2017.2780832] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The latest sequencing technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, can generate long reads thousands of bases in length, much longer than the reads of a few hundred bases generated by Illumina machines. However, these long reads are prone to much higher error rates, for example 15%, making downstream analysis and applications very difficult. Error correction is a process to improve the quality of sequencing data. Hybrid correction strategies have recently been proposed that use Illumina reads with low error rates to fix sequencing errors in the noisy long reads, with good performance. In this paper, we propose a new method named Bicolor, a bi-level framework of hybrid error correction for further improving the quality of PacBio long reads. At the first level, our method uses a de Bruijn graph-based error correction idea to search paths in pairs of solid k-mers iteratively with an increasing length of k-mer. At the second level, we combine the processed results under different parameters from the first level. In particular, a multiple sequence alignment algorithm is used to align those similar long reads, followed by a voting algorithm which determines the final base at each position of the reads. We compare the performance of Bicolor with three state-of-the-art methods on three real data sets. Results demonstrate that Bicolor always achieves the highest identity ratio. Bicolor also achieves a higher alignment ratio () and a higher number of aligned reads than the current methods on two data sets. On the third data set, our method is closely competitive with the current methods in terms of number of aligned reads and genome coverage. The C++ source code of our algorithm is freely available at https://github.com/yuansliu/Bicolor.
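The second-level voting step mentioned above can be pictured with a minimal Python sketch. It is an assumption-laden illustration rather than Bicolor's C++ code: the function name `majority_vote` is hypothetical, the input is assumed to be rows of an already computed multiple sequence alignment with `-` marking gaps, and ties are broken by whichever symbol appears first in the column.

```python
from collections import Counter


def majority_vote(aligned_reads):
    """Consensus of equal-length aligned rows; '-' marks an alignment gap."""
    if not aligned_reads:
        return ""
    width = len(aligned_reads[0])
    if any(len(row) != width for row in aligned_reads):
        raise ValueError("all aligned rows must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Most frequent symbol in this column (first seen wins on ties).
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != "-":
            # Columns where the gap wins contribute no base.
            consensus.append(symbol)
    return "".join(consensus)


# Three corrected versions of the same read, aligned column by column.
rows = ["ACGT-A", "ACGTTA", "AAGT-A"]
print(majority_vote(rows))
```

Here the rows would correspond to the same long read corrected under different first-level parameters; a weighted vote or a different tie-break rule would fit the same structure.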
|
50
|
Abstract
Gene splicing is the process of assembling a large number of unordered short sequence fragments to the original genome sequence as accurately as possible. Several popular splicing algorithms based on reads are reviewed in this article, including reference genome algorithms and de novo splicing algorithms (Greedy-extension, Overlap-Layout-Consensus graph, De Bruijn graph). We also discuss a new splicing method based on the MapReduce strategy and Hadoop. By comparing these algorithms, some conclusions are drawn and some suggestions on gene splicing research are made.
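Of the de novo approaches listed above, the De Bruijn graph is the easiest to show in a few lines. The Python sketch below is a simplified illustration, not code from the reviewed article: the function names `de_bruijn_graph` and `walk_unbranched` are hypothetical, reads are assumed to be error-free, and only a single unbranched path is followed rather than a full Eulerian traversal.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it precedes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Spell a contig by following nodes that have exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in seen:
            break  # stop on a cycle instead of looping forever
        contig += successor[-1]
        seen.add(successor)
        node = successor
    return contig


reads = ["ACGTT", "GTTCA"]
graph = de_bruijn_graph(reads, k=3)
print(walk_unbranched(graph, "AC"))
```

The two overlapping reads are merged into one contig because they share the k-mer `GTT`; with real data, sequencing errors and repeats create branches that the unbranched walk stops at.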
Affiliation(s)
- Xiuhua Si
  - Department of Computer Science & Technology, Heilongjiang University, Harbin, China
- Qian Wang
  - Shandong Aerospace Institute of Electronic Technology, Yantai, China
- Lei Zhang
  - Department of Computer Science & Technology, Heilongjiang University, Harbin, China
- Ruo Wu
  - Department of Computer Science & Technology, Heilongjiang University, Harbin, China
- Jiquan Ma
  - Department of Computer Science & Technology, Heilongjiang University, Harbin, China
|