1
Tian Q, Zhang P, Zhai Y, Wang Y, Zou Q. Application and Comparison of Machine Learning and Database-Based Methods in Taxonomic Classification of High-Throughput Sequencing Data. Genome Biol Evol 2024; 16:evae102. [PMID: 38748485 PMCID: PMC11135637 DOI: 10.1093/gbe/evae102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/12/2024] [Indexed: 05/30/2024]
Abstract
The advent of high-throughput sequencing technologies has not only revolutionized the field of bioinformatics but has also heightened the demand for efficient taxonomic classification. Despite technological advancements, efficiently processing and analyzing the deluge of sequencing data for precise taxonomic classification remains a formidable challenge. Existing classification approaches primarily fall into two categories, database-based methods and machine learning methods, each presenting its own set of challenges and advantages. On this basis, the aim of our study was to conduct a comparative analysis between these two methods while also investigating the merits of integrating multiple database-based methods. Through an in-depth comparative study, we evaluated the performance of both methodological categories in taxonomic classification by utilizing simulated data sets. Our analysis revealed that database-based methods excel in classification accuracy when backed by a rich and comprehensive reference database. Conversely, while machine learning methods show superior performance in scenarios where reference sequences are sparse or lacking, they generally show inferior performance compared with database methods under most conditions. Moreover, our study confirms that integrating multiple database-based methods does, in fact, enhance classification accuracy. These findings shed new light on the taxonomic classification of high-throughput sequencing data and bear substantial implications for the future development of computational biology. For those interested in further exploring our methods, the source code of this study is publicly available on https://github.com/LoadStar822/Genome-Classifier-Performance-Evaluator. Additionally, a dedicated webpage showcasing our collected database, data sets, and various classification software can be found at http://lab.malab.cn/~tqz/project/taxonomic/.
Affiliation(s)
- Qinzhong Tian
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
- Pinglu Zhang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
- Yixiao Zhai
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
- Yansu Wang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
2
Trecarten S, Fongang B, Liss M. Current Trends and Challenges of Microbiome Research in Prostate Cancer. Curr Oncol Rep 2024; 26:477-487. [PMID: 38573440 DOI: 10.1007/s11912-024-01520-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/18/2024] [Indexed: 04/05/2024]
Abstract
PURPOSE OF REVIEW: The role of the gut microbiome in prostate cancer is an emerging area of research interest. However, no single causative organism has yet been identified. The goal of this paper is to examine the role of the microbiome in prostate cancer and summarize the challenges relating to methodology in specimen collection, sequencing technology, and interpretation of results. RECENT FINDINGS: Significant heterogeneity still exists in methodology for stool sampling/storage, preservative options, DNA extraction, and sequencing database selection/in silico processing. Debate persists over primer choice in amplicon sequencing as well as optimal methods for data normalization. Statistical methods for longitudinal microbiome analysis continue to undergo refinement. While standardization of methodology may help yield more consistent results for organism identification in prostate cancer, this is a difficult task due to considerable procedural variation at each step in the process. Further reproducibility and methodology research is required.
Affiliation(s)
- Shaun Trecarten
  - Department of Urology, UT Health San Antonio, 7703 Floyd Curl Dr, San Antonio, TX, 78229, USA
- Bernard Fongang
  - Glenn Biggs Institute for Alzheimer's & Neurodegenerative Diseases, UT Health San Antonio, San Antonio, TX, USA
  - Department of Biochemistry and Structural Biology, UT Health San Antonio, San Antonio, TX, USA
  - Department of Population Health Sciences, UT Health San Antonio, San Antonio, TX, USA
- Michael Liss
  - Department of Urology, UT Health San Antonio, 7703 Floyd Curl Dr, San Antonio, TX, 78229, USA
3
Liao C, Wang L, Quon G. Microbiome-based classification models for fresh produce safety and quality evaluation. Microbiol Spectr 2024; 12:e0344823. [PMID: 38445872 PMCID: PMC10986475 DOI: 10.1128/spectrum.03448-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Accepted: 02/17/2024] [Indexed: 03/07/2024]
Abstract
Small sample sizes and loss of sequencing reads during the microbiome data preprocessing can limit the statistical power of differentiating fresh produce phenotypes and prevent the detection of important bacterial species associated with produce contamination or quality reduction. Here, we explored a machine learning-based k-mer hash analysis strategy to identify DNA signatures predictive of produce safety (PS) and produce quality (PQ) and compared it against the amplicon sequence variant (ASV) strategy that uses a typical denoising step and ASV-based taxonomy strategy. Random forest-based classifiers for PS and PQ using 7-mer hash data sets had significantly higher classification accuracy than those using the ASV data sets. We also demonstrated that the proposed combination of integrating multiple data sets and leveraging a 7-mer hash strategy leads to better classification performance for PS and PQ compared to the ASV method but presents lower PS classification accuracy compared to the feature-selected ASV-based taxonomy strategy. Due to the current limitation of generating taxonomy using the 7-mer hash strategy, the ASV-based taxonomy strategy with remarkably less computing time and memory usage is more efficient for PS and PQ classification and applicable for important taxa identification. Results generated from this study lay the foundation for future studies that wish and need to incorporate and/or compare different microbiome sequencing data sets for the application of machine learning in the area of microbial safety and quality of food. IMPORTANCE Identification of generalizable indicators for produce safety (PS) and produce quality (PQ) improves the detection of produce contamination and quality decline. However, effective sequencing read loss during microbiome data preprocessing and the limited sample size of individual studies restrain statistical power to identify important features contributing to differentiating PS and PQ phenotypes. 
We applied machine learning-based models using individual and integrated k-mer hash and amplicon sequence variant (ASV) data sets for PS and PQ classification and evaluated their classification performance. Random forest (RF)-based models using integrated 7-mer hash data sets achieved significantly higher PS and PQ classification accuracy than models using the ASV data sets. Because taxonomic analysis is limited for the 7-mer hash, we also developed RF-based models using feature-selected ASV-based taxonomic data sets, which achieved better PS classification than those using the integrated 7-mer hash data set. The RF feature selection method identified 480 PS indicators and 263 PQ indicators with a positive contribution to the PS and PQ classification.
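As a rough illustration of the kind of k-mer feature that the abstract above refers to, the following Python sketch counts overlapping 7-mers in sequencing reads. It is a minimal sketch under stated assumptions: the function names `kmer_counts` and `kmer_profile` are hypothetical, windows containing ambiguous bases are simply skipped, and neither the hashing step nor the random forest classifier used in the study is reproduced here.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int = 7) -> Counter:
    """Count overlapping k-mers in one DNA sequence.

    Windows containing anything other than A, C, G or T are skipped
    (an assumption of this sketch, not a detail taken from the study).
    """
    seq = sequence.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_profile(reads, k: int = 7) -> Counter:
    """Sum k-mer counts over all reads of a sample (hypothetical helper)."""
    total: Counter = Counter()
    for read in reads:
        total.update(kmer_counts(read, k))
    return total
```

A per-sample profile such as `kmer_profile(reads)` could then be turned into a fixed-length feature vector for a classifier; how the study hashes or normalizes these counts is not stated in the abstract.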
Affiliation(s)
- Chao Liao
  - Department of Food Science and Technology, University of California Davis, Davis, California, USA
- Luxin Wang
  - Department of Food Science and Technology, University of California Davis, Davis, California, USA
- Gerald Quon
  - Department of Molecular and Cellular Biology, University of California Davis, Davis, California, USA
4
Li R, Ernst J. Identifying associations of de novo noncoding variants with autism through integration of gene expression, sequence and sex information. bioRxiv 2024:2024.03.20.585624. [PMID: 38562739 PMCID: PMC10983996 DOI: 10.1101/2024.03.20.585624] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Whole-genome sequencing (WGS) data is facilitating genome-wide identification of rare noncoding variants, while elucidating their roles in disease remains challenging. Towards this end, we first revisit a reported significant brain-related association signal of autism spectrum disorder (ASD) detected from de novo noncoding variants attributed to deep-learning and show that local GC content can capture similar association signals. We further show that the association signal appears driven by variants from male proband-female sibling pairs that are upstream of assigned genes. We then develop Expression Neighborhood Sequence Association Study (ENSAS), which utilizes gene expression correlations and sequence information, to more systematically identify phenotype-associated variant sets. Applying ENSAS to the same set of de novo variants, we identify gene expression-based neighborhoods showing significant ASD association signal, enriched for synapse-related gene ontology terms. For these top neighborhoods, we also identify chromatin states annotations of variants that are predictive of the proband-sibling local GC content differences. Our work provides new insights into associations of non-coding de novo mutations in ASD and presents an analytical framework applicable to other phenotypes.
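The local GC content mentioned in the abstract above can be illustrated with a short Python sketch. The window size (`flank`) and the function name `local_gc_content` are assumptions made for illustration; the abstract does not state how the authors define the local window.

```python
import math


def local_gc_content(sequence: str, position: int, flank: int = 50) -> float:
    """Fraction of G/C bases in a window centered on a 0-based position.

    The window spans `flank` bases on each side and is clipped at the
    sequence ends. Bases other than A, C, G, T are ignored. Returns NaN
    if the window holds no called bases.
    """
    start = max(0, position - flank)
    end = min(len(sequence), position + flank + 1)
    window = sequence[start:end].upper()
    called = [base for base in window if base in "ACGT"]
    if not called:
        return math.nan
    return sum(base in "GC" for base in called) / len(called)
```

Comparing this value between a proband variant and a sibling variant would give one simple per-pair difference; the statistical test applied in the study is not described in the abstract and is not sketched here.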
Affiliation(s)
- Runjia Li
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, CA, USA
- Jason Ernst
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, CA, USA
  - Department of Biological Chemistry, University of California, Los Angeles, CA, USA
  - Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research at University of California, Los Angeles, CA, USA
  - Computer Science Department, University of California, Los Angeles, CA, USA
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, CA, USA
  - Molecular Biology Institute, University of California, Los Angeles, CA, USA
  - Department of Computational Medicine, University of California, Los Angeles, CA, USA
5
Ecological Observations Based on Functional Gene Sequencing Are Sensitive to the Amplicon Processing Method. mSphere 2022; 7:e0032422. [PMID: 35938727 PMCID: PMC9429940 DOI: 10.1128/msphere.00324-22] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/04/2023]
Abstract
Until recently, the de facto method for short-read-based amplicon reconstruction was a sequence similarity threshold approach (operational taxonomic units [OTUs]). This has changed with the amplicon sequence variant (ASV) method where distributions are fitted to abundance profiles of individual genes using a noise-error model. While OTU-based approaches are still useful for 16S rRNA/18S rRNA genes, where thresholds of 97% to 99% are used, their use for functional genes is still debatable as there is no consensus on clustering thresholds. Here, we compare OTU- and ASV-based reconstruction approaches and taxonomy assignment methods, the naive Bayesian classifier (NBC) and Bayesian lowest common ancestor (BLCA) algorithm, using a functional gene data set from the microbial nitrogen-cycling community in the Brouage mudflat (France). A range of OTU similarity thresholds and ASVs were used to compare amoA (ammonia-oxidizing archaea [AOA] and ammonia-oxidizing bacteria [AOB]), nxrB, nirS, nirK, and nrfA communities between differing sedimentary structures. Significant effects of the sedimentary structure on weighted UniFrac (WUniFrac) distances were observed for AOA amoA when using ASVs, an OTU at a threshold of 97% sequence identity (OTU-97%), and OTU-85%; AOB amoA when using OTU-85%; and nirS when using ASV, OTU-90%, and OTU-85%. For AOB amoA, significant effects of the sedimentary structures on UniFrac distances were observed when using OTU-97% but not ASVs, and the inverse was found for nrfA. Interestingly, conclusions drawn for nirK and nxrB were consistent between amplicon reconstruction methods. We also show that when the sequences in the reference database are related to the environment in question, the BLCA algorithm leads to more phylogenetically relevant classifications. However, when the reference database contains sequences more dissimilar to the ones retrieved, the NBC obtains more information. 
IMPORTANCE Several analysis pipelines are available to microbial ecologists to process amplicon sequencing data, yet to date, there is no consensus as to the most appropriate method, and it becomes more difficult for genes that encode a specific function (functional genes). Standardized approaches need to be adopted to increase the reliability and reproducibility of environmental amplicon-sequencing-based data sets. In this paper, we argue that the recently developed ASV approach offers a better opportunity to achieve such standardization than OTUs for functional genes. We also propose a comprehensive framework for quality filtering of the sequencing reads based on protein sequence verification.
6
Wang H, Wang S, Zhang Y, Bi S, Zhu X. A brief review of machine learning methods for RNA methylation sites prediction. Methods 2022; 203:399-421. [DOI: 10.1016/j.ymeth.2022.03.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/31/2021] [Revised: 02/15/2022] [Accepted: 03/01/2022] [Indexed: 02/07/2023]
7
Almeida H, Palys S, Tsang A, Diallo AB. TOUCAN: a framework for fungal biosynthetic gene cluster discovery. NAR Genom Bioinform 2020; 2:lqaa098. [PMID: 33575642 PMCID: PMC7694738 DOI: 10.1093/nargab/lqaa098] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 07/24/2020] [Revised: 09/28/2020] [Accepted: 11/05/2020] [Indexed: 12/23/2022]
Abstract
Fungal secondary metabolites (SMs) are an important source of numerous bioactive compounds largely applied in the pharmaceutical industry, as in the production of antibiotics and anticancer medications. The discovery of novel fungal SMs can potentially benefit human health. Identifying biosynthetic gene clusters (BGCs) involved in the biosynthesis of SMs can be a costly and complex task, especially due to the genomic diversity of fungal BGCs. Previous studies on fungal BGC discovery are limited in scope, which can restrict the discovery of new BGCs. In this work, we introduce TOUCAN, a supervised learning framework for fungal BGC discovery. Unlike previous methods, TOUCAN is capable of predicting BGCs on amino acid sequences, facilitating its use on newly sequenced and not yet curated data. It relies on three main pillars: rigorous selection of datasets by BGC experts; a combination of functional, evolutionary and compositional features coupled with high-performing classifiers; and robust post-processing methods. TOUCAN's best-performing model yields an F-measure of 0.982 on BGC regions in the Aspergillus niger genome. Overall results show that TOUCAN outperforms previous approaches. TOUCAN focuses on fungal BGCs but can be easily adapted to expand its scope to process other species or include new features.
Affiliation(s)
- Hayda Almeida
  - Departement d'Informatique, UQAM, Montréal, QC, H2X 3Y7, Canada
- Sylvester Palys
  - Centre for Structural and Functional Genomics, Concordia University, Montréal, QC, H4B 1R6, Canada
- Adrian Tsang
  - Departement d'Informatique, UQAM, Montréal, QC, H2X 3Y7, Canada
8
Chen X, Xiong Y, Liu Y, Chen Y, Bi S, Zhu X. m5CPred-SVM: a novel method for predicting m5C sites of RNA. BMC Bioinformatics 2020; 21:489. [PMID: 33126851 PMCID: PMC7602301 DOI: 10.1186/s12859-020-03828-4] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 06/30/2020] [Accepted: 10/21/2020] [Indexed: 02/08/2023]
Abstract
BACKGROUND: As one of the most common post-transcriptional modifications (PTCM) in RNA, 5-cytosine-methylation plays important roles in many biological functions such as RNA metabolism and cell fate decision. Through accurate identification of 5-methylcytosine (m5C) sites on RNA, researchers can better understand the exact role of 5-cytosine-methylation in these biological functions. In recent years, computational methods of predicting m5C sites have attracted considerable interest because of their efficiency and low cost. However, neither the accuracy nor the efficiency of these methods is satisfactory yet, and both need further improvement. RESULTS: In this work, we have developed a new computational method, m5CPred-SVM, to identify m5C sites in three species, H. sapiens, M. musculus and A. thaliana. To build this model, we first collected benchmark datasets following three recently published methods. Then, six types of sequence-based features were generated based on RNA segments and the sequential forward feature selection strategy was used to obtain the optimal feature subset. After that, the performance of models based on different learning algorithms was compared, and the model based on the support vector machine provided the highest prediction accuracy. Finally, our proposed method, m5CPred-SVM, was compared with several existing methods, and the result showed that m5CPred-SVM offered substantially higher prediction accuracy than previously published methods. It is expected that our method, m5CPred-SVM, can become a useful tool for accurate identification of m5C sites. CONCLUSION: In this study, by introducing position-specific propensity related features, we built a new model, m5CPred-SVM, to predict RNA m5C sites of three different species. The result shows that our model outperformed the existing state-of-the-art models. Our model is available for users through a web server at https://zhulab.ahu.edu.cn/m5CPred-SVM.
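The sequential forward feature selection strategy named in the abstract above can be sketched generically in Python. This is a minimal sketch, not the authors' implementation: `sequential_forward_selection` and `toy_score` are hypothetical names, and the scoring function stands in for whatever cross-validated model accuracy a real study would use.

```python
from typing import Callable, List, Sequence, Tuple

# Toy additive scores used only to demonstrate the search (assumed values).
TOY_WEIGHTS = {"a": 5, "b": 3, "c": -1}


def toy_score(subset: Sequence[str]) -> int:
    """Stand-in for model accuracy: sum of fixed per-feature weights."""
    return sum(TOY_WEIGHTS[name] for name in subset)


def sequential_forward_selection(
    features: Sequence[str],
    score: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float]:
    """Greedily add the feature that most improves the score.

    Stops as soon as no remaining feature strictly improves the score.
    """
    selected: List[str] = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining:
        # Score every one-feature extension of the current subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score
```

With the toy weights, `sequential_forward_selection(["c", "b", "a"], toy_score)` picks `a`, then `b`, and stops because adding `c` would lower the score. In practice the score would be replaced by a validation metric from the chosen classifier.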
Affiliation(s)
- Xiao Chen
  - School of Sciences, Anhui Agricultural University, Hefei, 230036 Anhui China
- Yi Xiong
  - School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240 China
- Yinbo Liu
  - School of Sciences, Anhui Agricultural University, Hefei, 230036 Anhui China
- Yuqing Chen
  - School of Sciences, Anhui Agricultural University, Hefei, 230036 Anhui China
- Shoudong Bi
  - School of Sciences, Anhui Agricultural University, Hefei, 230036 Anhui China
- Xiaolei Zhu
  - School of Sciences, Anhui Agricultural University, Hefei, 230036 Anhui China
9
Escapa IF, Huang Y, Chen T, Lin M, Kokaras A, Dewhirst FE, Lemon KP. Construction of habitat-specific training sets to achieve species-level assignment in 16S rRNA gene datasets. Microbiome 2020; 8:65. [PMID: 32414415 PMCID: PMC7291764 DOI: 10.1186/s40168-020-00841-w] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 02/18/2020] [Accepted: 04/15/2020] [Indexed: 05/10/2023]
Abstract
BACKGROUND The low cost of 16S rRNA gene sequencing facilitates population-scale molecular epidemiological studies. Existing computational algorithms can resolve 16S rRNA gene sequences into high-resolution amplicon sequence variants (ASVs), which represent consistent labels comparable across studies. Assigning these ASVs to species-level taxonomy strengthens the ecological and/or clinical relevance of 16S rRNA gene-based microbiota studies and further facilitates data comparison across studies. RESULTS To achieve this, we developed a broadly applicable method for constructing high-resolution training sets based on the phylogenic relationships among microbes found in a habitat of interest. When used with the naïve Bayesian Ribosomal Database Project (RDP) Classifier, this training set achieved species/supraspecies-level taxonomic assignment of 16S rRNA gene-derived ASVs. The key steps for generating such a training set are (1) constructing an accurate and comprehensive phylogenetic-based, habitat-specific database; (2) compiling multiple 16S rRNA gene sequences to represent the natural sequence variability of each taxon in the database; (3) trimming the training set to match the sequenced regions, if necessary; and (4) placing species sharing closely related sequences into a training-set-specific supraspecies taxonomic level to preserve subgenus-level resolution. As proof of principle, we developed a V1-V3 region training set for the bacterial microbiota of the human aerodigestive tract using the full-length 16S rRNA gene reference sequences compiled in our expanded Human Oral Microbiome Database (eHOMD). We also overcame technical limitations to successfully use Illumina sequences for the 16S rRNA gene V1-V3 region, the most informative segment for classifying bacteria native to the human aerodigestive tract. 
Finally, we generated a full-length eHOMD 16S rRNA gene training set, which we used in conjunction with an independent PacBio single molecule, real-time (SMRT)-sequenced sinonasal dataset to validate the representation of species in our training set. This also established the effectiveness of a full-length training set for assigning taxonomy of long-read 16S rRNA gene datasets. CONCLUSION Here, we present a systematic approach for constructing a phylogeny-based, high-resolution, habitat-specific training set that permits species/supraspecies-level taxonomic assignment to short- and long-read 16S rRNA gene-derived ASVs. This advancement enhances the ecological and/or clinical relevance of 16S rRNA gene-based microbiota studies.
Affiliation(s)
- Isabel F. Escapa
  - Forsyth Institute (Microbiology), Cambridge, MA USA
  - Department of Oral Medicine, Infection & Immunity, Harvard School of Dental Medicine, Boston, MA USA
  - Department of Molecular Virology & Microbiology, Alkek Center for Metagenomics & Microbiome Research, Baylor College of Medicine, Houston, TX USA
- Yanmei Huang
  - Forsyth Institute (Microbiology), Cambridge, MA USA
  - Department of Oral Medicine, Infection & Immunity, Harvard School of Dental Medicine, Boston, MA USA
- Tsute Chen
  - Forsyth Institute (Microbiology), Cambridge, MA USA
  - Department of Oral Medicine, Infection & Immunity, Harvard School of Dental Medicine, Boston, MA USA
- Maoxuan Lin
  - Forsyth Institute (Microbiology), Cambridge, MA USA
- Floyd E. Dewhirst
  - Forsyth Institute (Microbiology), Cambridge, MA USA
  - Department of Oral Medicine, Infection & Immunity, Harvard School of Dental Medicine, Boston, MA USA
- Katherine P. Lemon
  - Forsyth Institute (Microbiology), Cambridge, MA USA
  - Department of Molecular Virology & Microbiology, Alkek Center for Metagenomics & Microbiome Research, Baylor College of Medicine, Houston, TX USA
  - Division of Infectious Diseases, Boston Children’s Hospital, Harvard Medical School, Boston, MA USA
  - Section of Infectious Diseases, Department of Pediatrics, Texas Children’s Hospital and Baylor College of Medicine, Houston, TX USA
10
Hur M, Park SJ. Identification of Microbial Profiles in Heavy-Metal-Contaminated Soil from Full-Length 16S rRNA Reads Sequenced by a PacBio System. Microorganisms 2019; 7:357. [PMID: 31527468 PMCID: PMC6780547 DOI: 10.3390/microorganisms7090357] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 08/05/2019] [Revised: 09/10/2019] [Accepted: 09/13/2019] [Indexed: 11/16/2022]
Abstract
Heavy metal pollution is a serious environmental problem as it adversely affects crop production and human activity. In addition, the microbial community structure and composition are altered in heavy-metal-contaminated soils. In this study, using full-length 16S rRNA gene sequences obtained by a PacBio RS II system, we determined the microbial diversity and community structure in heavy-metal-contaminated soil. Furthermore, we investigated the microbial distribution, inferred their putative functional traits, and analyzed the environmental effects on the microbial compositions. The soil samples selected in this study were heavily and continuously contaminated with various heavy metals due to closed mines. We found that certain microorganisms (e.g., sulfur or iron oxidizers) play an important role in the biogeochemical cycle. Using phylogenetic investigation of communities by reconstruction of unobserved states (PICRUSt) analysis, we predicted Kyoto Encyclopedia of Genes and Genomes (KEGG) functional categories from abundances of microbial communities and revealed a high proportion belonging to transport, energy metabolism, and xenobiotic degradation in the studied sites. In addition, through full-length analysis, Conexibacter-like sequences, commonly identified by environmental metagenomics among the rare biosphere, were detected. In addition to microbial composition, we confirmed that environmental factors, including heavy metals, affect the microbial communities. Unexpectedly, among these environmental parameters, electrical conductivity (EC) might have more importance than other factors in a community description analysis.
Affiliation(s)
- Moonsuk Hur
  - Microorganism Resources Division, National Institute of Biological Resources, 42 Hwangyeong-ro, Incheon 22689, Korea
- Soo-Je Park
  - Department of Biology, Jeju National University, 102 Jejudaehak-ro, Jeju 63243, Korea
11
Meola M, Rifa E, Shani N, Delbès C, Berthoud H, Chassard C. DAIRYdb: a manually curated reference database for improved taxonomy annotation of 16S rRNA gene sequences from dairy products. BMC Genomics 2019; 20:560. [PMID: 31286860 PMCID: PMC6615214 DOI: 10.1186/s12864-019-5914-8] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 08/07/2018] [Accepted: 06/18/2019] [Indexed: 12/14/2022]
Abstract
Background: Read assignment to taxonomic units is a key step in microbiome analysis pipelines. To date, accurate taxonomy annotation of 16S reads, particularly at species rank, is still challenging due to the short length of read sequences and differently curated classification databases. The close phylogenetic relationship between species encountered in dairy products, however, makes it crucial to annotate species accurately to achieve sufficient phylogenetic resolution for further downstream ecological studies or for food diagnostics. Curated databases dedicated to the environment of interest are expected to improve the accuracy and resolution of taxonomy annotation. Results: We provide a manually curated database composed of 10,290 full-length 16S rRNA gene sequences from prokaryotes tailored for dairy products analysis (https://github.com/marcomeola/DAIRYdb). The performance of the DAIRYdb was compared with the universal databases Silva, LTP, RDP and Greengenes. The DAIRYdb significantly outperformed all other databases independently of the classification algorithm by enabling more accurate taxonomy annotation down to the species rank. The DAIRYdb accurately annotates over 90% of the sequences of either single or paired hypervariable regions automatically. The manually curated DAIRYdb strongly improves taxonomic annotation accuracy for microbiome studies in dairy environments. The DAIRYdb is a practical solution that enables automation of this key step, thus facilitating the routine application of NGS microbiome analyses for microbial ecology studies and diagnostics in dairy products.
Affiliation(s)
- Marco Meola
  - Agroscope, Competence Division Methods Development and Analytics, Research Group Fermenting Organisms, Schwarzenburgstrasse 161, Bern, 3003, Switzerland
- Etienne Rifa
  - Université Clermont Auvergne, INRA, VetAgro Sup, UMRF, 20 côte de Reyne, Aurillac, 15000, France
- Noam Shani
  - Agroscope, Competence Division Methods Development and Analytics, Research Group Fermenting Organisms, Schwarzenburgstrasse 161, Bern, 3003, Switzerland
- Céline Delbès
  - Université Clermont Auvergne, INRA, VetAgro Sup, UMRF, 20 côte de Reyne, Aurillac, 15000, France
- Hélène Berthoud
  - Agroscope, Competence Division Methods Development and Analytics, Research Group Fermenting Organisms, Schwarzenburgstrasse 161, Bern, 3003, Switzerland
- Christophe Chassard
  - Université Clermont Auvergne, INRA, VetAgro Sup, UMRF, 20 côte de Reyne, Aurillac, 15000, France
12
Taxonomy based performance metrics for evaluating taxonomic assignment methods. BMC Bioinformatics 2019; 20:310. [PMID: 31185897 PMCID: PMC6561758 DOI: 10.1186/s12859-019-2896-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/08/2018] [Accepted: 05/13/2019] [Indexed: 02/01/2023]
Abstract
Background: Metagenomics experiments often make inferences about microbial communities by sequencing 16S and 18S rRNA, and taxonomic assignment is a fundamental step in such studies. This paper addresses the weaknesses in two types of metrics commonly used by previous studies for measuring the performance of existing taxonomic assignment methods: sequence count based metrics and binary error measurement. These metrics made performance evaluation results biased, less informative and mutually incomparable. Results: We investigated weaknesses in two types of metrics and proposed new performance metrics including Average Taxonomy Distance (ATD) and ATD_by_Taxa, together with the visualized ATD plot. Conclusions: By comparing the evaluation results from four popular taxonomic assignment methods across three test data sets, we found the new metrics more robust, informative and comparable.
|
13
|
Zhao X, Zhang Y, Ning Q, Zhang H, Ji J, Yin M. Identifying N6-methyladenosine sites using extreme gradient boosting system optimized by particle swarm optimizer. J Theor Biol 2019; 467:39-47. [DOI: 10.1016/j.jtbi.2019.01.035] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 12/13/2018] [Revised: 01/04/2019] [Accepted: 01/30/2019] [Indexed: 01/15/2023]
|
14
|
He J, Fang T, Zhang Z, Huang B, Zhu X, Xiong Y. PseUI: Pseudouridine sites identification based on RNA sequence information. BMC Bioinformatics 2018; 19:306. [PMID: 30157750 PMCID: PMC6114832 DOI: 10.1186/s12859-018-2321-0] [Citation(s) in RCA: 80] [Impact Index Per Article: 13.3] [Received: 04/19/2018] [Accepted: 08/21/2018] [Indexed: 01/28/2023] Open
Abstract
Background: Pseudouridylation is the most prevalent type of posttranscriptional modification in various stable RNAs of all organisms, and it significantly affects many cellular processes that are regulated by RNA. Thus, accurate identification of pseudouridine (Ψ) sites in RNA will be of great benefit for understanding these cellular processes. Due to the low efficiency and high cost of currently available experimental methods, it is highly desirable to develop computational methods for accurately and efficiently detecting Ψ sites in RNA sequences. However, the predictive accuracy of existing computational methods is not satisfactory and still needs improvement. Results: In this study, we developed a new model, PseUI, for Ψ site identification in three species: H. sapiens, S. cerevisiae, and M. musculus. Firstly, five different kinds of features including nucleotide composition (NC), dinucleotide composition (DC), pseudo dinucleotide composition (pseDNC), position-specific nucleotide propensity (PSNP), and position-specific dinucleotide propensity (PSDP) were generated based on RNA segments. Then, a sequential forward feature selection strategy was used to gain an effective feature subset with a compact representation but discriminative prediction power. Based on the selected feature subsets, we built our model by using a support vector machine (SVM). Finally, the generalization of our model was validated by both the jackknife test and independent validation tests on the benchmark datasets. The experimental results showed that our model is more accurate and stable than the previously published models. We have also provided a user-friendly web server for our model at http://zhulab.ahu.edu.cn/PseUI, and a brief instruction for the web server is provided in this paper. By using this instruction, academic users can conveniently get their desired results without complicated calculations. Conclusion: In this study, we proposed a new predictor, PseUI, to detect Ψ sites in RNA sequences. It is shown that our model outperformed the existing state-of-the-art models. It is expected that our model, PseUI, will become a useful tool for accurate identification of RNA Ψ sites.
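The two simplest feature types named in this abstract, nucleotide composition (NC) and dinucleotide composition (DC), can be illustrated with a minimal sketch. This is not the PseUI implementation: the function names, the `ACGU` alphabet ordering, and the choice to normalize by the number of positions are assumptions made for illustration only.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet and ordering (illustrative)
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 16 dinucleotides


def nucleotide_composition(seq):
    """Return the frequency of each nucleotide in seq (4 values)."""
    n = len(seq)
    return [seq.count(base) / n for base in ALPHABET]


def dinucleotide_composition(seq):
    """Return the frequency of each overlapping dinucleotide in seq (16 values)."""
    total = len(seq) - 1  # number of overlapping dinucleotide positions
    counts = {pair: 0 for pair in PAIRS}
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip dinucleotides containing other symbols
            counts[pair] += 1
    return [counts[pair] / total for pair in PAIRS]


# Hypothetical RNA segment used only as an example input.
segment = "AUGGCU"
nc = nucleotide_composition(segment)
dc = dinucleotide_composition(segment)
```

Concatenating `nc` and `dc` gives a 20-dimensional vector of the kind that could be passed to a feature selection step and a classifier; the remaining feature types in the abstract depend on definitions not given here.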
Affiliation(s)
- Jingjing He
- School of Life Sciences, Anhui University, Hefei, 230601, Anhui, China
- Ting Fang
- School of Life Sciences, Anhui University, Hefei, 230601, Anhui, China
- Zizheng Zhang
- School of Life Sciences, Anhui University, Hefei, 230601, Anhui, China
- Bei Huang
- School of Life Sciences, Anhui University, Hefei, 230601, Anhui, China
- Xiaolei Zhu
- School of Life Sciences, Anhui University, Hefei, 230601, Anhui, China
- Yi Xiong
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
|
15
|
Murali A, Bhargava A, Wright ES. IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences. Microbiome 2018; 6:140. [PMID: 30092815 PMCID: PMC6085705 DOI: 10.1186/s40168-018-0521-5] [Citation(s) in RCA: 249] [Impact Index Per Article: 41.5] [Received: 03/21/2018] [Accepted: 07/25/2018] [Indexed: 05/11/2023]
Abstract
Background: Microbiome studies often involve sequencing a marker gene to identify the microorganisms in samples of interest. Sequence classification is a critical component of this process, whereby sequences are assigned to a reference taxonomy containing known sequence representatives of many microbial groups. Previous studies have shown that existing classification programs often assign sequences to reference groups even if they belong to novel taxonomic groups that are absent from the reference taxonomy. This high rate of "over classification" is particularly detrimental in microbiome studies because reference taxonomies are far from comprehensive. Results: Here, we introduce IDTAXA, a novel approach to taxonomic classification that employs principles from machine learning to reduce over classification errors. Using multiple reference taxonomies, we demonstrate that IDTAXA has higher accuracy than popular classifiers such as BLAST, MAPSeq, QIIME, SINTAX, SPINGO, and the RDP Classifier. Similarly, IDTAXA yields far fewer over classifications on Illumina mock microbial community data when the expected taxa are absent from the training set. Furthermore, IDTAXA offers many practical advantages over other classifiers, such as maintaining low error rates across varying input sequence lengths and withholding classifications from input sequences composed of random nucleotides or repeats. Conclusions: IDTAXA's classifications may lead to different conclusions in microbiome studies because of the substantially reduced number of taxa that are incorrectly identified through over classification. Although misclassification error is relatively minor, we believe that many remaining misclassifications are likely caused by errors in the reference taxonomy. We describe how IDTAXA is able to identify many putative mislabeling errors in reference taxonomies, enabling training sets to be automatically corrected by eliminating spurious sequences. IDTAXA is part of the DECIPHER package for the R programming language, available through the Bioconductor repository or accessible online (http://DECIPHER.codes).
Affiliation(s)
- Adithya Murali
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53715 USA
- Aniruddha Bhargava
- Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI 53715 USA
- Erik S. Wright
- Department of Biomedical Informatics, Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, 426 Bridgeside Point II, 450 Technology Dr, Pittsburgh, PA 15219 USA
|
16
|
McGovern E, Waters SM, Blackshields G, McCabe MS. Evaluating Established Methods for Rumen 16S rRNA Amplicon Sequencing With Mock Microbial Populations. Front Microbiol 2018; 9:1365. [PMID: 29988486 PMCID: PMC6026621 DOI: 10.3389/fmicb.2018.01365] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 02/12/2018] [Accepted: 06/05/2018] [Indexed: 11/22/2022] Open
Abstract
The rumen microbiome scientific community has utilized amplicon sequencing as an aid in identifying potential community compositional trends that could be used as an estimation of various production and performance traits including methane emission, animal protein production efficiency, and ruminant health status. In order to translate rumen microbiome studies into executable application, there is a need for experimental and analytical concordance within the community. The objective of this study was to assess these factors in relation to selected currently established methods for 16S phylogenetic community analysis on a microbial community standard (MC) and a DNA standard (DS; ZymoBIOMICS™). DNA was extracted from MC using the RBBC method commonly used for microbial DNA extraction from rumen digesta samples. 16S rRNA amplicon libraries were generated for the MC and DS using primers routinely used for rumen bacterial and archaeal community analysis. The primers targeted the V4 and V3–V4 region of the 16S rRNA gene and samples were subjected to both 20 and 28 polymerase chain reaction (PCR) cycles under identical cycle conditions. Sequencing was conducted using the Illumina MiSeq platform. As the bacteria contained in the microbial mock community were well-classified species, and for ease of explanation, we used the results of the Basic Local Alignment Search Tool classification to assess the DNA, PCR cycle number, and primer type. Sequence classification methodology was assessed independently. Spearman’s correlation analysis indicated that utilizing the repeated bead beating and column method for DNA extraction in combination with primers targeting the 16S rRNA gene using 20 first-round PCR cycles was sufficient for amplicon sequencing to generate a relatively accurate depiction of the bacterial communities present in rumen samples. These results also emphasize the requirement to develop and utilize positive mock community controls for all rumen microbiomic studies in order to discern errors which may arise at any step during a next-generation sequencing protocol.
Affiliation(s)
- Emily McGovern
- Animal and Bioscience Research Department, Animal and Grassland Research and Innovation Centre, Teagasc, Carlow, Ireland; UCD College of Health and Agricultural Sciences, University College Dublin, Dublin, Ireland
- Sinéad M Waters
- Animal and Bioscience Research Department, Animal and Grassland Research and Innovation Centre, Teagasc, Carlow, Ireland
- Gordon Blackshields
- Animal and Bioscience Research Department, Animal and Grassland Research and Innovation Centre, Teagasc, Carlow, Ireland
- Matthew S McCabe
- Animal and Bioscience Research Department, Animal and Grassland Research and Innovation Centre, Teagasc, Carlow, Ireland
|
17
|
McAllister T, Dunière L, Drouin P, Xu S, Wang Y, Munns K, Zaheer R. Silage review: Using molecular approaches to define the microbial ecology of silage. J Dairy Sci 2018; 101:4060-4074. [DOI: 10.3168/jds.2017-13704] [Citation(s) in RCA: 66] [Impact Index Per Article: 11.0] [Received: 08/18/2017] [Accepted: 10/21/2017] [Indexed: 12/11/2022]
|
18
|
Zhang J, Guo J, Zhang M, Yu X, Yu X, Guo W, Zeng T, Chen L. Efficient Mining Multi-mers in a Variety of Biological Sequences. IEEE/ACM Trans Comput Biol Bioinform 2018; 17:949-958. [PMID: 29993642 DOI: 10.1109/tcbb.2018.2828313] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 06/08/2023]
Abstract
Counting the occurrence frequency of each k-mer in a biological sequence is a preliminary yet important step in many bioinformatics applications. However, most k-mer counting algorithms rely on a given k to produce single-length k-mers, which is inefficient for sequence analysis for different k. Moreover, existing k-mer counters focus more on DNA and RNA sequences and less on protein ones. In practice, the analysis of k-mers in protein sequences can provide substantial biological insights in structure, function and evolution. To this end, an efficient algorithm, called MulMer (Multiple-Mer mining), is proposed to mine k-mers of various lengths termed multi-mers via inverted-index technique, which is orders of magnitude faster than the conventional forward-index methods. Moreover, to the best of our knowledge, MulMer is the first able to mine multi-mers in a variety of sequences, including DNA, RNA and protein sequences.
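To make the task concrete, the following is a minimal sketch of counting k-mers for a range of lengths at once. It is the naive forward-scan baseline that the abstract contrasts against, not the MulMer inverted-index algorithm; the function name `count_multimers` and the example sequence are hypothetical.

```python
from collections import Counter


def count_multimers(seq, k_min, k_max):
    """Count every overlapping substring of length k_min..k_max in seq.

    Returns a dict mapping each k to a Counter of its k-mers.
    This rescans the sequence once per k (a forward scan), which is
    the inefficiency the abstract attributes to single-length counters.
    """
    counts = {}
    for k in range(k_min, k_max + 1):
        counts[k] = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts


# Works on any string alphabet (DNA, RNA or protein letters).
counts = count_multimers("ACGTACGT", 2, 3)
```

Because the function only slices strings, the same call applies unchanged to protein sequences, which is the generality the abstract emphasizes.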
|
19
|
Liland KH, Vinje H, Snipen L. microclass: an R-package for 16S taxonomy classification. BMC Bioinformatics 2017; 18:172. [PMID: 28302051 PMCID: PMC5353803 DOI: 10.1186/s12859-017-1583-2] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 08/21/2016] [Accepted: 03/03/2017] [Indexed: 11/10/2022] Open
Abstract
Background: Taxonomic classification based on the 16S rRNA gene sequence is important for the profiling of microbial communities. In addition to giving the best possible accuracy, it is also important to quantify uncertainties in the classifications. Results: We present an R package with tools for making such classifications, where the heavy computations are implemented in C++ but operated through the standard R interface. The user may train classifiers based on specialized data sets, but we also supply a ready-to-use function trained on a comprehensive training data set designed specifically for this purpose. This tool also includes some novel ways to quantify uncertainties in the classifications. Conclusions: Based on input sequences of varying length and quality, we demonstrate how the output from the classifications can be used to obtain high quality taxonomic assignments from 16S sequences within the R computing environment. The package is publicly available at the Comprehensive R Archive Network.
Affiliation(s)
- Kristian Hovde Liland
- Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Ås, P.O. Box 5003, N-1432, Norway; Nofima - Norwegian Institute of Food, Fisheries and Aquaculture Research, Osloveien 1, Ås, N-1430, Norway
- Hilde Vinje
- Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Ås, P.O. Box 5003, N-1432, Norway
- Lars Snipen
- Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Ås, P.O. Box 5003, N-1432, Norway
|
20
|
Li GQ, Liu Z, Shen HB, Yu DJ. TargetM6A: Identifying N6-Methyladenosine Sites From RNA Sequences via Position-Specific Nucleotide Propensities and a Support Vector Machine. IEEE Trans Nanobioscience 2016; 15:674-682. [DOI: 10.1109/tnb.2016.2599115] [Citation(s) in RCA: 57] [Impact Index Per Article: 7.1] [Indexed: 11/05/2022]
|
21
|
de la Cuesta-Zuluaga J, Escobar JS. Considerations For Optimizing Microbiome Analysis Using a Marker Gene. Front Nutr 2016; 3:26. [PMID: 27551678 PMCID: PMC4976105 DOI: 10.3389/fnut.2016.00026] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 05/25/2016] [Accepted: 07/26/2016] [Indexed: 12/22/2022] Open
Abstract
Next-generation sequencing technologies have found widespread use in the study of host–microbe interactions due to the increase in their throughput and their ever-decreasing costs. The analysis of human-associated microbial communities using a marker gene, particularly the 16S rRNA, has greatly benefited from these technologies – human gut microbiome research being a remarkable example of such analysis that has greatly expanded our understanding of microbe-mediated human health and disease, metabolism, and food absorption. 16S studies go through a series of in vitro and in silico steps that can greatly influence their outcomes. However, the lack of a standardized workflow has led to uncertainties regarding the transparency and reproducibility of gut microbiome studies. Here, we discuss the most common challenges in the archetypical 16S rRNA workflow, including the extraction of total DNA, its use as template in PCR with primers that amplify specific hypervariable regions of the gene, amplicon sequencing, the denoising and removal of low-quality reads, the detection and removal of chimeric sequences, the clustering of high-quality sequences into operational taxonomic units, and their taxonomic classification. We recommend the essential technical information that should be conveyed in publications for reproducibility of results and encourage non-experts to include procedures and available tools that mitigate most of the problems encountered in microbiome analysis.
Affiliation(s)
- Juan S Escobar
- Vidarium - Nutrition, Health and Wellness Research Center, Grupo Empresarial Nutresa, Medellín, Colombia
|
22
|
Myer PR, Kim M, Freetly HC, Smith TPL. Evaluation of 16S rRNA amplicon sequencing using two next-generation sequencing technologies for phylogenetic analysis of the rumen bacterial community in steers. J Microbiol Methods 2016; 127:132-140. [PMID: 27282101 DOI: 10.1016/j.mimet.2016.06.004] [Citation(s) in RCA: 52] [Impact Index Per Article: 6.5] [Received: 03/28/2016] [Revised: 06/03/2016] [Accepted: 06/03/2016] [Indexed: 11/16/2022]
Abstract
Next generation sequencing technologies have vastly changed the approach of sequencing of the 16S rRNA gene for studies in microbial ecology. Three distinct technologies are available for large-scale 16S sequencing. All three are subject to biases introduced by sequencing error rates, amplification primer selection, and read length, which can affect the apparent microbial community. In this study, we compared short read 16S rRNA variable regions, V1-V3, with that of near-full length 16S regions, V1-V8, using highly diverse steer rumen microbial communities, in order to examine the impact of technology selection on phylogenetic profiles. Short paired-end reads from the Illumina MiSeq platform were used to generate V1-V3 sequence, while long "circular consensus" reads from the Pacific Biosciences RSII instrument were used to generate V1-V8 data. The two platforms revealed similar microbial operational taxonomic units (OTUs), as well as similar species richness, Good's coverage, and Shannon diversity metrics. However, the V1-V8 amplified ruminal community resulted in significant increases in several orders of taxa, such as phyla Proteobacteria and Verrucomicrobia (P < 0.05). Taxonomic classification accuracy was also greater in the near full-length read. UniFrac distance matrices using jackknifed UPGMA clustering also noted differences between the communities. These data support the consensus that longer reads result in a finer phylogenetic resolution that may not be achieved by shorter 16S rRNA gene fragments. Our work on the cattle rumen bacterial community demonstrates that utilizing near full-length 16S reads may be useful in conducting a more thorough study, or for developing a niche-specific database to use in analyzing data from shorter read technologies when budgetary constraints preclude use of near-full length 16S sequencing.
Affiliation(s)
- Phillip R Myer
- Department of Animal Science, University of Tennessee Institute of Agriculture, University of Tennessee, Knoxville, TN 37996
- MinSeok Kim
- USDA-ARS, U.S. Meat Animal Research Center, Clay Center, NE 68933
- Harvey C Freetly
- USDA-ARS, U.S. Meat Animal Research Center, Clay Center, NE 68933
- Timothy P L Smith
- USDA-ARS, U.S. Meat Animal Research Center, Clay Center, NE 68933
|