Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Cai Y, Zheng W, Yao J, Yang Y, Mai V, Mao Q, Sun Y. ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time. PLoS Comput Biol 2017;13:e1005518. [PMID: 28437450 PMCID: PMC5421816 DOI: 10.1371/journal.pcbi.1005518] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2016] [Revised: 05/08/2017] [Accepted: 04/13/2017] [Indexed: 12/30/2022] Open

For:	Cai Y, Zheng W, Yao J, Yang Y, Mai V, Mao Q, Sun Y. ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time. PLoS Comput Biol 2017;13:e1005518. [PMID: 28437450 PMCID: PMC5421816 DOI: 10.1371/journal.pcbi.1005518] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2016] [Revised: 05/08/2017] [Accepted: 04/13/2017] [Indexed: 12/30/2022] Open

Number

Cited by Other Article(s)

Wright E. Accurately clustering biological sequences in linear time by relatedness sorting. Nat Commun 2024;15:3047. [PMID: 38589369 PMCID: PMC11001989 DOI: 10.1038/s41467-024-47371-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2022] [Accepted: 03/28/2024] [Indexed: 04/10/2024] Open

Chen J, Yang L, Li L, Goodison S, Sun Y. Alignment-free comparison of metagenomics sequences via approximate string matching. BIOINFORMATICS ADVANCES 2022;2:vbac077. [PMID: 36388153 PMCID: PMC9645238 DOI: 10.1093/bioadv/vbac077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 07/04/2022] [Revised: 09/16/2022] [Accepted: 10/19/2022] [Indexed: 11/11/2022]

Wei ZG, Zhang XD, Cao M, Liu F, Qian Y, Zhang SW. Comparison of Methods for Picking the Operational Taxonomic Units From Amplicon Sequences. Front Microbiol 2021;12:644012. [PMID: 33841367 PMCID: PMC8024490 DOI: 10.3389/fmicb.2021.644012] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2020] [Accepted: 02/17/2021] [Indexed: 12/31/2022] Open

Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. PROGRESS IN MOLECULAR BIOLOGY AND TRANSLATIONAL SCIENCE 2020;171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]

Abstract

Correlation and association analyses are one of the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approaches. Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and future direction of association analysis in microbiome and multiomics studies.

Collapse

Wei ZG, Zhang SW. DMSC: A Dynamic Multi-Seeds Method for Clustering 16S rRNA Sequences Into OTUs. Front Microbiol 2019;10:428. [PMID: 30915052 PMCID: PMC6422886 DOI: 10.3389/fmicb.2019.00428] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2019] [Accepted: 02/19/2019] [Indexed: 12/30/2022] Open

Abstract

Next-generation sequencing (NGS)-based 16S rRNA sequencing by jointly using the PCR amplification and NGS technology is a cost-effective technique, which has been successfully used to study the phylogeny and taxonomy of samples from complex microbiomes or environments. Clustering 16S rRNA sequences into operational taxonomic units (OTUs) is often the first step for many downstream analyses. Heuristic clustering is one of the most widely employed approaches for generating OTUs. However, most heuristic OTUs clustering methods just select one single seed sequence to represent each cluster, resulting in their outcomes suffer from either overestimation of OTUs number or sensitivity to sequencing errors. In this paper, we present a novel dynamic multi-seeds clustering method (namely DMSC) to pick OTUs. DMSC first heuristically generates clusters according to the distance threshold. When the size of a cluster reaches the pre-defined minimum size, then DMSC selects the multi-core sequences (MCS) as the seeds that are defined as the n-core sequences (n ≥ 3), in which the distance between any two sequences is less than the distance threshold. A new sequence is assigned to the corresponding cluster depending on the average distance to MCS and the distance standard deviation within the MCS. If a new sequence is added to the cluster, dynamically update the MCS until no sequence is merged into the cluster. The new method DMSC was tested on several simulated and real-life sequence datasets and also compared with the traditional heuristic methods such as CD-HIT, UCLUST, and DBH. Experimental results in terms of the inferred OTUs number, normalized mutual information (NMI) and Matthew correlation coefficient (MCC) metrics demonstrate that DMSC can produce higher quality clusters with low memory usage and reduce OTU overestimation. Additionally, DMSC is also robust to the sequencing errors. The DMSC software can be freely downloaded from https://github.com/NWPU-903PR/DMSC.

Collapse

Zheng W, Mao Q, Genco RJ, Wactawski-Wende J, Buck M, Cai Y, Sun Y. A parallel computational framework for ultra-large-scale sequence clustering analysis. Bioinformatics 2019;35:380-388. [PMID: 30010718 PMCID: PMC6931356 DOI: 10.1093/bioinformatics/bty617] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2017] [Revised: 06/14/2018] [Accepted: 07/11/2018] [Indexed: 12/30/2022] Open

Zheng W, Yang L, Genco RJ, Wactawski-Wende J, Buck M, Sun Y. SENSE: Siamese neural network for sequence embedding and alignment-free comparison. Bioinformatics 2018;35:1820-1828. [PMID: 30346493 PMCID: PMC7963080 DOI: 10.1093/bioinformatics/bty887] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2018] [Revised: 09/04/2018] [Accepted: 10/18/2018] [Indexed: 02/06/2023] Open

Zou Q, Lin G, Jiang X, Liu X, Zeng X. Sequence clustering in bioinformatics: an empirical study. Brief Bioinform 2018;21:1-10. [PMID: 30239587 DOI: 10.1093/bib/bby090] [Citation(s) in RCA: 72] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2018] [Revised: 08/18/2018] [Accepted: 08/18/2018] [Indexed: 12/13/2022] Open

Asgari E, Garakani K, McHardy AC, Mofrad MRK. MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples. Bioinformatics 2018;34:i32-i42. [PMID: 29950008 PMCID: PMC6022683 DOI: 10.1093/bioinformatics/bty296] [Citation(s) in RCA: 45] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023] Open

Abstract

Motivation

Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient and accurate methods for rapid detection or diagnosis with proved applications in medicine, agriculture and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing based on k-mer representations that benefits from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods as well as classical approaches were explored for predicting environments and host phenotypes.

Results

A k-mer distribution of shallow sub-samples outperformed Operational Taxonomic Unit (OTU) features in the tasks of body-site identification and Crohn's disease prediction. Aside from being more accurate, using k-mer features in shallow sub-samples allows (i) skipping computationally costly sequence alignments required in OTU-picking and (ii) provided a proof of concept for the sufficiency of shallow and short-length 16S rRNA sequencing for phenotype prediction. In addition, k-mer features predicted representative 16S rRNA gene sequences of 18 ecological environments, and 5 organismal environments with high macro-F1 scores of 0.88 and 0.87. For large datasets, deep learning outperformed classical methods such as Random Forest and Support Vector Machine.

Availability and implementation

The software and datasets are available at https://llp.berkeley.edu/micropheno.

Supplementary information

Supplementary data are available at Bioinformatics online.

Collapse