1
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
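As an illustration of the two ideas the abstract names (k-mer counting and absent sequences), here is a minimal Python sketch. The function names `count_kmers` and `absent_kmers` are hypothetical, and the example assumes a single DNA strand over the alphabet ACGT; the nullomers discussed in the review are defined relative to whole genomes, so this is only a toy version of the concept.

```python
from collections import Counter
from itertools import product


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def absent_kmers(seq: str, k: int) -> list[str]:
    """Return the k-mers over ACGT that never occur in seq (toy analogue of nullomers)."""
    present = count_kmers(seq, k)
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted(kmer for kmer in all_kmers if kmer not in present)
```

For the sequence `ACGTACGT` with k = 2, only four of the sixteen possible dinucleotides occur, so `absent_kmers` returns the remaining twelve.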
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
2
Reynolds G, Mumey B, Strnadova-Neeley V, Lachowiec J. Hijacking a rapid and scalable metagenomic method reveals subgenome dynamics and evolution in polyploid plants. Applications in Plant Sciences 2024; 12:e11581. [PMID: 39184200 PMCID: PMC11342227 DOI: 10.1002/aps3.11581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 11/26/2023] [Accepted: 12/20/2023] [Indexed: 08/27/2024]
Abstract
Premise: The genomes of polyploid plants archive the evolutionary events leading to their present forms. However, plant polyploid genomes present numerous hurdles to the genome comparison algorithms for classification of polyploid types and exploring genome dynamics. Methods: Here, the problem of intra- and inter-genome comparison for examining polyploid genomes is reframed as a metagenomic problem, enabling the use of the rapid and scalable MinHashing approach. To determine how types of polyploidy are described by this metagenomic approach, plant genomes were examined from across the polyploid spectrum for both k-mer composition and frequency with a range of k-mer sizes. In this approach, no subgenome-specific k-mers are identified; rather, whole-chromosome k-mer subspaces were utilized. Results: Given chromosome-scale genome assemblies with sufficient subgenome-specific repetitive element content, literature-verified subgenomic and genomic evolutionary relationships were revealed, including distinguishing auto- from allopolyploidy and putative progenitor genome assignment. The sequences responsible were the rapidly evolving landscape of transposable elements. An investigation into the MinHashing parameters revealed that the downsampled k-mer space (genomic signatures) produced excellent approximations of sequence similarity. Furthermore, the clustering approach used for comparison of the genomic signatures is scrutinized to ensure applicability of the metagenomics-based method. Discussion: The easily implementable and highly computationally efficient MinHashing-based sequence comparison strategy enables comparative subgenomics and genomics for large and complex polyploid plant genomes. Such comparisons provide evidence for polyploidy-type subgenomic assignments. In cases where subgenome-specific repeat signal may not be adequate given a chromosome's global k-mer profile, alternative methods that are more specific but more computationally complex outperform this approach.
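The MinHashing idea mentioned in the abstract, a downsampled k-mer space that approximates sequence similarity, can be sketched as follows. This is a generic bottom-s MinHash written for illustration, not the authors' pipeline or any specific tool; the helper names (`kmer_set`, `sketch`, `jaccard_estimate`) and the choice of BLAKE2b as hash function are assumptions.

```python
import hashlib


def kmer_set(seq: str, k: int) -> set[str]:
    """All distinct overlapping k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq: str, k: int, size: int) -> list[int]:
    """Bottom-s sketch: the `size` smallest hash values among the sequence's k-mers."""
    return sorted(hash_kmer(x) for x in kmer_set(seq, k))[:size]


def jaccard_estimate(a: list[int], b: list[int], size: int) -> float:
    """Estimate Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(a) | set(b))[:size]  # bottom-s sketch of the union
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for v in merged if v in shared) / len(merged)
```

When the sketch size is at least the number of distinct k-mers, the estimate coincides with the exact Jaccard similarity; smaller sketches trade accuracy for memory, which is the downsampling the abstract refers to.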
Affiliation(s)
- Gillian Reynolds
  - Plant Sciences and Plant Pathology Department, Montana State University, Bozeman, Montana 59717, USA
  - Gianforte School of Computing, Montana State University, Bozeman, Montana 59717, USA
- Brendan Mumey
  - Gianforte School of Computing, Montana State University, Bozeman, Montana 59717, USA
- Jennifer Lachowiec
  - Plant Sciences and Plant Pathology Department, Montana State University, Bozeman, Montana 59717, USA
3
Roberts M, Josephs EB. Previously unmeasured genetic diversity explains part of Lewontin's paradox in a k-mer-based meta-analysis of 112 plant species. bioRxiv 2024:2024.05.17.594778. [PMID: 38798362 PMCID: PMC11118579 DOI: 10.1101/2024.05.17.594778] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
At the molecular level, most evolution is expected to be neutral. A key prediction of this expectation is that the level of genetic diversity in a population should scale with population size. However, as was noted by Richard Lewontin in 1974 and reaffirmed by later studies, the slope of the population size-diversity relationship in nature is much weaker than expected under neutral theory. We hypothesize that one contributor to this paradox is that current methods relying on single nucleotide polymorphisms (SNPs) called from aligning short reads to a reference genome underestimate levels of genetic diversity in many species. To test this idea, we calculated nucleotide diversity (π) and k-mer-based metrics of genetic diversity across 112 plant species, amounting to over 205 terabases of DNA sequencing data from 27,488 individual plants. We then compared how these different metrics correlated with proxies of population size that account for both range size and population density variation across species. We found that our population size proxies scaled anywhere from about 3 to over 20 times faster with k-mer diversity than nucleotide diversity after adjusting for evolutionary history, mating system, life cycle habit, cultivation status, and invasiveness. The relationship between k-mer diversity and population size proxies also remains significant after correcting for genome size, whereas the analogous relationship for nucleotide diversity does not. These results suggest that variation not captured by common SNP-based analyses explains part of Lewontin's paradox in plants.
4
Tian Q, Zhang P, Zhai Y, Wang Y, Zou Q. Application and Comparison of Machine Learning and Database-Based Methods in Taxonomic Classification of High-Throughput Sequencing Data. Genome Biol Evol 2024; 16:evae102. [PMID: 38748485 PMCID: PMC11135637 DOI: 10.1093/gbe/evae102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/12/2024] [Indexed: 05/30/2024] Open
Abstract
The advent of high-throughput sequencing technologies has not only revolutionized the field of bioinformatics but has also heightened the demand for efficient taxonomic classification. Despite technological advancements, efficiently processing and analyzing the deluge of sequencing data for precise taxonomic classification remains a formidable challenge. Existing classification approaches primarily fall into two categories, database-based methods and machine learning methods, each presenting its own set of challenges and advantages. On this basis, the aim of our study was to conduct a comparative analysis between these two methods while also investigating the merits of integrating multiple database-based methods. Through an in-depth comparative study, we evaluated the performance of both methodological categories in taxonomic classification by utilizing simulated data sets. Our analysis revealed that database-based methods excel in classification accuracy when backed by a rich and comprehensive reference database. Conversely, while machine learning methods show superior performance in scenarios where reference sequences are sparse or lacking, they generally show inferior performance compared with database methods under most conditions. Moreover, our study confirms that integrating multiple database-based methods does, in fact, enhance classification accuracy. These findings shed new light on the taxonomic classification of high-throughput sequencing data and bear substantial implications for the future development of computational biology. For those interested in further exploring our methods, the source code of this study is publicly available on https://github.com/LoadStar822/Genome-Classifier-Performance-Evaluator. Additionally, a dedicated webpage showcasing our collected database, data sets, and various classification software can be found at http://lab.malab.cn/~tqz/project/taxonomic/.
Affiliation(s)
- Qinzhong Tian
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
- Pinglu Zhang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
- Yixiao Zhai
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
- Yansu Wang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
5
Ponsero AJ, Miller M, Hurwitz BL. Comparison of k-mer-based de novo comparative metagenomic tools and approaches. Microbiome Research Reports 2023; 2:27. [PMID: 38058765 PMCID: PMC10696585 DOI: 10.20517/mrr.2023.26] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Revised: 06/28/2023] [Accepted: 07/12/2023] [Indexed: 12/08/2023]
Abstract
Aim: Comparative metagenomic analysis requires measuring a pairwise similarity between metagenomes in the dataset. Reference-based methods that compute a beta-diversity distance between two metagenomes are highly dependent on the quality and completeness of the reference database, and their application on less studied microbiota can be challenging. On the other hand, de-novo comparative metagenomic methods only rely on the sequence composition of metagenomes to compare datasets. While each one of these approaches has its strengths and limitations, their comparison is currently limited. Methods: We developed sets of simulated short-reads metagenomes to (1) compare k-mer-based and taxonomy-based distances and evaluate the impact of technical and biological variables on these metrics and (2) evaluate the effect of k-mer sketching and filtering. We used a real-world metagenomic dataset to provide an overview of the currently available tools for de novo metagenomic comparative analysis. Results: Using simulated metagenomes of known composition and controlled error rate, we showed that k-mer-based distance metrics were well correlated to the taxonomic distance metric for quantitative Beta-diversity metrics, but the correlation was low for presence/absence distances. The community complexity in terms of taxa richness and the sequencing depth significantly affected the quality of the k-mer-based distances, while the impact of low amounts of sequence contamination and sequencing error was limited. Finally, we benchmarked currently available de-novo comparative metagenomic tools and compared their output on two datasets of fecal metagenomes and showed that most k-mer-based tools were able to recapitulate the data structure observed using taxonomic approaches. 
Conclusion: This study expands our understanding of the strength and limitations of k-mer-based de novo comparative metagenomic approaches and aims to provide concrete guidelines for researchers interested in applying these approaches to their metagenomic datasets.
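The contrast the abstract draws between quantitative and presence/absence k-mer distances can be made concrete with a small Python sketch. The functions below are illustrative only (the names are hypothetical and no benchmarked tool is reproduced): Bray-Curtis dissimilarity uses k-mer counts, while Jaccard distance uses only which k-mers are present.

```python
from collections import Counter


def kmer_counts(reads: list[str], k: int) -> Counter:
    """Pool overlapping k-mer counts over all reads of one sample."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a: Counter, b: Counter) -> float:
    """Quantitative dissimilarity: sum of |count differences| over total counts."""
    keys = set(a) | set(b)
    numerator = sum(abs(a[x] - b[x]) for x in keys)  # Counter returns 0 for missing keys
    return numerator / (sum(a.values()) + sum(b.values()))


def jaccard_distance(a: Counter, b: Counter) -> float:
    """Presence/absence distance: 1 - |shared k-mers| / |all k-mers|."""
    sa, sb = set(a), set(b)
    return 1 - len(sa & sb) / len(sa | sb)
```

Two samples that share most k-mers but at very different abundances will look close under the Jaccard distance and far apart under Bray-Curtis, which is one reason the two families of metrics can correlate differently with taxonomic distances.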
Affiliation(s)
- Alise Jany Ponsero
  - Human Microbiome Research Program, University of Helsinki, Helsinki 00290, Finland
  - Department of Biosystems Engineering, The University of Arizona, Tucson, AZ 85721, USA
  - BIO5 Institute, The University of Arizona, Tucson, AZ 85721, USA
- Matthew Miller
  - Department of Biosystems Engineering, The University of Arizona, Tucson, AZ 85721, USA
- Bonnie Louise Hurwitz
  - Department of Biosystems Engineering, The University of Arizona, Tucson, AZ 85721, USA
  - BIO5 Institute, The University of Arizona, Tucson, AZ 85721, USA
6
Price C, Russell JA. AMAnD: an automated metagenome anomaly detection methodology utilizing DeepSVDD neural networks. Front Public Health 2023; 11:1181911. [PMID: 37497030 PMCID: PMC10368493 DOI: 10.3389/fpubh.2023.1181911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 06/12/2023] [Indexed: 07/28/2023] Open
Abstract
The composition of metagenomic communities within the human body often reflects localized medical conditions such as upper respiratory diseases and gastrointestinal diseases. Fast and accurate computational tools to flag anomalous metagenomic samples from typical samples are desirable to understand different phenotypes, especially in contexts where repeated, long-duration temporal sampling is done. Here, we present Automated Metagenome Anomaly Detection (AMAnD), which utilizes two types of Deep Support Vector Data Description (DeepSVDD) models; one trained on taxonomic feature space output by the Pan-Genomics for Infectious Agents (PanGIA) taxonomy classifier and one trained on kmer frequency counts. AMAnD's semi-supervised one-class approach makes no assumptions about what an anomaly may look like, allowing the flagging of potentially novel anomaly types. Three diverse datasets are profiled. The first dataset is hosted on the National Center for Biotechnology Information's (NCBI) Sequence Read Archive (SRA) and contains nasopharyngeal swabs from healthy and COVID-19-positive patients. The second dataset is also hosted on SRA and contains gut microbiome samples from normal controls and from patients with slow transit constipation (STC). AMAnD can learn a typical healthy nasopharyngeal or gut microbiome profile and reliably flag the anomalous COVID+ or STC samples in both feature spaces. The final dataset is a synthetic metagenome created by the Critical Assessment of Metagenome Annotation Simulator (CAMISIM). A control dataset of 50 well-characterized organisms was submitted to CAMISIM to generate 100 synthetic control class samples. The experimental conditions included 12 different spiked-in contaminants that are taxonomically similar to organisms present in the laboratory blank sample ranging from one strain tree branch taxonomic distance away to one family tree branch taxonomic distance away. 
This experiment was repeated in triplicate at three different coverage levels to probe the dependence on sample coverage. AMAnD was again able to flag the contaminant inserts as anomalous. AMAnD's assumption-free flagging of metagenomic anomalies, the real-time model training update potential of the deep learning approach, and the strong performance even with lightweight models of low sample cardinality would make AMAnD well-suited to a wide array of applied metagenomics biosurveillance use-cases, from environmental to clinical utility.
7
Simons AL, Theroux S, Osborne M, Nuzhdin S, Mazor R, Steele J. Zeta diversity patterns in metabarcoded lotic algal assemblages as a tool for bioassessment. Ecological Applications 2023; 33:e2812. [PMID: 36708145 DOI: 10.1002/eap.2812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2022] [Revised: 12/07/2022] [Accepted: 12/20/2022] [Indexed: 06/18/2023]
Abstract
Assessments of the ecological health of algal assemblages in streams typically focus on measures of their local diversity and classify individuals by morphotaxonomy. Such assemblages are often connected through various ecological processes, such as dispersal, and may be more accurately assessed as components of regional-, rather than local-scale assemblages. With recent declines in the costs of sequencing and computation, it has also become increasingly feasible to use metabarcoding to more accurately classify algal species and perform regional-scale bioassessments. Recently, zeta diversity has been explored as a novel method of constructing regional bioassessments for groups of streams. Here, we model the use of zeta diversity to investigate whether stream health can be determined by the landscape diversity of algal assemblages. We also compare the use of DNA metabarcoding and morphotaxonomy classifications in these zeta diversity-based bioassessments of regional stream health. From 96 stream samples in California, we used various orders of zeta diversity to construct models of biotic integrity for multiple assemblages of diatoms, as well as hybrid assemblages of diatoms in combination with soft-bodied algae, using taxonomy data generated with both DNA sequencing as well as traditional morphotaxonomic approaches. We compared our ability to evaluate the ecological health of streams with the performance of multiple algal indices of biological condition. Our zeta diversity-based models of regional biotic integrity were more strongly correlated with existing indices for algal assemblages classified using metabarcoding compared to morphotaxonomy. Metabarcoding for diatoms and hybrid algal assemblages involved rbcL and 18S V9 primers, respectively. Importantly, we also found that these algal assemblages, independent of the classification method, are more likely to be assembled under a process of niche differentiation rather than stochastically. 
Taken together, these results suggest the potential for zeta diversity patterns of algal assemblages classified using metabarcoding to inform stream bioassessments.
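For readers unfamiliar with the metric, zeta diversity of order i is commonly defined as the mean number of taxa shared by every combination of i sites. The sketch below is a brute-force illustration under that definition; the function name and the toy site data are hypothetical, and it does not reproduce the authors' modeling of biotic integrity.

```python
from itertools import combinations


def zeta_diversity(sites: list[set[str]], order: int) -> float:
    """Mean number of taxa shared across all combinations of `order` sites."""
    combos = list(combinations(sites, order))
    return sum(len(set.intersection(*combo)) for combo in combos) / len(combos)


# Hypothetical toy data: taxa detected at three stream sites.
sites = [{"a", "b", "c"}, {"b", "c"}, {"c", "d"}]
zeta_by_order = [zeta_diversity(sites, i) for i in (1, 2, 3)]
```

Order 1 is simply mean richness per site, and the values cannot increase with order because each added site can only remove taxa from the shared set. Enumerating all combinations is only practical for small numbers of sites; larger studies typically subsample combinations.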
Affiliation(s)
- Ariel Levi Simons
  - Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Susanna Theroux
  - Southern California Coastal Water Research Project, Costa Mesa, California, USA
- Melisa Osborne
  - Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Sergey Nuzhdin
  - Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Raphael Mazor
  - Southern California Coastal Water Research Project, Costa Mesa, California, USA
- Joshua Steele
  - Southern California Coastal Water Research Project, Costa Mesa, California, USA
8
Pradhan UK, Meher PK, Naha S, Rao AR, Gupta A. ASLncR: a novel computational tool for prediction of abiotic stress-responsive long non-coding RNAs in plants. Funct Integr Genomics 2023; 23:113. [PMID: 37000299 DOI: 10.1007/s10142-023-01040-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2022] [Revised: 03/23/2023] [Accepted: 03/24/2023] [Indexed: 04/01/2023]
Abstract
Abiotic stresses are detrimental to plant growth and development and have a major negative impact on crop yields. A growing body of evidence indicates that a large number of long non-coding RNAs (lncRNAs) are key to many abiotic stress responses. Thus, identifying abiotic stress-responsive lncRNAs is essential in crop breeding programs in order to develop crop cultivars resistant to abiotic stresses. In this study, we have developed the first machine learning-based computational model for predicting abiotic stress-responsive lncRNAs. The lncRNA sequences that were responsive and non-responsive to abiotic stresses served as the two classes of the dataset for binary classification using machine learning algorithms. The training dataset was created using 263 stress-responsive and 263 non-stress-responsive sequences, whereas the independent test set consists of 101 sequences from both classes. Because machine learning models accept only numeric input, k-mer features of sizes 1 to 6 were used to represent lncRNAs in numeric form. To select important features, four different feature selection strategies were utilized. Among the seven learning algorithms, the support vector machine (SVM) achieved the highest cross-validation accuracy with the selected feature sets. The observed 5-fold cross-validation accuracy, AU-ROC, and AU-PRC were found to be 68.84, 72.78, and 75.86%, respectively. Furthermore, the robustness of the developed model (SVM with the selected features) was evaluated using an independent test dataset, where the overall accuracy, AU-ROC, and AU-PRC were found to be 76.23, 87.71, and 88.49%, respectively. The developed computational approach was also implemented in an online prediction tool, ASLncR, accessible at https://iasri-sg.icar.gov.in/aslncr/. The proposed computational model and the developed prediction tool are believed to supplement the existing effort for the identification of abiotic stress-responsive lncRNAs in plants.
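The numeric encoding described in the abstract, k-mer features for sizes 1 to 6, can be sketched as below. The abstract does not say how the counts were normalized or whether sequences were written with U or T, so the choices here (per-k relative frequencies, U mapped to T) and the function name `kmer_features` are assumptions made for illustration.

```python
from itertools import product


def kmer_features(seq: str, k_max: int = 6) -> list[float]:
    """Concatenate relative k-mer frequencies for k = 1..k_max into one vector."""
    seq = seq.upper().replace("U", "T")  # assumption: treat RNA and DNA notation alike
    features: list[float] = []
    for k in range(1, k_max + 1):
        vocab = ["".join(p) for p in product("ACGT", repeat=k)]
        index = {kmer: i for i, kmer in enumerate(vocab)}
        counts = [0] * len(vocab)
        for i in range(len(seq) - k + 1):
            j = index.get(seq[i:i + k])  # windows with ambiguous bases are skipped
            if j is not None:
                counts[j] += 1
        total = sum(counts)
        features.extend(c / total if total else 0.0 for c in counts)
    return features
```

With k from 1 to 6 the vector has 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 entries per sequence, which helps explain why a feature selection step precedes model training when only a few hundred training sequences are available.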
Affiliation(s)
- Upendra Kumar Pradhan
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
- Prabina Kumar Meher
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
- Sanchita Naha
  - Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
- Ajit Gupta
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
9
Panda A, Tuller T. Determinants of associations between codon and amino acid usage patterns of microbial communities and the environment inferred based on a cross-biome metagenomic analysis. NPJ Biofilms Microbiomes 2023; 9:5. [PMID: 36693851 PMCID: PMC9873608 DOI: 10.1038/s41522-023-00372-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2022] [Accepted: 01/11/2023] [Indexed: 01/25/2023] Open
Abstract
Codon and amino acid usage were associated with almost every aspect of microbial life. However, how the environment may impact the codon and amino acid choice of microbial communities at the habitat level is not clearly understood. Therefore, in this study, we analyzed codon and amino acid usage patterns of a large number of environmental samples collected from diverse ecological niches. Our results suggested that samples derived from similar environmental niches, in general, show overall similar codon and amino acid distribution as compared to samples from other habitats. To substantiate the relative impact of the environment, we considered several factors, such as their similarity in GC content, or in functional or taxonomic abundance. Our analysis demonstrated that none of these factors can fully explain the trends that we observed at the codon or amino acid level, implying a direct environmental influence on them. Further, our analysis demonstrated different levels of selection on codon bias in different microbial communities, with the highest bias in host-associated environments such as the digestive system or oral samples and the lowest level of selection in soil and water samples. Considering a large number of metagenomic samples, we showed that microorganisms collected from similar environmental backgrounds exhibit similar patterns of codon and amino acid usage irrespective of the location or time from where the samples were collected. Thus, our study suggested a direct impact of the environment on codon and amino acid usage of microorganisms that cannot be explained considering the influence of other factors.
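A codon usage profile of the kind compared across samples can be computed as relative codon frequencies over in-frame coding sequences. The sketch below is a generic illustration: the function name `codon_usage` is hypothetical, the input is assumed to be DNA coding sequences already in reading frame, and the abstract does not specify the exact usage metrics or bias measures the authors applied.

```python
from collections import Counter


def codon_usage(coding_sequences: list[str]) -> dict[str, float]:
    """Relative frequency of each codon, pooled over in-frame coding sequences."""
    counts = Counter()
    for cds in coding_sequences:
        cds = cds.upper()
        # Step through the sequence in frame; a trailing partial codon is ignored.
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if set(codon) <= set("ACGT"):
                counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}
```

Profiles computed this way for each metagenomic sample can then be compared with any vector distance to ask whether samples from the same habitat cluster together.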
Affiliation(s)
- Arup Panda
  - Department of Biomedical Engineering, Tel Aviv University, Tel Aviv, 69978, Israel
- Tamir Tuller
  - Department of Biomedical Engineering, Tel Aviv University, Tel Aviv, 69978, Israel
10
Zhai H, Fukuyama J. A convenient correspondence between k-mer-based metagenomic distances and phylogenetically-informed β-diversity measures. PLoS Comput Biol 2023; 19:e1010821. [PMID: 36608056 PMCID: PMC9879504 DOI: 10.1371/journal.pcbi.1010821] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2022] [Revised: 01/26/2023] [Accepted: 12/16/2022] [Indexed: 01/07/2023] Open
Abstract
k-mer-based distances are often used to describe the differences between communities in metagenome sequencing studies because of their computational convenience and history of effectiveness. Although k-mer-based distances do not use information about taxon abundances, we show that one class of k-mer distances between metagenomes (the Euclidean distance between k-mer spectra, or EKS distances) are very closely related to a class of phylogenetically-informed β-diversity measures that do explicitly use both the taxon abundances and information about the phylogenetic relationships among the taxa. Furthermore, we show that both of these distances can be interpreted as using certain features of the taxon abundances that are related to the phylogenetic tree. Our results allow practitioners to perform phylogenetically-informed analyses when they only have k-mer data available and provide a theoretical basis for using k-mer spectra with relatively small values of k (on the order of 4-5). They are also useful for analysts who wish to know more of the properties of any method based on k-mer spectra and provide insight into one class of phylogenetically-informed β-diversity measures.
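The EKS distance named in the abstract is the Euclidean distance between two k-mer spectra. A minimal sketch follows, with k = 4 chosen from the range the abstract mentions; normalizing each spectrum to relative frequencies is an assumption of this example, as are the function names.

```python
import math
from collections import Counter


def kmer_spectrum(reads: list[str], k: int = 4) -> dict[str, float]:
    """Relative frequency of each k-mer pooled over all reads of a sample."""
    counts = Counter(read[i:i + k] for read in reads for i in range(len(read) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def eks_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Euclidean distance between two k-mer spectra (missing k-mers count as 0)."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))
```

For k = 4 each spectrum has at most 256 entries, so pairwise distances between many metagenomes stay cheap to compute.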
Affiliation(s)
- Hongxuan Zhai
  - Department of Statistics, Indiana University Bloomington, Bloomington, Indiana, United States of America
- Julia Fukuyama
  - Department of Statistics, Indiana University Bloomington, Bloomington, Indiana, United States of America
11
Xie XH, Huang YJ, Han GS, Yu ZG, Ma YL. Microbial characterization based on multifractal analysis of metagenomes. Front Cell Infect Microbiol 2023; 13:1117421. [PMID: 36779183 PMCID: PMC9910082 DOI: 10.3389/fcimb.2023.1117421] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2022] [Accepted: 01/09/2023] [Indexed: 01/28/2023] Open
Abstract
Introduction: The species diversity of microbiomes is a cutting-edge concept in metagenomic research. In this study, we propose a multifractal analysis for metagenomic research. Method and Results: Firstly, we visualized the chaotic game representation (CGR) of simulated metagenomes and real metagenomes. We find that metagenomes are visualized with self-similarity. Then we defined and calculated the multifractal dimension for the visualized plot of simulated and real metagenomes, respectively. By analyzing the Pearson correlation coefficients between the multifractal dimension and the traditional species diversity index, we find that the correlation coefficients between the multifractal dimension and the species richness index and Shannon diversity index reached the maximum value when q = 0, 1, and the correlation coefficient between the multifractal dimension and the Simpson diversity index reached the maximum value when q = 5. Finally, we apply our method to real metagenomes of the gut microbiota of 100 infants as newborns and at 4 and 12 months of age. The results show that the multifractal dimensions of an infant's gut microbiomes can distinguish age differences. Conclusion and Discussion: There is self-similarity among the CGRs of WGS of metagenomes, and the multifractal spectrum is an important characteristic for metagenomes. The traditional diversity indicators can be unified under the framework of multifractal analysis. These results coincided with similar results in macrobial ecology. The multifractal spectrum of infants' gut microbiomes is related to the development of the infants.
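The chaos game representation used in this abstract maps a nucleotide sequence to points in the unit square: starting from the centre, each base moves the current point halfway toward the corner assigned to that base. The sketch below assumes the conventional corner layout (A bottom-left, C top-left, G top-right, T bottom-right); the abstract does not state the layout the authors used, and the multifractal dimension calculation itself is not reproduced here.

```python
# Assumed corner assignment for the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq: str) -> list[tuple[float, float]]:
    """Return the CGR trajectory of seq; bases other than A, C, G, T are skipped."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway toward the base's corner
        points.append((x, y))
    return points
```

Binning these points on a 2^k by 2^k grid yields k-mer counts, which is why a CGR image can be read as a picture of the k-mer spectrum and analyzed for self-similarity.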
Collapse
Affiliation(s)
- Xian-hua Xie
- Key Laboratory of Jiangxi Province for Numerical Simulation and Emulation Techniques, Gannan Normal University, Ganzhoiu, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China
- *Correspondence: Xian-hua Xie,
| | - Yu-jie Huang
- Key Laboratory of Jiangxi Province for Numerical Simulation and Emulation Techniques, Gannan Normal University, Ganzhou, China
| | - Guo-sheng Han
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China
| | - Zu-guo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China
| | - Yuan-lin Ma
- School of Economics, Zhengzhou University of Aeronautics, Zhengzhou, China
|
12
|
Chakoory O, Comtet-Marre S, Peyret P. RiboTaxa: combined approaches for rRNA genes taxonomic resolution down to the species level from metagenomics data revealing novelties. NAR Genom Bioinform 2022; 4:lqac070. [PMID: 36159175 PMCID: PMC9492272 DOI: 10.1093/nargab/lqac070] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/06/2022] [Revised: 08/04/2022] [Accepted: 08/31/2022] [Indexed: 11/13/2022] Open
Abstract
Metagenomic classifiers are widely used for the taxonomic profiling of metagenomics data and the estimation of taxa relative abundance. Small subunit rRNA genes are a gold standard for phylogenetic resolution of microbiota, although the power of this marker depends on using its full length. We aimed to identify the tools that can efficiently lead to taxonomic resolution down to the species level. To reach this goal, we benchmarked the performance and accuracy of rRNA-specialized versus general-purpose read mappers, reference-targeted assemblers and taxonomic classifiers. We then compiled the best tools (BBTools, FastQC, SortMeRNA, MetaRib, EMIRGE, VSEARCH, BBMap and QIIME 2’s Sklearn classifier) to build a pipeline called RiboTaxa. On metagenomics datasets, RiboTaxa gave the best results compared to other tools (i.e. Kraken2, Centrifuge, METAXA2, phyloFlash, SPINGO, BLCA, MEGAN), with precise taxonomic identification and relative abundance description without false positive detection (F-measure of 100% and 83.7% at genus level and species level, respectively). Using real datasets from various environments (i.e. ocean, soil, human gut) and from different approaches (e.g. metagenomics and gene capture by hybridization), RiboTaxa revealed microbial novelties not discerned by current bioinformatics analysis, opening new biological perspectives in human and environmental health.
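The F-measure quoted in this abstract is the harmonic mean of precision and recall. The sketch below shows how such a score could be computed for a taxonomic profile when predicted and true taxa are treated as sets. The function name `f_measure` and the set-based treatment are assumptions for illustration; the benchmark in the paper may score detections differently.

```python
def f_measure(predicted, truth):
    """Harmonic mean of precision and recall for two collections of taxon names."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0  # also covers empty inputs, where precision or recall is undefined
    precision = true_positives / len(predicted)  # fraction of calls that are correct
    recall = true_positives / len(truth)         # fraction of true taxa recovered
    return 2 * precision * recall / (precision + recall)
```

A profile with no false positives has a precision of 1, so its F-measure is limited only by recall. For instance, two correct calls out of three true taxa gives 0.8.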
Affiliation(s)
- Oshma Chakoory
- Université Clermont Auvergne, INRAE, MEDIS , F-63000 Clermont-Ferrand, France
| | - Sophie Comtet-Marre
- Université Clermont Auvergne, INRAE, MEDIS , F-63000 Clermont-Ferrand, France
| | - Pierre Peyret
- Université Clermont Auvergne, INRAE, MEDIS , F-63000 Clermont-Ferrand, France
|
13
|
Strain identification and quantitative analysis in microbial communities. J Mol Biol 2022; 434:167582. [DOI: 10.1016/j.jmb.2022.167582] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/01/2022] [Revised: 03/31/2022] [Accepted: 04/03/2022] [Indexed: 12/14/2022]
|
14
|
Vanni C, Schechter MS, Acinas SG, Barberán A, Buttigieg PL, Casamayor EO, Delmont TO, Duarte CM, Eren AM, Finn RD, Kottmann R, Mitchell A, Sánchez P, Siren K, Steinegger M, Gloeckner FO, Fernàndez-Guerra A. Unifying the known and unknown microbial coding sequence space. eLife 2022; 11:e67667. [PMID: 35356891 PMCID: PMC9132574 DOI: 10.7554/elife.67667] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 02/18/2021] [Accepted: 03/30/2022] [Indexed: 12/02/2022] Open
Abstract
Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40-60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction into analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS and a demonstration on how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal genomes, we quantify the extent of the unknown fraction, its diversity, and its relevance across multiple organisms and environments. The unknown sequence space is exceptionally diverse, phylogenetically more conserved than the known fraction and predominantly taxonomically restricted at the species level. From the 71 M genes identified to be of unknown function, we compiled a collection of 283,874 lineage-specific genes of unknown function for Cand. Patescibacteria (also known as Candidate Phyla Radiation, CPR), which provides a significant resource to expand our understanding of their unusual biology. Finally, by identifying a target gene of unknown function for antibiotic resistance, we demonstrate how we can enable the generation of hypotheses that can be used to augment experimental data.
Affiliation(s)
- Chiara Vanni
- Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine Microbiology, Bremen, Germany
- Jacobs University Bremen, Bremen, Germany
| | - Matthew S Schechter
- Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine Microbiology, Bremen, Germany
- Department of Medicine, University of Chicago, Chicago, United States
| | - Silvia G Acinas
- Department of Marine Biology and Oceanography, Institut de Ciències del Mar (CSIC), Barcelona, Spain
| | - Albert Barberán
- Department of Environmental Science, University of Arizona, Tucson, United States
| | - Pier Luigi Buttigieg
- Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Alfred Wegener Institute, Bremerhaven, Germany
| | - Emilio O Casamayor
- Center for Advanced Studies of Blanes CEAB-CSIC, Spanish Council for Research, Blanes, Spain
| | - Tom O Delmont
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, Evry, France
| | - Carlos M Duarte
- Red Sea Research Centre and Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - A Murat Eren
- Department of Medicine, University of Chicago, Chicago, United States
- Josephine Bay Paul Center, Marine Biological Laboratory, Woods Hole, United States
| | - Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
| | - Renzo Kottmann
- Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine Microbiology, Bremen, Germany
| | - Alex Mitchell
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
| | - Pablo Sánchez
- Department of Marine Biology and Oceanography, Institut de Ciències del Mar (CSIC), Barcelona, Spain
| | - Kimmo Siren
- Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
| | - Martin Steinegger
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- Institute of Molecular Biology and Genetics, Seoul National University, Seoul, Republic of Korea
| | - Frank Oliver Gloeckner
- Jacobs University Bremen, Bremen, Germany
- University of Bremen and Life Sciences and Chemistry, Bremen, Germany
- Computing Center, Helmholtz Center for Polar and Marine Research, Bremerhaven, Germany
| | - Antonio Fernàndez-Guerra
- Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine Microbiology, Bremen, Germany
- Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
|
15
|
Bennett C, Thornton M, Park C, Henry G, Zhang Y, Malladi V, Kim D. SeqWho: reliable, rapid determination of sequence file identity using k-mer frequencies in Random Forest classifiers. Bioinformatics 2022; 38:1830-1837. [PMID: 35134110 PMCID: PMC8963323 DOI: 10.1093/bioinformatics/btac050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2021] [Revised: 01/12/2022] [Accepted: 01/26/2022] [Indexed: 02/05/2023] Open
Abstract
MOTIVATION With the vast improvements in sequencing technologies and increased number of protocols, sequencing is being used to answer complex biological problems. Subsequently, analysis pipelines have become more time consuming and complicated, usually requiring highly extensive prevalidation steps. Here, we present SeqWho, a program designed to assess heuristically the quality of sequencing files and reliably classify the organism and protocol type by using Random Forest classifiers trained on biases native in k-mer frequencies and repeat sequence identities. RESULTS Using one of our primary models, we show that our method accurately and rapidly classifies human and mouse sequences from nine different sequencing libraries by species, library and both together, 98.32%, 97.86% and 96.38% of the time, respectively. Ultimately, we demonstrate that SeqWho is a powerful method for reliably validating the quality and identity of the sequencing files used in any pipeline. AVAILABILITY AND IMPLEMENTATION https://github.com/DaehwanKimLab/seqwho. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
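A classifier of the kind described here needs a fixed-length numeric feature vector per sequencing file. The sketch below shows one plausible way to turn a read into normalized k-mer frequencies in a fixed lexicographic order. It is an illustrative assumption, not SeqWho's implementation: the function name `kmer_frequencies`, the choice to ignore k-mers containing non-ACGT symbols, and the normalization are all hypothetical. The resulting vectors could then be passed to any Random Forest implementation.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Return normalized k-mer frequencies in lexicographic ACGT order."""
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed feature order: all 4**k k-mers over A, C, G, T.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (for example N) are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]
```

Averaging these vectors over a sample of reads from a file would give one feature row per file. The vector length grows as 4**k, so small k keeps the feature space manageable.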
Affiliation(s)
| | | | - Chanhee Park
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern, Dallas, TX 75390, USA
| | - Gervaise Henry
- Department of Urology, University of Texas Southwestern, Dallas, TX 75390, USA
| | - Yun Zhang
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern, Dallas, TX 75390, USA
| | - Venkat Malladi
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern, Dallas, TX 75390, USA
|
16
|
Tay AP, Hosking B, Hosking C, Bauer DC, Wilson LO. INSIDER: alignment-free detection of foreign DNA sequences. Comput Struct Biotechnol J 2021; 19:3810-3816. [PMID: 34285780 PMCID: PMC8273350 DOI: 10.1016/j.csbj.2021.06.045] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/15/2021] [Revised: 06/28/2021] [Accepted: 06/28/2021] [Indexed: 11/21/2022] Open
Abstract
External DNA sequences can be inserted into an organism's genome either through natural processes such as gene transfer, or through targeted genome engineering strategies. Being able to robustly identify such foreign DNA is a crucial capability for health and biosecurity applications, such as anti-microbial resistance (AMR) detection or monitoring gene drives. This capability does not exist for poorly characterised host genomes or with limited information about the integrated sequence. To address this, we developed the INserted Sequence Information DEtectoR (INSIDER). INSIDER analyses whole genome sequencing data and identifies segments of potentially foreign origin by their significant shift in k-mer signatures. We demonstrate the power of INSIDER to separate integrated DNA sequences from normal genomic sequences on a synthetic dataset simulating the insertion of a CRISPR-Cas gene drive into wild-type yeast. As a proof-of-concept, we use INSIDER to detect the exact AMR plasmid in whole genome sequencing data from a Citrobacter freundii patient isolate. INSIDER streamlines the process of identifying integrated DNA in poorly characterised wild species or when the insert is of unknown origin, thus enhancing the monitoring of emerging biosecurity threats.
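The idea of flagging segments by a shift in k-mer signature can be sketched as follows. This is a deliberately simple stand-in and not the INSIDER algorithm: it compares the k-mer profile of each window with the genome-wide profile using an L1 distance, and the function names, window size, step, and distance measure are all assumptions chosen for illustration.

```python
from collections import Counter


def kmer_profile(seq, k):
    """Map each observed k-mer to its relative frequency in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def window_shifts(seq, k=2, window=20, step=20):
    """Return (start, L1 distance to the whole-sequence profile) per window."""
    background = kmer_profile(seq, k)
    shifts = []
    for start in range(0, len(seq) - window + 1, step):
        local = kmer_profile(seq[start:start + window], k)
        keys = set(local) | set(background)
        distance = sum(
            abs(local.get(kmer, 0.0) - background.get(kmer, 0.0)) for kmer in keys
        )
        shifts.append((start, distance))
    return shifts
```

On a toy sequence made of a poly-A background with a short GC-repeat insert, the window covering the insert receives the largest distance. Real data would need larger k, longer windows, and a statistical threshold rather than a simple maximum.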
Affiliation(s)
- Aidan P. Tay
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
- Applied BioSciences, Faculty of Science and Engineering, Macquarie University, New South Wales, Sydney, Australia
| | - Brendan Hosking
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
| | - Cameron Hosking
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
| | - Denis C. Bauer
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
- Department of Biomedical Sciences, Macquarie University, New South Wales, Sydney, Australia
- Applied BioSciences, Faculty of Science and Engineering, Macquarie University, New South Wales, Sydney, Australia
| | - Laurence O.W. Wilson
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
- Applied BioSciences, Faculty of Science and Engineering, Macquarie University, New South Wales, Sydney, Australia
|
17
|
Holding ML, Strickland JL, Rautsaw RM, Hofmann EP, Mason AJ, Hogan MP, Nystrom GS, Ellsworth SA, Colston TJ, Borja M, Castañeda-Gaytán G, Grünwald CI, Jones JM, Freitas-de-Sousa LA, Viala VL, Margres MJ, Hingst-Zaher E, Junqueira-de-Azevedo ILM, Moura-da-Silva AM, Grazziotin FG, Gibbs HL, Rokyta DR, Parkinson CL. Phylogenetically diverse diets favor more complex venoms in North American pitvipers. Proc Natl Acad Sci U S A 2021; 118:e2015579118. [PMID: 33875585 PMCID: PMC8092465 DOI: 10.1073/pnas.2015579118] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Indexed: 01/17/2023] Open
Abstract
The role of natural selection in the evolution of trait complexity can be characterized by testing hypothesized links between complex forms and their functions across species. Predatory venoms are composed of multiple proteins that collectively function to incapacitate prey. Venom complexity fluctuates over evolutionary timescales, with apparent increases and decreases in complexity, and yet the causes of this variation are unclear. We tested alternative hypotheses linking venom complexity and ecological sources of selection from diet in the largest clade of front-fanged venomous snakes in North America: the rattlesnakes, copperheads, cantils, and cottonmouths. We generated independent transcriptomic and proteomic measures of venom complexity and collated several natural history studies to quantify dietary variation. We then constructed genome-scale phylogenies for these snakes for comparative analyses. Strikingly, prey phylogenetic diversity was more strongly correlated to venom complexity than was overall prey species diversity, specifically implicating prey species' divergence, rather than the number of lineages alone, in the evolution of complexity. Prey phylogenetic diversity further predicted transcriptomic complexity of three of the four largest gene families in viper venom, showing that complexity evolution is a concerted response among many independent gene families. We suggest that the phylogenetic diversity of prey measures functionally relevant divergence in the targets of venom, a claim supported by sequence diversity in the coagulation cascade targets of venom. Our results support the general concept that the diversity of species in an ecological community is more important than their overall number in determining evolutionary patterns in predator trait complexity.
Affiliation(s)
- Matthew L Holding
- Department of Biological Sciences, Clemson University, Clemson, SC 29634;
- Department of Biological Science, Florida State University, Tallahassee, FL 32306
| | - Jason L Strickland
- Department of Biological Sciences, Clemson University, Clemson, SC 29634
| | - Rhett M Rautsaw
- Department of Biological Sciences, Clemson University, Clemson, SC 29634
| | - Erich P Hofmann
- Department of Biological Sciences, Clemson University, Clemson, SC 29634
| | - Andrew J Mason
- Department of Biological Sciences, Clemson University, Clemson, SC 29634
- Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, OH 43210
| | - Michael P Hogan
- Department of Biological Science, Florida State University, Tallahassee, FL 32306
| | - Gunnar S Nystrom
- Department of Biological Science, Florida State University, Tallahassee, FL 32306
| | - Schyler A Ellsworth
- Department of Biological Science, Florida State University, Tallahassee, FL 32306
| | - Timothy J Colston
- Department of Biological Science, Florida State University, Tallahassee, FL 32306
| | - Miguel Borja
- Facultad de Ciencias Biológicas, Universidad Juárez del Estado de Durango, C.P. 35010 Gómez Palacio, Dgo., Mexico
| | - Gamaliel Castañeda-Gaytán
- Facultad de Ciencias Biológicas, Universidad Juárez del Estado de Durango, C.P. 35010 Gómez Palacio, Dgo., Mexico
| | | | - Jason M Jones
- HERP.MX A.C., Villa del Álvarez, Colima 28973, Mexico
| | | | - Vincent Louis Viala
- Laboratório de Toxinologia Aplicada, Instituto Butantan, São Paulo 05503-900, Brazil
- Center of Toxins, Immune-Response and Cell Signaling, São Paulo 05503-900, Brazil
| | - Mark J Margres
- Department of Biological Sciences, Clemson University, Clemson, SC 29634
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138
| | | | - Inácio L M Junqueira-de-Azevedo
- Laboratório de Toxinologia Aplicada, Instituto Butantan, São Paulo 05503-900, Brazil
- Center of Toxins, Immune-Response and Cell Signaling, São Paulo 05503-900, Brazil
| | - Ana M Moura-da-Silva
- Laboratório de Imunopatologia, Instituto Butantan, São Paulo 05503-900, Brazil
- Instituto de Pesquisa Clínica Carlos Borborema, Fundação de Medicina Tropical Doutor Heitor Vieira Dourado, Manaus 69040, Brazil
| | - Felipe G Grazziotin
- Laboratório de Coleções Zoológicas, Instituto Butantan, São Paulo 05503-900, Brazil
| | - H Lisle Gibbs
- Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, OH 43210
| | - Darin R Rokyta
- Department of Biological Science, Florida State University, Tallahassee, FL 32306
| | - Christopher L Parkinson
- Department of Biological Sciences, Clemson University, Clemson, SC 29634;
- Department of Forestry and Environmental Conservation, Clemson University, Clemson, SC 29634
|
18
|
Bakhtiari M, Park J, Ding YC, Shleizer-Burko S, Neuhausen SL, Halldórsson BV, Stefánsson K, Gymrek M, Bafna V. Variable number tandem repeats mediate the expression of proximal genes. Nat Commun 2021; 12:2075. [PMID: 33824302 PMCID: PMC8024321 DOI: 10.1038/s41467-021-22206-z] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Received: 05/13/2020] [Accepted: 02/17/2021] [Indexed: 12/12/2022] Open
Abstract
Variable number tandem repeats (VNTRs) account for significant genetic variation in many organisms. In humans, VNTRs have been implicated in both Mendelian and complex disorders, but are largely ignored by genomic pipelines due to the complexity of genotyping and the computational expense. We describe adVNTR-NN, a method that uses shallow neural networks to genotype a VNTR in 18 seconds on 55X whole genome data, while maintaining high accuracy. We use adVNTR-NN to genotype 10,264 VNTRs in 652 GTEx individuals. Associating VNTR length with gene expression in 46 tissues, we identify 163 "eVNTRs". Of the 22 eVNTRs in blood where independent data is available, 21 (95%) are replicated in terms of significance and direction of association. 49% of the eVNTR loci show a strong and likely causal impact on the expression of genes and 80% have maximum effect size at least 0.3. The impacted genes are involved in diseases including Alzheimer's, obesity and familial cancers, highlighting the importance of VNTRs for understanding the genetic basis of complex diseases.
Affiliation(s)
- Mehrdad Bakhtiari
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, USA
| | - Jonghun Park
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, USA
| | - Yuan-Chun Ding
- Department of Population Sciences, Beckman Research Institute of City of Hope, Duarte, CA, USA
| | | | - Susan L Neuhausen
- Department of Population Sciences, Beckman Research Institute of City of Hope, Duarte, CA, USA
| | | | | | - Melissa Gymrek
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, USA
- Department of Medicine, University of California, San Diego, La Jolla, CA, USA
| | - Vineet Bafna
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, USA.
|
19
|
Bize A, Midoux C, Mariadassou M, Schbath S, Forterre P, Da Cunha V. Exploring short k-mer profiles in cells and mobile elements from Archaea highlights the major influence of both the ecological niche and evolutionary history. BMC Genomics 2021; 22:186. [PMID: 33726663 PMCID: PMC7962313 DOI: 10.1186/s12864-021-07471-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/13/2020] [Accepted: 02/24/2021] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND K-mer-based methods have greatly advanced in recent years, largely driven by the realization of their biological significance and by the advent of next-generation sequencing. Their speed and their independence from the annotation process are major advantages. Their utility in the study of the mobilome has recently emerged and they seem a priori adapted to the patchy gene distribution and the lack of universal marker genes of viruses and plasmids. To provide a framework for the interpretation of results from k-mer based methods applied to archaea or their mobilome, we analyzed the 5-mer DNA profiles of close to 600 archaeal cells, viruses and plasmids. Archaea is one of the three domains of life. Archaea seem enriched in extremophiles and are associated with a high diversity of viral and plasmid families, many of which are specific to this domain. We explored the dataset structure by multivariate and statistical analyses, seeking to identify the underlying factors. RESULTS For cells, the 5-mer profiles were inconsistent with the phylogeny of archaea. At a finer taxonomic level, the influence of the taxonomy and the environmental constraints on 5-mer profiles was very strong. These two factors were interdependent to a significant extent, and the respective weights of their contributions varied according to the clade. A convergent adaptation was observed for the class Halobacteria, for which a strong 5-mer signature was identified. For mobile elements, coevolution with the host had a clear influence on their 5-mer profile. This enabled us to identify one previously known and one new case of recent host transfer based on the atypical composition of the mobile elements involved. Beyond the effect of coevolution, extrachromosomal elements strikingly retain the specific imprint of their own viral or plasmid taxonomic family in their 5-mer profile. 
CONCLUSION This specific imprint confirms that the evolution of extrachromosomal elements is driven by multiple parameters and is not restricted to host adaptation. In addition, we detected only recent host transfer events, suggesting the fast evolution of short k-mer profiles. This calls for caution when using k-mers for host prediction, metagenomic binning or phylogenetic reconstruction.
Affiliation(s)
- Ariane Bize
- Université Paris-Saclay, INRAE, PROSE, F-92761, Antony, France
| | - Cédric Midoux
- Université Paris-Saclay, INRAE, PROSE, F-92761, Antony, France
- Université Paris-Saclay, INRAE, MaIAGE, F-78350, Jouy-en-Josas, France
- Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, F-78350, Jouy-en-Josas, France
| | - Mahendra Mariadassou
- Université Paris-Saclay, INRAE, MaIAGE, F-78350, Jouy-en-Josas, France
- Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, F-78350, Jouy-en-Josas, France
| | - Sophie Schbath
- Université Paris-Saclay, INRAE, MaIAGE, F-78350, Jouy-en-Josas, France
- Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, F-78350, Jouy-en-Josas, France
| | - Patrick Forterre
- Institut Pasteur, Unité de Virologie des Archées, Département de Microbiologie, 25 Rue du Docteur Roux, 75015, Paris, France
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
| | - Violette Da Cunha
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
|
20
|
Abstract
K-mer based comparisons have emerged as powerful complements to BLAST-like alignment algorithms, particularly when the sequences being compared lack direct evolutionary relationships. In this chapter, we describe methods to compare k-mer content between groups of long noncoding RNAs (lncRNAs), to identify communities of lncRNAs with related k-mer contents, to identify the enrichment of protein-binding motifs in lncRNAs, and to scan for domains of related k-mer contents in lncRNAs. Our step-by-step instructions are complemented by Python code deposited in Github. Though our chapter focuses on lncRNAs, the methods we describe could be applied to any set of nucleic acid sequences.
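Comparing k-mer content between two sequences, as this abstract describes, can be sketched with a simple similarity score. The example below is an assumption-laden illustration rather than the chapter's deposited code: it length-normalizes overlapping k-mer counts and takes the Pearson correlation of the two resulting vectors. The function names, the handling of U as T, and the choice of Pearson correlation are hypothetical choices made here.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_vector(seq, k):
    """Length-normalized counts of all 4**k k-mers, in lexicographic order."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alphabets alike
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    n_windows = max(len(seq) - k + 1, 1)
    return [counts["".join(p)] / n_windows for p in product("ACGT", repeat=k)]


def pearson(u, v):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    denom = sqrt(sum(a * a for a in du)) * sqrt(sum(b * b for b in dv))
    return sum(a * b for a, b in zip(du, dv)) / denom if denom else 0.0


def kmer_similarity(seq_a, seq_b, k=3):
    """Alignment-free similarity between two sequences based on k-mer content."""
    return pearson(kmer_vector(seq_a, k), kmer_vector(seq_b, k))
```

Because the score depends only on k-mer content, two sequences can score highly without sharing any alignable region. Computing it for all pairs in a set yields a similarity matrix that could feed a community-detection or clustering step.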
Affiliation(s)
- Jessime M Kirk
- Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Invitae Corporation, San Francisco, CA, USA
| | - Daniel Sprague
- Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Curriculum in Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Flagship Pioneering, Boston, MA, USA
| | - J Mauro Calabrese
- Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
- Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
- Curriculum in Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
|
21
|
Yang Z, Li H, Jia Y, Zheng Y, Meng H, Bao T, Li X, Luo L. Intrinsic laws of k-mer spectra of genome sequences and evolution mechanism of genomes. BMC Evol Biol 2020; 20:157. [PMID: 33228538 PMCID: PMC7684957 DOI: 10.1186/s12862-020-01723-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/04/2020] [Accepted: 11/10/2020] [Indexed: 11/17/2022] Open
Abstract
Background K-mer spectra of DNA sequences contain important information about sequence composition and sequence evolution. We want to reveal the evolution rules of genome sequences by studying the k-mer spectra of genome sequences. Results The intrinsic laws of k-mer spectra of 920 genome sequences from primate to prokaryote were analyzed. We found that there are two types of evolution selection modes in genome sequences, named as CG Independent Selection and TA Independent Selection. There is a mutual inhibition relationship between CG and TA independent selections. We found that the intensity of CG and TA independent selections correlates closely with genome evolution and G + C content of genome sequences. The living habits of species are related closely to the independent selection modes adopted by species genomes. Consequently, we proposed an evolution mechanism of genomes in which the genome evolution is determined by the intensities of the CG and TA independent selections and the mutual inhibition relationship. Besides, by the evolution mechanism of genomes, we speculated the evolution modes of prokaryotes in mild and extreme environments in the anaerobic age and the evolving process of prokaryotes from anaerobic to aerobic environment on earth as well as the originations of different eukaryotes. Conclusion We found that there are two independent selection modes in genome sequences. The evolution of genome sequence is determined by the two independent selection modes and the mutual inhibition relationship between them.
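One common reading of a k-mer spectrum is the distribution of occurrence counts: for each count i, how many distinct k-mers appear exactly i times in the sequence. The sketch below computes that distribution over all 4**k possible k-mers, so absent k-mers show up under count 0. The function name `kmer_spectrum` and this exact definition are assumptions for illustration and may differ from the definition used in the paper.

```python
from collections import Counter
from itertools import product


def kmer_spectrum(seq, k):
    """Map occurrence count -> number of distinct k-mers with that count."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Tally over every possible k-mer so that absent ones are counted under 0.
    spectrum = Counter(counts["".join(p)] for p in product("ACGT", repeat=k))
    return dict(spectrum)
```

Restricting the tally to k-mers that contain a given dinucleotide, such as CG or TA, would give separate sub-spectra whose shapes can be compared across genomes.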
Affiliation(s)
- Zhenhua Yang
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
- School of Economics and Management, Inner Mongolia University of Science & Technology, Baotou, 014010, China
| | - Hong Li
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China.
| | - Yun Jia
- College of Science, Inner Mongolia University of Technology, Hohhot, 010051, China
| | - Yan Zheng
- Baotou Medical College, Inner Mongolia University of Science & Technology, Baotou, 014040, China
| | - Hu Meng
- School of Life Science & Technology, Inner Mongolia University of Science & Technology, Baotou, 014010, China
| | - Tonglaga Bao
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Xiaolong Li
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Liaofu Luo
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
|
22
|
Kirzhner V, Toledano-Kitai D, Volkovich Z. Evaluating the number of different genomes in a metagenome by means of the compositional spectra approach. PLoS One 2020; 15:e0237205. [PMID: 33156862 PMCID: PMC7647110 DOI: 10.1371/journal.pone.0237205] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/19/2020] [Accepted: 10/22/2020] [Indexed: 01/02/2023] Open
Abstract
Determination of metagenome composition is still one of the most interesting problems of bioinformatics. It involves a wide range of mathematical methods, from probabilistic models of combinatorics to cluster analysis and pattern recognition techniques. The successful advance of rapid sequencing methods and fast and precise metagenome analysis will increase the diagnostic value of healthy or pathological human metagenomes. The article presents the theoretical foundations of the algorithm for calculating the number of different genomes in the medium under study. The approach is based on analysis of the compositional spectra of subsequently sequenced samples of the medium. Its essential feature is the use of random fluctuations in the number of bacteria across different samples of the same metagenome. The possibility of effective implementation of the algorithm in the presence of data errors is also discussed. The work describes the metagenome evaluation algorithm, including the estimation of the genome number and the identification of the genomes with known compositional spectra. It should be emphasized that evaluating the genome number in a metagenome can always be helpful, regardless of the metagenome separation techniques, such as clustering the sequencing results or marker analysis.
Affiliation(s)
- Valery Kirzhner
- Institute of Evolution, University of Haifa, Haifa, Israel
| | - Dvora Toledano-Kitai
- Software Engineering Department, ORT Braude College of Engineering, Karmiel, Israel
| | - Zeev Volkovich
- Software Engineering Department, ORT Braude College of Engineering, Karmiel, Israel
|
23
|
Comin M, Di Camillo B, Pizzi C, Vandin F. Comparison of microbiome samples: methods and computational challenges. Brief Bioinform 2020; 22:88-95. [PMID: 32577746 DOI: 10.1093/bib/bbaa121] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 12/22/2019] [Revised: 05/09/2020] [Accepted: 05/18/2020] [Indexed: 12/14/2022]
Abstract
The study of microbial communities crucially relies on the comparison of metagenomic next-generation sequencing data sets, for which several methods have been designed in recent years. Here, we review three key challenges in the comparison of such data sets: species identification and quantification, the efficient computation of distances between metagenomic samples and the identification of metagenomic features associated with a phenotype such as disease status. We present current solutions for such challenges, considering both reference-based methods relying on a database of reference genomes and reference-free methods working directly on all sequencing reads from the samples.
|
24
|
Beier S, Ulpinnis C, Schwalbe M, Münch T, Hoffie R, Koeppel I, Hertig C, Budhagatapalli N, Hiekel S, Pathi KM, Hensel G, Grosse M, Chamas S, Gerasimova S, Kumlehn J, Scholz U, Schmutzer T. Kmasker plants - a tool for assessing complex sequence space in plant species. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2020; 102:631-642. [PMID: 31823436 DOI: 10.1111/tpj.14645] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/25/2019] [Revised: 11/27/2019] [Accepted: 11/28/2019] [Indexed: 06/10/2023]
Abstract
Many plant genomes display high levels of repetitive sequences. The assembly of these complex genomes using short high-throughput sequence reads is still a challenging task. Underestimation or disregard of repeat complexity in these datasets can easily misguide downstream analysis. Detection of repetitive regions by k-mer counting methods has proved to be reliable. Easy-to-use applications utilizing k-mer counting are in high demand, especially in the domain of plants. We present Kmasker plants, a tool, provided as a command-line and web-based solution, that uses k-mer count information to assist throughout the analytical workflow of genome data. Besides the tool's core competence of screening and masking repetitive sequences, we have integrated features that enable comparative studies between different cultivars or closely related species and methods that estimate target specificity of guide RNAs for application of site-directed mutagenesis using Cas9 endonuclease. In addition, we have set up a web service for Kmasker plants that maintains pre-computed indices for 10 of the economically most important cultivated plants. Source code for Kmasker plants has been made publicly available at https://github.com/tschmutzer/kmasker. The web service is accessible at https://kmasker.ipk-gatersleben.de.
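The idea of detecting repeats by k-mer counting can be illustrated with a minimal sketch. It is not Kmasker's implementation: the function names, the soft-masking convention (lower-casing) and the count threshold are all assumptions chosen for illustration, and counts are taken from the sequence itself rather than from a pre-computed index.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mask_repeats(seq, k, threshold):
    """Soft-mask (lower-case) every base covered by a k-mer that occurs
    at least `threshold` times. Both parameters are illustrative."""
    counts = count_kmers(seq, k)
    masked = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k]] >= threshold:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(b.lower() if m else b for b, m in zip(seq, masked))


# The 3-mer ACG occurs three times, so the bases it covers are masked.
print(mask_repeats("ACGACGACGTTTGCA", k=3, threshold=3))  # acgacgacgTTTGCA
```

In practice k would be much larger and the counts would come from a whole-genome read set, but the masking logic stays the same.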
Affiliation(s)
- Sebastian Beier
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Chris Ulpinnis
- Leibniz Institute of Plant Biochemistry, Bioinformatics and Scientific Data, 06120, Halle, Germany
| | - Markus Schwalbe
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Thomas Münch
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Robert Hoffie
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Iris Koeppel
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Christian Hertig
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Nagaveni Budhagatapalli
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Stefan Hiekel
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Krishna M Pathi
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Goetz Hensel
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Martin Grosse
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Sindy Chamas
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Sophia Gerasimova
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Jochen Kumlehn
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Uwe Scholz
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
| | - Thomas Schmutzer
- Department of Natural Sciences III, Institute for Agricultural and Nutritional Sciences, Martin Luther University Halle-Wittenberg, 06120, Halle, Germany
|
25
|
Peng H. CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification. PeerJ 2020; 8:e8965. [PMID: 32341900 PMCID: PMC7179567 DOI: 10.7717/peerj.8965] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/03/2020] [Accepted: 03/24/2020] [Indexed: 12/19/2022]
Abstract
BACKGROUND Conserved nucleic acid sequences play an essential role in transcriptional regulation. The motifs/templates derived from nucleic acid sequence datasets are usually used as biomarkers to predict biochemical properties such as protein binding sites or to identify specific non-coding RNAs. In many cases, template-based nucleic acid sequence classification performs better than some feature extraction methods, such as N-gram and k-spaced pairs classification. The availability of large-scale experimental data provides an unprecedented opportunity to improve motif extraction methods. The process for pattern extraction from large-scale data is crucial for the creation of predictive models. METHODS In this article, a Teiresias-like feature extraction algorithm to discover frequent sub-sequences (CFSP) is proposed. Although gaps are allowed in some motif discovery algorithms, the distance and number of gaps are limited. The proposed algorithm can find frequent sequence pairs with a larger gap. The combinations of frequent sub-sequences in given protracted sequences capture the long-distance correlation, which implies a specific molecular biological property. Hence, the proposed algorithm intends to discover the combinations. An ordered set of frequent sub-sequences derived from nucleic acid sequences is used as a base frequent sub-sequence array. Mutation information is attached to each sub-sequence array to implement fuzzy matching: each mutation record encodes a single-nucleotide variant or a nucleotide insertion/deletion (indel), capturing a slight difference between a frequent sequence and a matched subsequence of a sequence under investigation. CONCLUSIONS The proposed algorithm has been validated with several nucleic acid sequence prediction case studies. On experimental data sets such as miRNA, piRNA, and Sigma 54 promoters, it gives better results than recently available feature-descriptor-based methods.
CFSP is implemented in C++ and shell script; the source code and related data are available at https://github.com/HePeng2016/CFSP.
Affiliation(s)
- He Peng
- School of Information Science and Engineering, Xiamen University, Xiamen, Fujian, China
|
26
|
Goussarov G, Cleenwerck I, Mysara M, Leys N, Monsieurs P, Tahon G, Carlier A, Vandamme P, Van Houdt R. PaSiT: a novel approach based on short-oligonucleotide frequencies for efficient bacterial identification and typing. Bioinformatics 2020; 36:2337-2344. [PMID: 31899493 PMCID: PMC7178395 DOI: 10.1093/bioinformatics/btz964] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/19/2019] [Revised: 11/21/2019] [Accepted: 12/30/2019] [Indexed: 11/13/2022]
Abstract
MOTIVATION One of the most widespread methods used in taxonomy studies to distinguish between strains or taxa is the calculation of average nucleotide identity. It requires a computationally expensive alignment step and is therefore not suitable for large-scale comparisons. Short oligonucleotide-based methods do offer a faster alternative but at the expense of accuracy. Here, we aim to address this shortcoming by providing software that implements a novel method based on short-oligonucleotide frequencies to compute inter-genomic distances. RESULTS Our tetranucleotide and hexanucleotide implementations, which were optimized based on a taxonomically well-defined set of over 200 newly sequenced bacterial genomes, are as accurate as the short oligonucleotide-based method TETRA and average nucleotide identity, for identifying bacterial species and strains, respectively. Moreover, the lightweight nature of this method makes it applicable for large-scale analyses. AVAILABILITY AND IMPLEMENTATION The method introduced here was implemented, together with other existing methods, in GenDisCal, a dependency-free program written in C, available as source code from https://github.com/LM-UGent/GenDisCal. The software supports multithreading and has been tested on Windows and Linux (CentOS). In addition, a Java-based graphical user interface that acts as a wrapper for the software is also available. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gleb Goussarov
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
| | - Ilse Cleenwerck
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
| | - Mohamed Mysara
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
| | - Natalie Leys
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
| | - Pieter Monsieurs
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
| | - Guillaume Tahon
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
| | - Aurélien Carlier
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
- LIPM, Université de Toulouse, INRAE, CNRS, Castanet-Tolosan, France
| | - Peter Vandamme
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
| | - Rob Van Houdt
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
|
27
|
Sun JH, Ai SM, Liu SQ. Methylation-driven model for analysis of dinucleotide evolution in genomes. Theor Biol Med Model 2020; 17:3. [PMID: 32264909 PMCID: PMC7140373 DOI: 10.1186/s12976-020-00122-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2019] [Accepted: 03/10/2020] [Indexed: 11/16/2022]
Abstract
Background CpGs, the major methylation sites in vertebrate genomes, exhibit a high mutation rate from the methylated form of CpG to TpG/CpA and, therefore, influence the evolution of genome composition. However, the quantitative effects of CpG to TpG/CpA mutations on the evolution of genome composition in terms of the dinucleotide frequencies/proportions remain poorly understood. Results Based on the neutral theory of molecular evolution, we propose a methylation-driven model (MDM) that allows predicting the changes in frequencies/proportions of the 16 dinucleotides and in the GC content of a genome given the known number of CpG to TpG/CpA mutations. The application of MDM to the 10 published vertebrate genomes shows that, for most of the 16 dinucleotides and the GC content, a good consistency is achieved between the predicted and observed trends of changes in the frequencies and content relative to the assumed initial values, and that the model performs better on the mammalian genomes than it does on the lower-vertebrate genomes. The model’s performance depends on the genome composition characteristics, the assumed initial state of the genome, and the estimated parameters, one or more of which are responsible for the different application effects on the mammalian and lower-vertebrate genomes and for the large deviations of the predicted frequencies of a few dinucleotides from their observed frequencies. Conclusions Despite certain limitations of the current model, the successful application to the higher-vertebrate (mammalian) genomes witnesses its potential for facilitating studies aimed at understanding the role of methylation in driving the evolution of genome dinucleotide composition.
Affiliation(s)
- Jian-Hong Sun
- State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan & School of Life Sciences, Yunnan University, Kunming, 650091, China; College of Engineering, Honghe University, Mengzi, 661100, China
| | - Shi-Meng Ai
- Department of Applied Mathematics, Yunnan Agricultural University, Kunming, 650201, China
| | - Shu-Qun Liu
- State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan & School of Life Sciences, Yunnan University, Kunming, 650091, China.
|
28
|
LaPierre N, Ju CJT, Zhou G, Wang W. MetaPheno: A critical evaluation of deep learning and machine learning in metagenome-based disease prediction. Methods 2019; 166:74-82. [PMID: 30885720 PMCID: PMC6708502 DOI: 10.1016/j.ymeth.2019.03.003] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 12/01/2018] [Revised: 02/14/2019] [Accepted: 03/04/2019] [Indexed: 01/21/2023]
Abstract
The human microbiome plays a number of critical roles, impacting almost every aspect of human health and well-being. Conditions in the microbiome have been linked to a number of significant diseases. Additionally, revolutions in sequencing technology have led to a rapid increase in publicly-available sequencing data. Consequently, there have been growing efforts to predict disease status from metagenomic sequencing data, with a proliferation of new approaches in the last few years. Some of these efforts have explored utilizing a powerful form of machine learning called deep learning, which has been applied successfully in several biological domains. Here, we review some of these methods and the algorithms that they are based on, with a particular focus on deep learning methods. We also perform a deeper analysis of Type 2 Diabetes and obesity datasets that have eluded improved results, using a variety of machine learning and feature extraction methods. We conclude by offering perspectives on study design considerations that may impact results and future directions the field can take to improve results and offer more valuable conclusions. The scripts and extracted features for the analyses conducted in this paper are available via GitHub: https://github.com/nlapier2/metapheno.
Affiliation(s)
- Nathan LaPierre
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
| | - Chelsea J-T Ju
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
| | - Guangyu Zhou
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
| | - Wei Wang
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA.
|
29
|
Dougan TJ, Quake SR. Viral taxonomy derived from evolutionary genome relationships. PLoS One 2019; 14:e0220440. [PMID: 31412051 PMCID: PMC6693820 DOI: 10.1371/journal.pone.0220440] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 11/21/2018] [Accepted: 07/16/2019] [Indexed: 11/23/2022]
Abstract
We describe a new genome alignment-based model for understanding the diversity of viruses based on evolutionary genetic relationships. This approach uses information theory and a physical model to determine the information shared by the genes in two genomes. Pairwise comparisons of genes from the viruses are created from alignments using NCBI BLAST, and their match scores are combined to produce a metric between genomes, which is in turn used to determine a global classification using the 5,817 viruses on RefSeq. In cases where there is no measurable alignment between any genes, the method falls back to a coarser measure of genome relationship: the mutual information of 4-mer frequency. This results in a principled model which depends only on the genome sequence, which captures many interesting relationships between viral families, and which creates clusters which correlate well with both the Baltimore and ICTV classifications. The incremental computational cost of classifying a novel virus is low and therefore newly discovered viruses can be quickly identified and classified. The model goes beyond alignment-free classifications by producing a full phylogeny similar to those constructed by virologists using qualitative features, while relying only on objective genes. These results bolster the case for mathematical models in microbiology which can characterize organisms using only their genetic material and provide an independent check for phylogenies constructed by humans, considerably faster and more cheaply than less modern approaches.
Affiliation(s)
- Tyler J Dougan
- Department of Physics, Stanford University, Stanford, California, United States of America
| | - Stephen R Quake
- Departments of Bioengineering and Applied Physics, Stanford University and Chan Zuckerberg Biohub, Stanford, California, United States of America
|
30
|
Rowe WPM, Carrieri AP, Alcon-Giner C, Caim S, Shaw A, Sim K, Kroll JS, Hall LJ, Pyzer-Knapp EO, Winn MD. Streaming histogram sketching for rapid microbiome analytics. MICROBIOME 2019; 7:40. [PMID: 30878035 PMCID: PMC6420756 DOI: 10.1186/s40168-019-0653-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 09/25/2018] [Accepted: 03/01/2019] [Indexed: 06/09/2023]
Abstract
BACKGROUND The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time. RESULTS We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed 'histosketch' that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we use a 'real life' example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s. CONCLUSIONS Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified.
We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space (https://github.com/will-rowe/hulk).
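To make the notion of comparing k-mer spectra concrete, the sketch below computes an exact weighted (min-max) Jaccard similarity between two k-mer count spectra. This is a deliberately simplified stand-in and not the histosketch algorithm or HULK's API: there is no streaming, hashing or sketch compression here, and the function names and toy sequences are assumptions.

```python
from collections import Counter


def kmer_spectrum(seq, k):
    """Exact k-mer count spectrum of a sequence (no sketching)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def weighted_jaccard(a, b):
    """Min-max (weighted Jaccard) similarity of two count spectra:
    sum of element-wise minima over sum of element-wise maxima."""
    keys = a.keys() | b.keys()
    numerator = sum(min(a[x], b[x]) for x in keys)
    denominator = sum(max(a[x], b[x]) for x in keys)
    return numerator / denominator if denominator else 1.0


sample_a = kmer_spectrum("ACGTACGT", 2)
sample_b = kmer_spectrum("ACGTTTTT", 2)
print(weighted_jaccard(sample_a, sample_b))
```

A sketching method trades this exact computation for a small fixed-size summary whose pairwise comparison estimates a similarity of this kind.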
Affiliation(s)
- Will PM Rowe
- Scientific Computing Department, STFC Daresbury Laboratory, Warrington, UK
| | - Shabhonam Caim
- Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
| | - Alex Shaw
- Department of Medicine, Section of Paediatrics, Imperial College London, London, UK
| | - Kathleen Sim
- Department of Medicine, Section of Paediatrics, Imperial College London, London, UK
| | - J. Simon Kroll
- Department of Medicine, Section of Paediatrics, Imperial College London, London, UK
| | - Lindsay J. Hall
- Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
| | - Martyn D. Winn
- Scientific Computing Department, STFC Daresbury Laboratory, Warrington, UK
|
31
|
Choi I, Ponsero AJ, Bomhoff M, Youens-Clark K, Hartman JH, Hurwitz BL. Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons. Gigascience 2019; 8:5266304. [PMID: 30597002 PMCID: PMC6354030 DOI: 10.1093/gigascience/giy165] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 08/24/2018] [Accepted: 12/17/2018] [Indexed: 11/23/2022]
Abstract
Background Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content. Results We developed a tool called Libra that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra uses a scalable Hadoop framework for massive metagenome comparisons, Cosine Similarity for calculating the distance using sequence composition and abundance while normalizing for sequencing depth, and a web-based implementation in iMicrobe (http://imicrobe.us) that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community. Conclusions A comparison of Libra to equivalent tools using both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, reveals that methods commonly implemented to reduce compute time for large datasets, such as data reduction, read count normalization, and presence/absence distance metrics, greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
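The cosine similarity mentioned above can be sketched on k-mer abundance vectors as follows. This is a single-machine illustration under stated assumptions, not Libra's Hadoop implementation; the function names and toy data are hypothetical. It also shows why cosine similarity is insensitive to sequencing depth: multiplying every count by a constant leaves the angle between the vectors unchanged.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a, b):
    """Cosine similarity of two k-mer count vectors stored as Counters."""
    dot = sum(a[x] * b[x] for x in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Same composition at ten times the depth: the similarity stays at (about) 1.
shallow = kmer_counts("ACGTACGT", 3)
deep = Counter({kmer: 10 * n for kmer, n in shallow.items()})
print(cosine_similarity(shallow, deep))
```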
Affiliation(s)
- Illyoung Choi
- Department of Computer Science, University of Arizona, 1040 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Alise J Ponsero
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Matthew Bomhoff
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Ken Youens-Clark
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA
| | - John H Hartman
- Department of Computer Science, University of Arizona, 1040 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Bonnie L Hurwitz
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA; BIO5 Institute, University of Arizona, 1657 E. Helen Street, Tucson, Arizona, 85719, USA
|
32
|
Tyakht AV, Manolov AI, Kanygina AV, Ischenko DS, Kovarsky BA, Popenko AS, Pavlenko AV, Elizarova AV, Rakitina DV, Baikova JP, Ladygina VG, Kostryukova ES, Karpova IY, Semashko TA, Larin AK, Grigoryeva TV, Sinyagina MN, Malanin SY, Shcherbakov PL, Kharitonova AY, Khalif IL, Shapina MV, Maev IV, Andreev DN, Belousova EA, Buzunova YM, Alexeev DG, Govorun VM. Genetic diversity of Escherichia coli in gut microbiota of patients with Crohn's disease discovered using metagenomic and genomic analyses. BMC Genomics 2018; 19:968. [PMID: 30587114 PMCID: PMC6307143 DOI: 10.1186/s12864-018-5306-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 04/20/2018] [Accepted: 11/23/2018] [Indexed: 12/12/2022]
Abstract
Background Crohn’s disease is associated with gut dysbiosis. Independent studies have shown an increase in the abundance of certain bacterial species, particularly Escherichia coli with the adherent-invasive pathotype, in the gut. The role of these species in this disease needs to be elucidated. Methods We performed a metagenomic study investigating the gut microbiota of patients with Crohn’s disease. A metagenomic reconstruction of the consensus genome content of the species was used to assess the genetic variability. Results The abnormal shifts in the microbial community structures in Crohn’s disease were heterogeneous among the patients. The metagenomic data suggested the existence of multiple E. coli strains within individual patients. We discovered that the genetic diversity of the species was high and that only a few samples manifested similarity to the adherent-invasive varieties. The other species demonstrated genetic diversity comparable to that observed in the healthy subjects. Our results were supported by a comparison of the sequenced genomes of isolates from the same microbiota samples and a meta-analysis of published gut metagenomes. Conclusions The genomic diversity of Crohn’s disease-associated E. coli within and among the patients paves the way towards an understanding of the microbial mechanisms underlying the onset and progression of Crohn’s disease and the development of new strategies for the prevention and treatment of this disease.
Affiliation(s)
- Alexander V Tyakht
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia; Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, Russian Federation, 141700; ITMO University, 49 Kronverkskiy pr, Saint-Petersburg, Russian Federation, 197101
| | - Alexander I Manolov
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia
| | - Alexandra V Kanygina
- Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, Russian Federation, 141700
| | - Dmitry S Ischenko
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia; Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, Russian Federation, 141700
| | - Boris A Kovarsky
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia
| | - Anna S Popenko
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia
| | - Alexander V Pavlenko
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia
| | - Anna V Elizarova
- Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, Russian Federation, 141700
| | - Daria V Rakitina
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia
| | - Julia P Baikova
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia
| | - Valentina G Ladygina
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia
| | - Elena S Kostryukova
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia; Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, Russian Federation, 141700
| | - Irina Y Karpova
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia
| | - Tatyana A Semashko
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia; Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, Russian Federation, 141700
| | - Andrei K Larin
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia
| | - Tatyana V Grigoryeva
- Kazan Federal University, 18 Kremlyovskaya St., Kazan, Russian Federation, 420008
| | - Mariya N Sinyagina
- Kazan Federal University, 18 Kremlyovskaya St., Kazan, Russian Federation, 420008
| | - Sergei Y Malanin
- Kazan Federal University, 18 Kremlyovskaya St., Kazan, Russian Federation, 420008
| | - Petr L Shcherbakov
- Moscow Clinical Scientific Center, 86 Shosse Entuziastov St., Moscow, Russian Federation, 111123
| | - Anastasiya Y Kharitonova
- Clinical and Research Institute of Emergency Children's Surgery and Trauma, 22 Bolshaya Polyanka St., Moscow, Russian Federation, 119180
| | - Igor L Khalif
- State Scientific Center of Coloproctology, 2 Salam Adil St., Moscow, Russian Federation, 123423
| | - Marina V Shapina
- State Scientific Center of Coloproctology, 2 Salam Adil St., Moscow, Russian Federation, 123423
| | - Igor V Maev
- Moscow State University of Medicine and Dentistry, Build. 6, 20 Delegatskaya St., Moscow, Russian Federation, 127473
| | - Dmitriy N Andreev
- Moscow State University of Medicine and Dentistry, Build. 6, 20 Delegatskaya St., Moscow, Russian Federation, 127473
| | - Elena A Belousova
- Moscow Regional Research and Clinical Institute, 61/2 Shchepkina str, Moscow, Russian Federation, 129110
| | - Yulia M Buzunova
- Moscow Regional Research and Clinical Institute, 61/2 Shchepkina str, Moscow, Russian Federation, 129110
| | - Dmitry G Alexeev
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia; Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, Russian Federation, 141700
| | - Vadim M Govorun
- Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435, Russia; Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, Russian Federation, 141700; M.M. Shemyakin - Yu.A. Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, 16/10 Miklukho-Maklaya St., Moscow, Russian Federation, 117997
| |
|
33
|
Şener DD, Santoni D, Felici G, Oğul H. A Content-Based Retrieval Framework for Whole Metagenome Sequencing Samples. J Integr Bioinform 2018; 15:/j/jib.ahead-of-print/jib-2017-0067/jib-2017-0067.xml. [PMID: 30367805 PMCID: PMC6348744 DOI: 10.1515/jib-2017-0067] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/31/2017] [Accepted: 04/11/2018] [Indexed: 11/15/2022] Open
Abstract
Finding similarities and differences between metagenomic samples within large repositories is a significant challenge for researchers. In recent years, various studies have proposed content-based retrieval from different perspectives. In this study, a content-based retrieval framework for identifying relevant metagenomic samples is developed. The framework consists of feature extraction, selection methods and similarity measures for whole metagenome sequencing samples. Performance of the developed framework was evaluated on given samples. A ground truth was used to evaluate system performance: retrieved samples from patients with the same disease (positive samples) are labeled as relevant, and all others as irrelevant. The experimental results show that relevant experiments can be detected by using different fingerprinting approaches. We observed that Latent Semantic Analysis (LSA) is a promising fingerprinting approach for representing metagenomic samples and finding relevance among them. Source codes and executable files are available at www.baskent.edu.tr/~hogul/WMS_retrieval.rar.
Affiliation(s)
- Duygu Dede Şener
- Başkent University, Faculty of Engineering, Computer Engineering Department, Ankara, Turkey
| | - Daniele Santoni
- Institute of Systems Analysis and Computer Science "A. Ruberti", National Research Council, Rome, Italy
| | - Giovanni Felici
- Institute of Systems Analysis and Computer Science "A. Ruberti", National Research Council, Rome, Italy
| | - Hasan Oğul
- Başkent University, Faculty of Engineering, Computer Engineering Department, Ankara, Turkey
| |
|
34
|
Wang Z, Lou H, Wang Y, Shamir R, Jiang R, Chen T. GePMI: A statistical model for personal intestinal microbiome identification. NPJ Biofilms Microbiomes 2018; 4:20. [PMID: 30210803 PMCID: PMC6123480 DOI: 10.1038/s41522-018-0065-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 01/26/2018] [Revised: 07/19/2018] [Accepted: 08/02/2018] [Indexed: 02/07/2023] Open
Abstract
Human gut microbiomes consist of a large number of microbial genomes, which vary by diet and health conditions and from individual to individual. In the present work, we asked whether such variation or similarity could be measured and, if so, whether the results could be used for personal microbiome identification (PMI). To address this question, we herein propose a method to estimate the significance of similarity among human gut metagenomic samples based on reference-free, long k-mer features. Using these features, we find that pairwise similarities between the metagenomes of any two individuals obey a beta distribution and that a p value derived accordingly well characterizes whether two samples are from the same individual or not. We develop a computational framework called GePMI (Generating inter-individual similarity distribution for Personal Microbiome Identification) and apply it to several human gut metagenomic datasets (>300 individuals and >600 samples in total). From the results of GePMI, most of the human gut microbiomes can be identified (auROC = 0.9470, auPRC = 0.8702). Even after antibiotic treatment or fecal microbiota transplantation, the individual k-mer signature still maintains a certain specificity.
Affiliation(s)
- Zicheng Wang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNLIST and Department of Automation, Tsinghua University, 100084 Beijing, China
| | - Huazhe Lou
- Bioinformatics Division, BNLIST and Department of Computer Science and Technology, Tsinghua University, 100084 Beijing, China
| | - Ying Wang
- Department of Automation, Xiamen University, 361005 Fujian, China
| | - Ron Shamir
- Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv, Israel
| | - Rui Jiang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNLIST and Department of Automation, Tsinghua University, 100084 Beijing, China
| | - Ting Chen
- Bioinformatics Division, BNLIST and Department of Computer Science and Technology, Tsinghua University, 100084 Beijing, China
| |
|
35
|
Hirai M, Nishi S, Tsuda M, Sunamura M, Takaki Y, Nunoura T. Library Construction from Subnanogram DNA for Pelagic Sea Water and Deep-Sea Sediments. Microbes Environ 2017; 32:336-343. [PMID: 29187708 PMCID: PMC5745018 DOI: 10.1264/jsme2.me17132] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Indexed: 11/12/2022] Open
Abstract
Shotgun metagenomics is a low-bias technology for assessing environmental microbial diversity and function. However, the requirement for a sufficient amount of DNA and the presence of inhibitors in environmental DNA make it difficult to construct a shotgun metagenomic library. We herein examined metagenomic library construction from subnanogram amounts of input environmental DNA from subarctic surface water and deep-sea sediments using two library construction kits, the KAPA Hyper Prep Kit and the Nextera XT DNA Library Preparation Kit, with several modifications. The influence of chemical contaminants associated with these environmental DNA samples on library construction was also investigated. Overall, shotgun metagenomic libraries were constructed from 1 pg to 1 ng of input DNA using both kits without severe microbial contamination of the libraries. However, the libraries constructed from 1 pg of input DNA exhibited larger biases in GC content, k-mer, or small subunit (SSU) rRNA gene composition than those constructed from 10 pg to 1 ng of DNA. The lower limit of input DNA for low-bias library construction in this study was 10 pg. Moreover, we revealed that technology-dependent biases (physical fragmentation and linker ligation vs. tagmentation) were larger than those due to the amount of input DNA.
Affiliation(s)
- Miho Hirai
- Research and Development (R&D) Center for Marine Biosciences, Japan Agency for Marine-Earth Science and Technology (JAMSTEC)
| | - Shinro Nishi
- Research and Development (R&D) Center for Marine Biosciences, Japan Agency for Marine-Earth Science and Technology (JAMSTEC); Ecosystem Observation and Evaluation Methodology Research Unit, Project Team for Development of New-generation Research Protocol for Submarine Resources, Japan Agency for Marine-Earth Science and Technology (JAMSTEC)
| | - Miwako Tsuda
- Ecosystem Observation and Evaluation Methodology Research Unit, Project Team for Development of New-generation Research Protocol for Submarine Resources, Japan Agency for Marine-Earth Science and Technology (JAMSTEC)
| | - Michinari Sunamura
- Ecosystem Observation and Evaluation Methodology Research Unit, Project Team for Development of New-generation Research Protocol for Submarine Resources, Japan Agency for Marine-Earth Science and Technology (JAMSTEC); Department of Earth and Planetary Science, The University of Tokyo
| | - Yoshihiro Takaki
- Research and Development (R&D) Center for Marine Biosciences, Japan Agency for Marine-Earth Science and Technology (JAMSTEC); Ecosystem Observation and Evaluation Methodology Research Unit, Project Team for Development of New-generation Research Protocol for Submarine Resources, Japan Agency for Marine-Earth Science and Technology (JAMSTEC); Department of Subsurface Geobiological Analysis and Research, Japan Agency for Marine-Earth Science and Technology (JAMSTEC)
| | - Takuro Nunoura
- Research and Development (R&D) Center for Marine Biosciences, Japan Agency for Marine-Earth Science and Technology (JAMSTEC); Ecosystem Observation and Evaluation Methodology Research Unit, Project Team for Development of New-generation Research Protocol for Submarine Resources, Japan Agency for Marine-Earth Science and Technology (JAMSTEC)
| |
|
36
|
Dubinkina VB, Tyakht AV, Odintsova VY, Yarygin KS, Kovarsky BA, Pavlenko AV, Ischenko DS, Popenko AS, Alexeev DG, Taraskina AY, Nasyrova RF, Krupitsky EM, Shalikiani NV, Bakulin IG, Shcherbakov PL, Skorodumova LO, Larin AK, Kostryukova ES, Abdulkhakov RA, Abdulkhakov SR, Malanin SY, Ismagilova RK, Grigoryeva TV, Ilina EN, Govorun VM. Links of gut microbiota composition with alcohol dependence syndrome and alcoholic liver disease. MICROBIOME 2017; 5:141. [PMID: 29041989 PMCID: PMC5645934 DOI: 10.1186/s40168-017-0359-2] [Citation(s) in RCA: 283] [Impact Index Per Article: 40.4] [Received: 01/19/2017] [Accepted: 10/02/2017] [Indexed: 05/21/2023]
Abstract
BACKGROUND: Alcohol abuse has deleterious effects on human health by disrupting the functions of many organs and systems. Gut microbiota has been implicated in the pathogenesis of alcohol-related liver diseases, with its composition showing pronounced dysbiosis in patients suffering from alcoholic dependence. Due to its inherent plasticity, gut microbiota is an important target for prevention and treatment of these diseases. Identification of the impact of alcohol abuse with associated psychiatric symptoms on the gut community structure is confounded by liver dysfunction. In order to differentiate the effects of these two factors, we conducted a comparative "shotgun" metagenomic survey of 99 patients with the alcohol dependence syndrome represented by two cohorts, with and without liver cirrhosis. The taxonomic and functional composition of the gut microbiota was subjected to a multifactor analysis including comparison with the external control group. RESULTS: Alcoholic dependence and liver cirrhosis were associated with profound shifts in gut community structures and metabolic potential across the patients. The specific effects on species-level community composition were remarkably different between cohorts with and without liver cirrhosis. In both cases, the commensal microbiota was found to be depleted. Alcoholic dependence was inversely associated with the levels of butyrate-producing species from the Clostridiales order, while cirrhosis was inversely associated with multiple members of the Bacteroidales order. The opportunist pathogens linked to alcoholic dependence included pro-inflammatory Enterobacteriaceae, while the hallmarks of cirrhosis included an increase of oral microbes in the gut and more frequent occurrence of abnormal community structures. Interestingly, each of the two factors was associated with pronounced enrichment in many Bifidobacterium and Lactobacillus species, but the exact set of species differed between alcoholic dependence and liver cirrhosis. At the level of functional potential, the patients showed different patterns of increase in functions related to alcohol metabolism and virulence factors, as well as pathways related to inflammation. CONCLUSIONS: Multiple shifts in the community structure and metabolic potential suggest a strong negative influence of alcohol dependence and associated liver dysfunction on gut microbiota. The identified differences in patterns of impact between these two factors are important for planning personalized treatment and prevention of these pathologies via microbiota modulation. In particular, the expansion of Bifidobacterium and Lactobacillus suggests that probiotic interventions for patients with alcohol-related disorders using representatives of the same taxa should be considered with caution. Taxonomic and functional analysis shows an increased propensity of the gut microbiota to synthesize toxic acetaldehyde, suggesting a higher risk of colorectal cancer and other pathologies in alcoholics.
Affiliation(s)
- Veronika B. Dubinkina
- Moscow Institute of Physics and Technology, Institutskiy per. 9, Dolgoprudny, Moscow Region, 141700 Russia
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
- Department of Bioengineering, University of Illinois at Urbana-Champaign, 1304 W. Springfield Avenue Urbana, Champaign, IL 61801 USA
- Carl R. Woese Institute for Genomic Biology, 1206 West Gregory Drive, Urbana, IL 61801 USA
| | - Alexander V. Tyakht
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
- ITMO University, Kronverkskiy pr. 49, Saint-Petersburg, 197101 Russia
| | - Vera Y. Odintsova
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
| | - Konstantin S. Yarygin
- Moscow Institute of Physics and Technology, Institutskiy per. 9, Dolgoprudny, Moscow Region, 141700 Russia
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
| | - Boris A. Kovarsky
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
| | - Alexander V. Pavlenko
- Moscow Institute of Physics and Technology, Institutskiy per. 9, Dolgoprudny, Moscow Region, 141700 Russia
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
| | - Dmitry S. Ischenko
- Moscow Institute of Physics and Technology, Institutskiy per. 9, Dolgoprudny, Moscow Region, 141700 Russia
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
| | - Anna S. Popenko
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
| | - Dmitry G. Alexeev
- Moscow Institute of Physics and Technology, Institutskiy per. 9, Dolgoprudny, Moscow Region, 141700 Russia
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
| | - Anastasiya Y. Taraskina
- Saint-Petersburg Bekhterev Psychoneurological Research Institute, Bekhtereva 3, Saint-Petersburg, 192019 Russia
| | - Regina F. Nasyrova
- Saint-Petersburg Bekhterev Psychoneurological Research Institute, Bekhtereva 3, Saint-Petersburg, 192019 Russia
| | - Evgeny M. Krupitsky
- Saint-Petersburg Bekhterev Psychoneurological Research Institute, Bekhtereva 3, Saint-Petersburg, 192019 Russia
| | - Nino V. Shalikiani
- Moscow Clinical Scientific Center, Shosse Entuziastov 86, Moscow, 111123 Russia
| | - Igor G. Bakulin
- Moscow Clinical Scientific Center, Shosse Entuziastov 86, Moscow, 111123 Russia
| | - Petr L. Shcherbakov
- Moscow Clinical Scientific Center, Shosse Entuziastov 86, Moscow, 111123 Russia
| | - Lyubov O. Skorodumova
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
| | - Andrei K. Larin
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
| | - Elena S. Kostryukova
- Moscow Institute of Physics and Technology, Institutskiy per. 9, Dolgoprudny, Moscow Region, 141700 Russia
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
| | | | - Sayar R. Abdulkhakov
- Kazan State Medical University, Butlerova 49, Kazan, 420012 Russia
- Kazan Federal University, Kremlyovskaya 18, Kazan, 420008 Russia
| | | | | | | | - Elena N. Ilina
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
| | - Vadim M. Govorun
- Moscow Institute of Physics and Technology, Institutskiy per. 9, Dolgoprudny, Moscow Region, 141700 Russia
- Federal Research and Clinical Center of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, 119435 Russia
| |
|
37
|
Beisser D, Graupner N, Grossmann L, Timm H, Boenigk J, Rahmann S. TaxMapper: an analysis tool, reference database and workflow for metatranscriptome analysis of eukaryotic microorganisms. BMC Genomics 2017; 18:787. [PMID: 29037173 PMCID: PMC5644092 DOI: 10.1186/s12864-017-4168-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 07/13/2017] [Accepted: 10/05/2017] [Indexed: 12/17/2022] Open
Abstract
Background: High-throughput sequencing (HTS) technologies are increasingly applied to analyse complex microbial ecosystems by mRNA sequencing of whole communities, also known as metatranscriptome sequencing. This approach is at the moment largely limited to prokaryotic communities and communities of few eukaryotic species with sequenced genomes. For eukaryotes the analysis is hindered mainly by a low and fragmented coverage of the reference databases to infer the community composition, but also by lack of automated workflows for the task. Results: From the databases of the National Center for Biotechnology Information and Marine Microbial Eukaryote Transcriptome Sequencing Project, 142 references were selected in such a way that the taxa represent the main lineages within each of the seven supergroups of eukaryotes and possess predominantly complete transcriptomes or genomes. From these references, we created an annotated microeukaryotic reference database. We developed a tool called TaxMapper for reliable mapping of sequencing reads against this database and filtering of unreliable assignments. For filtering, a classifier was trained and tested on each of the following: sequences of taxa in the database, sequences of taxa related to those in the database, and random sequences. Additionally, TaxMapper is part of a metatranscriptomic Snakemake workflow developed to perform quality assessment, functional and taxonomic annotation and (multivariate) statistical analysis including environmental data. The workflow is provided and described in detail to empower researchers to apply it for metatranscriptome analysis of any environmental sample. Conclusions: TaxMapper shows superior performance compared to standard approaches, resulting in a higher number of true positive taxonomic assignments. Both the TaxMapper tool and the workflow are available as open-source code at Bitbucket under the MIT license: https://bitbucket.org/dbeisser/taxmapper and as a Bioconda package: https://bioconda.github.io/recipes/taxmapper/README.html. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-4168-6) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Daniela Beisser
- Biodiversity, University of Duisburg-Essen, Universitätsstr. 5, Essen, 45141, Germany.
| | - Nadine Graupner
- Biodiversity, University of Duisburg-Essen, Universitätsstr. 5, Essen, 45141, Germany
| | - Lars Grossmann
- Biodiversity, University of Duisburg-Essen, Universitätsstr. 5, Essen, 45141, Germany
| | - Henning Timm
- Genome Informatics, University of Duisburg-Essen, University Hospital Essen, Hufelandstr. 55, Essen, 45147, Germany
| | - Jens Boenigk
- Biodiversity, University of Duisburg-Essen, Universitätsstr. 5, Essen, 45141, Germany
| | - Sven Rahmann
- Genome Informatics, University of Duisburg-Essen, University Hospital Essen, Hufelandstr. 55, Essen, 45147, Germany
| |
|
38
|
Philips A, Stolarek I, Kuczkowska B, Juras A, Handschuh L, Piontek J, Kozlowski P, Figlerowicz M. Comprehensive analysis of microorganisms accompanying human archaeological remains. Gigascience 2017; 6:1-13. [PMID: 28609785 PMCID: PMC5965364 DOI: 10.1093/gigascience/gix044] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 03/14/2017] [Revised: 05/09/2017] [Accepted: 06/11/2017] [Indexed: 02/01/2023] Open
Abstract
Metagenome analysis has become a common source of information about microbial communities that occupy a wide range of niches, including archaeological specimens. It has been shown that the vast majority of DNA extracted from ancient samples comes from bacteria (presumably modern contaminants). However, characterization of microbial DNA accompanying human remains has never been done systematically for a wide range of different samples. We used metagenomic approaches to perform comparative analyses of microorganism communities present in 161 archaeological human remains. DNA samples were isolated from the teeth of human skeletons dated from 100 AD to 1200 AD. The skeletons were collected from 7 archaeological sites in Central Europe and stored under different conditions. The majority of identified microbes were ubiquitous environmental bacteria that most likely contaminated the host remains not long ago. We observed that the composition of microbial communities was sample-specific and not correlated with its temporal or geographical origin. Additionally, traces of bacteria and archaea typical of human oral/gut flora, as well as potential pathogens, were identified in two-thirds of the samples. The genetic material of human-related species, in contrast to the environmental species that accounted for the majority of identified bacteria, displayed DNA damage patterns comparable with endogenous human ancient DNA, which suggested that these microbes might have accompanied the individual before death. Our study showed that the microbiome observed in an individual sample is not reliant on the method or duration of sample storage. Moreover, shallow sequencing of DNA extracted from ancient specimens and subsequent bioinformatics analysis allowed both the identification of ancient microbial species, including potential pathogens, and their differentiation from contemporary species that colonized human remains more recently.
Affiliation(s)
- Anna Philips
- European Center for Bioinformatics and Genomics, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, 61-704, Poland
| | - Ireneusz Stolarek
- European Center for Bioinformatics and Genomics, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, 61-704, Poland
| | - Bogna Kuczkowska
- European Center for Bioinformatics and Genomics, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, 61-704, Poland
| | - Anna Juras
- Department of Human Evolutionary Biology, Institute of Anthropology, Faculty of Biology, Adam Mickiewicz University in Poznan, Poznan, 61-614, Poland
| | - Luiza Handschuh
- European Center for Bioinformatics and Genomics, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, 61-704, Poland
- Department of Hematology and Bone Marrow Transplantation, University of Medical Sciences, Poznan, 60-569, Poland
- Institute of Technology and Chemical Engineering, Poznan University of Technology, Poznan, 60-965, Poland
| | - Janusz Piontek
- Department of Human Evolutionary Biology, Institute of Anthropology, Faculty of Biology, Adam Mickiewicz University in Poznan, Poznan, 61-614, Poland
| | - Piotr Kozlowski
- European Center for Bioinformatics and Genomics, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, 61-704, Poland
- Institute of Technology and Chemical Engineering, Poznan University of Technology, Poznan, 60-965, Poland
| | - Marek Figlerowicz
- European Center for Bioinformatics and Genomics, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, 61-704, Poland
- Institute of Computing Science, Poznan University of Technology, Poznan, 60-965, Poland
| |
|
39
|
Forsdyke DR. Base Composition, Speciation, and Why the Mitochondrial Barcode Precisely Classifies. Biol Theory 2017. [DOI: 10.1007/s13752-017-0267-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 01/08/2023]
|
40
|
Liu S, Zheng J, Migeon P, Ren J, Hu Y, He C, Liu H, Fu J, White FF, Toomajian C, Wang G. Unbiased K-mer Analysis Reveals Changes in Copy Number of Highly Repetitive Sequences During Maize Domestication and Improvement. Sci Rep 2017; 7:42444. [PMID: 28186206 PMCID: PMC5301235 DOI: 10.1038/srep42444] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 10/24/2016] [Accepted: 01/10/2017] [Indexed: 12/15/2022] Open
Abstract
The major component of complex genomes is repetitive elements, which remain recalcitrant to characterization. Using maize as a model system, we analyzed whole genome shotgun (WGS) sequences for the two maize inbred lines B73 and Mo17 using k-mer analysis to quantify the differences between the two genomes. Significant differences were identified in highly repetitive sequences, including centromere, 45S ribosomal DNA (rDNA), knob, and telomere repeats. Genotype-specific 45S rDNA sequences were discovered. The B73 and Mo17 polymorphic k-mers were used to examine allele-specific expression of 45S rDNA in the hybrids. Although Mo17 contains a higher copy number than B73, equivalent levels of overall 45S rDNA expression indicate that transcriptional or post-transcriptional regulation mechanisms operate for the 45S rDNA in the hybrids. Using WGS sequences of B73xMo17 doubled haploids, genomic locations showing differential repetitive contents were genetically mapped, which displayed different organization of highly repetitive sequences in the two genomes. In an analysis of WGS sequences of HapMap2 lines, including the maize wild progenitor, landraces, and improved lines, decreases and increases in abundance of additional sets of k-mers associated with centromere, 45S rDNA, knob, and retrotransposons were found among groups, revealing global evolutionary trends of genomic repeats during maize domestication and improvement.
Affiliation(s)
- Sanzhen Liu
- Department of Plant Pathology, Kansas State University, Manhattan, KS, 66506, USA
| | - Jun Zheng
- Institute of Crop Science, Chinese Academy of Agricultural Sciences, Beijing 100081, P.R.China
| | - Pierre Migeon
- Department of Plant Pathology, Kansas State University, Manhattan, KS, 66506, USA
| | - Jie Ren
- Department of Plant Pathology, Kansas State University, Manhattan, KS, 66506, USA
| | - Ying Hu
- Department of Plant Pathology, Kansas State University, Manhattan, KS, 66506, USA
| | - Cheng He
- Institute of Crop Science, Chinese Academy of Agricultural Sciences, Beijing 100081, P.R.China
| | - Hongjun Liu
- State Key Laboratory of Crop Biology, Shandong Key Laboratory of Crop Biology, Taian 271018, P.R. China; College of Life Sciences, Shandong Agricultural University, Taian 271018, P.R. China
| | - Junjie Fu
- Institute of Crop Science, Chinese Academy of Agricultural Sciences, Beijing 100081, P.R.China
| | - Frank F White
- Department of Plant Pathology, University of Florida, Gainesville, FL, 32611, USA
| | | | - Guoying Wang
- Institute of Crop Science, Chinese Academy of Agricultural Sciences, Beijing 100081, P.R.China
| |
|
41
|
Rosani U, Gerdol M. A bioinformatics approach reveals seven nearly-complete RNA-virus genomes in bivalve RNA-seq data. Virus Res 2016; 239:33-42. [PMID: 27769778 DOI: 10.1016/j.virusres.2016.10.009] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 08/15/2016] [Revised: 10/17/2016] [Accepted: 10/17/2016] [Indexed: 01/17/2023]
Abstract
Viral metagenomics (viromics) can contribute greatly to expanding the knowledge of viruses and their relationship with their hosts. Viromic studies on marine organisms are still at a very early stage, and little effort has been spent to date on identifying viruses associated with marine invertebrates, leaving the complexity of marine viromes associated with bivalve hosts almost completely unexplored. However, the potential use of viromic approaches in the management of viral diseases affecting aquacultured species has recently been demonstrated by the flourishing of studies on the Ostreid herpesvirus type-1, which has been associated with bivalve mortality events. Herein we discuss an effective pipeline to retrieve and reconstruct nearly complete and previously unreported viral genomes from existing host RNA-seq data. As a case study, we report the identification of seven RNA-virus genomes within the frame of a highly diversified viral community that characterizes both Crassostrea gigas and Mytilus galloprovincialis samples collected from the lagoon of Goro (Italy).
Affiliation(s)
- Umberto Rosani
- Dept. of Biology, University of Padua, Via U. Bassi 58/B, 35121 Padova Italy.
| | - Marco Gerdol
- Dept. of Life Sciences, University of Trieste, Via L. Giorgieri 5, 34127 Trieste Italy
| |
|