1
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
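As a rough illustration of the k-mer and absent-sequence ideas this abstract surveys, the sketch below counts overlapping k-mers in a DNA string and lists the k-mers of a fixed length that never occur in it. It is not taken from the review: the function names `count_kmers` and `absent_kmers` are hypothetical, only one strand is scanned, and nullomers in the literature are usually defined over whole genomes rather than a single short string, so this is a toy example under those assumptions.

```python
from collections import Counter
from itertools import product

DNA_ALPHABET = "ACGT"


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(DNA_ALPHABET):
            counts[kmer] += 1
    return counts


def absent_kmers(seq, k):
    """Return, in sorted order, all k-mers over ACGT that never occur in seq.

    Assumption: only the given strand is checked (no reverse complements).
    """
    present = count_kmers(seq, k)
    all_kmers = ("".join(p) for p in product(DNA_ALPHABET, repeat=k))
    return sorted(kmer for kmer in all_kmers if kmer not in present)


if __name__ == "__main__":
    counts = count_kmers("ACGTACGT", 2)
    print(counts["AC"])                      # 2
    print(len(absent_kmers("ACGTACGT", 2)))  # 12 of the 16 possible 2-mers
```

The number of possible k-mers grows as 4^k, so enumerating all of them as done here is only practical for small k.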
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
2
Ponsero AJ, Miller M, Hurwitz BL. Comparison of k-mer-based de novo comparative metagenomic tools and approaches. Microbiome Research Reports 2023; 2:27. [PMID: 38058765 PMCID: PMC10696585 DOI: 10.20517/mrr.2023.26] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Revised: 06/28/2023] [Accepted: 07/12/2023] [Indexed: 12/08/2023]
Abstract
Aim: Comparative metagenomic analysis requires measuring a pairwise similarity between metagenomes in the dataset. Reference-based methods that compute a beta-diversity distance between two metagenomes are highly dependent on the quality and completeness of the reference database, and their application to less studied microbiota can be challenging. On the other hand, de novo comparative metagenomic methods rely only on the sequence composition of metagenomes to compare datasets. While each of these approaches has its strengths and limitations, their comparison is currently limited.
Methods: We developed sets of simulated short-read metagenomes to (1) compare k-mer-based and taxonomy-based distances and evaluate the impact of technical and biological variables on these metrics and (2) evaluate the effect of k-mer sketching and filtering. We used a real-world metagenomic dataset to provide an overview of the currently available tools for de novo metagenomic comparative analysis.
Results: Using simulated metagenomes of known composition and controlled error rate, we showed that k-mer-based distance metrics were well correlated with the taxonomic distance metric for quantitative beta-diversity metrics, but the correlation was low for presence/absence distances. The community complexity in terms of taxa richness and the sequencing depth significantly affected the quality of the k-mer-based distances, while the impact of low amounts of sequence contamination and sequencing error was limited. Finally, we benchmarked currently available de novo comparative metagenomic tools and compared their output on two datasets of fecal metagenomes and showed that most k-mer-based tools were able to recapitulate the data structure observed using taxonomic approaches.
Conclusion: This study expands our understanding of the strength and limitations of k-mer-based de novo comparative metagenomic approaches and aims to provide concrete guidelines for researchers interested in applying these approaches to their metagenomic datasets.
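The distinction this abstract draws between quantitative and presence/absence k-mer distances can be sketched in a few lines. The abstract does not name the specific metrics used, so Bray-Curtis (abundance-based) and Jaccard (presence/absence) are assumptions chosen as common examples, and the function names `kmer_profile`, `jaccard_distance`, and `bray_curtis_distance` are hypothetical. Raw counts are compared without normalizing for sequencing depth or sketching, both of which the study treats as important variables.

```python
from collections import Counter


def kmer_profile(reads, k):
    """Pool the overlapping k-mer counts of all reads in one sample."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def jaccard_distance(a, b):
    """Presence/absence distance: 1 - |shared k-mers| / |all k-mers|."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    if not union:
        return 0.0
    return 1.0 - len(keys_a & keys_b) / len(union)


def bray_curtis_distance(a, b):
    """Quantitative distance: 1 - 2 * sum(min counts) / (total counts in both)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


if __name__ == "__main__":
    sample_a = kmer_profile(["AAAA"], 2)  # AA x3
    sample_b = kmer_profile(["AAAT"], 2)  # AA x2, AT x1
    print(jaccard_distance(sample_a, sample_b))      # 0.5
    print(bray_curtis_distance(sample_a, sample_b))  # about 0.333
```

In this toy pair a single rare k-mer moves the presence/absence distance to 0.5 while the abundance-based distance stays near 0.33, which hints at why the two families of metrics can behave differently.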
Affiliation(s)
- Alise Jany Ponsero
- Human Microbiome Research Program, University of Helsinki, Helsinki 00290, Finland
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ 85721, USA
- BIO5 Institute, The University of Arizona, Tucson, AZ 85721, USA
- Matthew Miller
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ 85721, USA
- Bonnie Louise Hurwitz
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ 85721, USA
- BIO5 Institute, The University of Arizona, Tucson, AZ 85721, USA
3
Karikari B, Lemay MA, Belzile F. k-mer-Based Genome-Wide Association Studies in Plants: Advances, Challenges, and Perspectives. Genes (Basel) 2023; 14:1439. [PMID: 37510343 PMCID: PMC10379394 DOI: 10.3390/genes14071439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 07/04/2023] [Accepted: 07/07/2023] [Indexed: 07/30/2023] Open
Abstract
Genome-wide association studies (GWAS) have allowed the discovery of marker-trait associations in crops over recent decades. However, their power is hampered by a number of limitations, with the key one among them being an overreliance on single-nucleotide polymorphisms (SNPs) as molecular markers. Indeed, SNPs represent only one type of genetic variation and are usually derived from alignment to a single genome assembly that may be poorly representative of the population under study. To overcome this, k-mer-based GWAS approaches have recently been developed. k-mer-based GWAS provide a universal way to assess variation due to SNPs, insertions/deletions, and structural variations without having to specifically detect and genotype these variants. In addition, k-mer-based analyses can be used in species that lack a reference genome. However, the use of k-mers for GWAS presents challenges such as data size and complexity, lack of standard tools, and potential detection of false associations. Nevertheless, efforts are being made to overcome these challenges and a general analysis workflow has started to emerge. We identify the priorities for k-mer-based GWAS in years to come, notably in the development of user-friendly programs for their analysis and approaches for linking significant k-mers to sequence variation.
Affiliation(s)
- Benjamin Karikari
- Département de Phytologie, Université Laval, Quebec City, QC G1V 0A6, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada
- Department of Agricultural Biotechnology, Faculty of Agriculture, Food and Consumer Sciences, University for Development Studies, Tamale P.O. Box TL 1882, Ghana
- Marc-André Lemay
- Département de Phytologie, Université Laval, Quebec City, QC G1V 0A6, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada
- François Belzile
- Département de Phytologie, Université Laval, Quebec City, QC G1V 0A6, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada
4
Zheng Y, Shi J, Chen Q, Deng C, Yang F, Wang Y. Identifying individual-specific microbial DNA fingerprints from skin microbiomes. Front Microbiol 2022; 13:960043. [PMID: 36274714 PMCID: PMC9583911 DOI: 10.3389/fmicb.2022.960043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2022] [Accepted: 09/16/2022] [Indexed: 11/22/2022] Open
Abstract
Skin is an important ecosystem that links the human body and the external environment. Previous studies have shown that the skin microbial community can remain stable, even after long-term exposure to the external environment. In this study, we explore two questions: Do skin microorganisms harbor strains or genetic variants that are individual-specific, temporally stable, and body site-independent? And if so, can such microbial genetic variants be used as markers, called “fingerprints” in our study, to identify donors? We propose a framework to capture individual-specific microbial DNA fingerprints from skin metagenomic sequencing data. The fingerprints are identified from the frequencies of 31-mers, without reference genomes or sequence alignments. We used 616 metagenomic samples from 17 skin sites at three time points from 12 healthy individuals in the Integrative Human Microbiome Project. Ultimately, one contig is assembled for each individual as a fingerprint. The results showed that 89.78% of the skin samples, regardless of body site, identified their donors correctly. Ten of the 12 individual-specific fingerprints could be aligned to Cutibacterium acnes. Our study shows that the identified fingerprints are temporally stable, body site-independent, and individual-specific, and can identify their donors with sufficient accuracy. The source code of the genetic identification framework is freely available at https://github.com/Ying-Lab/skin_fingerprint.
Affiliation(s)
- Yiluan Zheng
- Department of Automation, Xiamen University, Xiamen, China
- National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, China
- Jianlu Shi
- Stomatological Hospital of Xiamen Medical College, Xiamen, China
- Xiamen Key Laboratory of Stomatological Disease Diagnosis and Treatment, Xiamen, China
- Qi Chen
- Department of Automation, Xiamen University, Xiamen, China
- Chao Deng
- Department of Automation, Xiamen University, Xiamen, China
- Fan Yang
- Department of Automation, Xiamen University, Xiamen, China
- Ying Wang
- Department of Automation, Xiamen University, Xiamen, China
- Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen, China
- Fujian Key Laboratory of Genetics and Breeding of Marine Organisms, Xiamen, China
- *Correspondence: Ying Wang
5
Walker K, Kalra D, Lowdon R, Chen G, Molik D, Soto DC, Dabbaghie F, Khleifat AA, Mahmoud M, Paulin LF, Raza MS, Pfeifer SP, Agustinho DP, Aliyev E, Avdeyev P, Barrozo ER, Behera S, Billingsley K, Chong LC, Choubey D, De Coster W, Fu Y, Gener AR, Hefferon T, Henke DM, Höps W, Illarionova A, Jochum MD, Jose M, Kesharwani RK, Kolora SRR, Kubica J, Lakra P, Lattimer D, Liew CS, Lo BW, Lo C, Lötter A, Majidian S, Mendem SK, Mondal R, Ohmiya H, Parvin N, Peralta C, Poon CL, Prabhakaran R, Saitou M, Sammi A, Sanio P, Sapoval N, Syed N, Treangen T, Wang G, Xu T, Yang J, Zhang S, Zhou W, Sedlazeck FJ, Busby B. The third international hackathon for applying insights into large-scale genomic composition to use cases in a wide range of organisms. F1000Res 2022; 11:530. [PMID: 36262335 PMCID: PMC9557141 DOI: 10.12688/f1000research.110194.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/04/2022] [Indexed: 01/25/2023] Open
Abstract
In October 2021, 59 scientists from 14 countries and 13 U.S. states collaborated virtually in the Third Annual Baylor College of Medicine & DNANexus Structural Variation hackathon. The goal of the hackathon was to advance research on structural variants (SVs) by prototyping and iterating on open-source software. This led to nine hackathon projects focused on diverse genomics research interests, including various SV discovery and genotyping methods, SV sequence reconstruction, and clinically relevant structural variation, including SARS-CoV-2 variants. Repositories for the projects that participated in the hackathon are available at https://github.com/collaborativebioinformatics.
Affiliation(s)
- Kimberly Walker
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Divya Kalra
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Guangyi Chen
- Drug Bioinformatics, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Saarbrücken, Germany
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany
- David Molik
- Tropical Crop and Commodity Protection Research Unit, Pacific Basin Agricultural Research Center, Hilo, HI, 96720, USA
- Daniela C. Soto
- Biochemistry & Molecular Medicine, Genome Center, MIND Institute, University of California, Davis, Davis, CA, 95616, USA
- Fawaz Dabbaghie
- Drug Bioinformatics, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Saarbrücken, Germany
- Institute for Medical Biometry and Bioinformatics, University Hospital Düsseldorf, Düsseldorf, Germany
- Ahmad Al Khleifat
- Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
- Medhat Mahmoud
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Luis F Paulin
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Muhammad Sohail Raza
- CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Beijing, China
- Susanne P. Pfeifer
- Center for Evolution and Medicine, Arizona State University, Tempe, AZ, USA
- Daniel Paiva Agustinho
- Department of Molecular Microbiology, Washington University in St. Louis School of Medicine, St. Louis, MO, 63110, USA
- Elbay Aliyev
- Research Department, Sidra Medicine, Doha, Qatar
- Pavel Avdeyev
- Computational Biology Institute, The George Washington University, Washington, DC, 20052, USA
- Enrico R. Barrozo
- Department of Obstetrics & Gynecology, Baylor College of Medicine, Houston, TX, 77030, USA
- Sairam Behera
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Kimberley Billingsley
- Molecular Genetics Section, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Li Chuin Chong
- Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz, Istanbul, Turkey
- Deepak Choubey
- Department of Technology, Savitribai Phule Pune University, Pune, Maharashtra, India
- Wouter De Coster
- Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, Antwerp, Belgium
- Applied and Translational Neurogenomics Group, Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
- Yilei Fu
- Department of Computer Science, Rice University, Houston, TX, USA
- Alejandro R. Gener
- Association of Public Health Labs, Centers for Disease Control and Prevention, Downey, CA, USA
- Timothy Hefferon
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20892, USA
- David Morgan Henke
- Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, 77030, USA
- Wolfram Höps
- EMBL Heidelberg, Genome Biology Unit, Heidelberg, Germany
- Michael D. Jochum
- Department of Obstetrics & Gynecology, Baylor College of Medicine, Houston, TX, 77030, USA
- Maria Jose
- Centre for Bioinformatics, Pondicherry University, Pondicherry, India
- Rupesh K. Kesharwani
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Priya Lakra
- Department of Zoology, University of Delhi, Delhi, India
- Damaris Lattimer
- University of Applied Sciences Upper Austria - FH Hagenberg, Hagenberg im Mühlkreis, Austria
- Chia-Sin Liew
- Center for Biotechnology, University of Nebraska-Lincoln, Lincoln, Nebraska, 68588, USA
- Bai-Wei Lo
- Department of Biology, University of Konstanz, Konstanz, Germany
- Chunhsuan Lo
- Human Genetics Laboratory, National Institute of Genetics, Japan, Mishima City, Japan
- Anneri Lötter
- Department of Biochemistry, University of Pretoria, Pretoria, South Africa
- Sina Majidian
- Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Rajarshi Mondal
- Department of Biotechnology, The University of Burdwan, West Bengal, India
- Hiroko Ohmiya
- Genetic Reagent Development Unit, Medical & Biological Laboratories Co., Ltd., Tokyo, Japan
- Nasrin Parvin
- Department of Biotechnology, The University of Burdwan, West Bengal, India
- Marie Saitou
- Center of Integrative Genetics (CIGENE), Faculty of Biosciences, Norwegian University of Life Sciences, Ås, Norway
- Aditi Sammi
- School of Biochemical Engineering, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India
- Philippe Sanio
- University of Applied Sciences Upper Austria - FH Hagenberg, Hagenberg im Mühlkreis, Austria
- Nicolae Sapoval
- Department of Computer Science, Rice University, Houston, TX, USA
- Najeeb Syed
- Research Department, Sidra Medicine, Doha, Qatar
- Todd Treangen
- Department of Computer Science, Rice University, Houston, TX, USA
- Tiancheng Xu
- Department of Computer Science, Rice University, Houston, TX, USA
- Jianzhi Yang
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Shangzhe Zhang
- School of Biology, University of St Andrews, St Andrews, UK
- Weiyu Zhou
- Department of Statistical Science, George Mason University, Fairfax, Virginia, USA
- Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
6
Ji YT, Xiu Z, Chen CH, Wang Y, Yang JX, Sui JJ, Jiang SJ, Wang P, Yue SY, Zhang QQ, Jin JL, Wang GS, Wei QQ, Wei B, Wang J, Zhang HL, Zhang QY, Liu J, Liu CJ, Jian JB, Qu CQ. Long read sequencing of Toona sinensis (A. Juss) Roem: A chromosome-level reference genome for the family Meliaceae. Mol Ecol Resour 2021; 21:1243-1255. [PMID: 33421343 DOI: 10.1111/1755-0998.13318] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 05/06/2020] [Revised: 12/21/2020] [Accepted: 01/05/2021] [Indexed: 11/30/2022]
Abstract
Chinese mahogany (Toona sinensis) is a woody plant that is widely cultivated in China and Malaysia. Toona sinensis is important economically, including as a nutritious food source, as material for traditional Chinese medicine and as a high-quality hardwood. However, the absence of a reference genome has hindered in-depth molecular and evolutionary studies of this plant. In this study, we report a high-quality T. sinensis genome assembly, with scaffolds anchored to 28 chromosomes and a total assembled length of 596 Mb (contig N50 = 1.5 Mb and scaffold N50 = 21.5 Mb). A total of 34,345 genes were predicted in the genome after homology-based and de novo annotation analyses. Evolutionary analysis showed that the genomes of T. sinensis and Populus trichocarpa diverged ~99.1-103.1 million years ago, and the T. sinensis genome underwent a recent genome-wide duplication event at ~7.8 million years and one more ancient whole genome duplication event at ~71.5 million years. These results provide a high-quality chromosome-level reference genome for T. sinensis and confirm its evolutionary position at the genomic level. Such information will offer genomic resources to study the molecular mechanism of terpenoid biosynthesis and the formation of flavour compounds, which will further facilitate its molecular breeding. As the first chromosome-level genome assembled in the family Meliaceae, it will provide unique insights into the evolution of members of the Meliaceae.
Affiliation(s)
- Yun-Tao Ji
- Engineering Technology Research Center of Anti-aging Chinese Herbal Medicine of Anhui Province, Biology and Food Engineering School, Fuyang Normal University, Fuyang, China
- Zhihui Xiu
- BGI Genomics, BGI-Shenzhen, Shenzhen, China
- Youru Wang
- Hubei Engineering Research Center of Typical Wild Vegetables Breeding and Comprehensive Utilization Technology, Hubei Normal University, Huangshi, China
- Jing-Xia Yang
- Engineering Technology Research Center of Anti-aging Chinese Herbal Medicine of Anhui Province, Biology and Food Engineering School, Fuyang Normal University, Fuyang, China
- Juan-Juan Sui
- Engineering Technology Research Center of Anti-aging Chinese Herbal Medicine of Anhui Province, Biology and Food Engineering School, Fuyang Normal University, Fuyang, China
- Ping Wang
- BGI Genomics, BGI-Shenzhen, Shenzhen, China
- Shao-Yun Yue
- Engineering Technology Research Center of Anti-aging Chinese Herbal Medicine of Anhui Province, Biology and Food Engineering School, Fuyang Normal University, Fuyang, China
- Ji-Liang Jin
- Engineering Technology Research Center of Anti-aging Chinese Herbal Medicine of Anhui Province, Biology and Food Engineering School, Fuyang Normal University, Fuyang, China
- Bing Wei
- Engineering Technology Research Center of Anti-aging Chinese Herbal Medicine of Anhui Province, Biology and Food Engineering School, Fuyang Normal University, Fuyang, China
- Juan Wang
- Engineering Technology Research Center of Anti-aging Chinese Herbal Medicine of Anhui Province, Biology and Food Engineering School, Fuyang Normal University, Fuyang, China
- Qiu-Yan Zhang
- Engineering Technology Research Center of Anti-aging Chinese Herbal Medicine of Anhui Province, Biology and Food Engineering School, Fuyang Normal University, Fuyang, China
- Jun Liu
- Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Fuyang, China
- Chang-Jin Liu
- State Key Laboratory of Food Nutrition and Safety, School of Food Science and Technology, Tianjin University of Science and Technology, Tianjin, China
- Jian-Bo Jian
- BGI Genomics, BGI-Shenzhen, Shenzhen, China
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Lyngby, Denmark
- Key Laboratory of Genomics, Ministry of Agriculture, BGI-Shenzhen, Shenzhen, China
- Chang-Qing Qu
- Engineering Technology Research Center of Anti-aging Chinese Herbal Medicine of Anhui Province, Biology and Food Engineering School, Fuyang Normal University, Fuyang, China
7
Hou Y, Zhang X, Zhou Q, Hong W, Wang Y. Hierarchical Microbial Functions Prediction by Graph Aggregated Embedding. Front Genet 2021; 11:608512. [PMID: 33584804 PMCID: PMC7874084 DOI: 10.3389/fgene.2020.608512] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/21/2020] [Accepted: 11/20/2020] [Indexed: 02/01/2023] Open
Abstract
Matching 16S rRNA gene sequencing data to a metabolic reference database is a meaningful way to predict the metabolic function of bacteria and archaea, bringing greater insight into the workings of the microbial community. However, some operational taxonomic units (OTUs) cannot be functionally profiled, especially for microbial communities from non-human samples cultured in defective media. Therefore, we herein report the development of Hierarchical micrObial functions Prediction by graph aggregated Embedding (HOPE), which utilizes co-occurrence patterns and nucleotide sequences to predict microbial functions. HOPE integrates topological structures of microbial co-occurrence networks with k-mer compositions of OTU sequences and embeds them into a lower-dimensional continuous latent space, while maximally preserving topological relationships among OTUs. Our framework recognizes the high imbalance among KEGG Orthology (KO) functions of microbes, which usually yields poor performance. A hierarchical multitask learning module is used in HOPE to alleviate the challenge brought by the long-tailed distribution among classes. To test the performance of HOPE, we compare it with HOPE-one, HOPE-seq, and GraphSAGE on three microbial metagenomic 16S rRNA sequencing datasets: abalone gut, human gut, and the gut of Penaeus monodon. Experiments demonstrate that HOPE outperforms the baselines on almost all indexes in all experiments. Furthermore, HOPE shows significant generalization ability. HOPE's basic idea is suitable for other related scenarios, such as the prediction of gene function based on gene co-expression networks. The source code of HOPE is freely available at https://github.com/adrift00/HOPE.
Affiliation(s)
- Yujie Hou
- Department of Automation, Xiamen University, Xiamen, China
- Department of Automation, University of Science and Technology of China, Hefei, China
- Xiong Zhang
- Department of Automation, Xiamen University, Xiamen, China
- School of Automation Science and Engineering, South China University of Technology, Guangzhou, China
- Qinyan Zhou
- Department of Automation, Xiamen University, Xiamen, China
- Institute of AI and Robotics, Fudan University, Shanghai, China
- Wenxing Hong
- Department of Automation, Xiamen University, Xiamen, China
- Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen, China
- Ying Wang
- Department of Automation, Xiamen University, Xiamen, China
- Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen, China
- Fujian Key Laboratory of Genetics and Breeding of Marine Organisms, Xiamen, China