1
|
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
Collapse
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
| | - Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | | | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
| |
Collapse
|
2
|
Mouratidis I, Chantzi N, Khan U, Konnaris MA, Chan CSY, Mareboina M, Moeckel C, Georgakopoulos-Soares I. Frequentmers - a novel way to look at metagenomic next generation sequencing data and an application in detecting liver cirrhosis. BMC Genomics 2023; 24:768. [PMID: 38087204 PMCID: PMC10714505 DOI: 10.1186/s12864-023-09861-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2023] [Accepted: 11/29/2023] [Indexed: 12/17/2023] Open
Abstract
Early detection of human disease is associated with improved clinical outcomes. However, many diseases are often detected at an advanced, symptomatic stage where patients are past efficacious treatment periods and can result in less favorable outcomes. Therefore, methods that can accurately detect human disease at a presymptomatic stage are urgently needed. Here, we introduce "frequentmers"; short sequences that are specific and recurrently observed in either patient or healthy control samples, but not in both. We showcase the utility of frequentmers for the detection of liver cirrhosis using metagenomic Next Generation Sequencing data from stool samples of patients and controls. We develop classification models for the detection of liver cirrhosis and achieve an AUC score of 0.91 using ten-fold cross-validation. A small subset of 200 frequentmers can achieve comparable results in detecting liver cirrhosis. Finally, we identify the microbial organisms in liver cirrhosis samples, which are associated with the most predictive frequentmer biomarkers.
Collapse
Affiliation(s)
- Ioannis Mouratidis
- Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA.
| | - Nikol Chantzi
- Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
| | - Umair Khan
- Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA
| | - Maxwell A Konnaris
- Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
- Department of Statistics, Penn State, University Park, PA, USA
- Huck Institutes of the Life Sciences, Penn State, University Park, PA, USA
| | - Candace S Y Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Manvita Mareboina
- Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
| | - Camille Moeckel
- Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
| | - Ilias Georgakopoulos-Soares
- Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA.
| |
Collapse
|
3
|
Ruperao P, Gandham P, Odeny DA, Mayes S, Selvanayagam S, Thirunavukkarasu N, Das RR, Srikanda M, Gandhi H, Habyarimana E, Manyasa E, Nebie B, Deshpande SP, Rathore A. Exploring the sorghum race level diversity utilizing 272 sorghum accessions genomic resources. FRONTIERS IN PLANT SCIENCE 2023; 14:1143512. [PMID: 37008459 PMCID: PMC10063887 DOI: 10.3389/fpls.2023.1143512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/16/2023] [Accepted: 02/22/2023] [Indexed: 06/19/2023]
Abstract
Due to evolutionary divergence, sorghum race populations exhibit significant genetic and morphological variation. A k-mer-based sorghum race sequence comparison identified the conserved k-mers of all 272 accessions from sorghum and the race-specific genetic signatures identified the gene variability in 10,321 genes (PAVs). To understand sorghum race structure, diversity and domestication, a deep learning-based variant calling approach was employed in a set of genotypic data derived from a diverse panel of 272 sorghum accessions. The data resulted in 1.7 million high-quality genome-wide SNPs and identified selective signature (both positive and negative) regions through a genome-wide scan with different (iHS and XP-EHH) statistical methods. We discovered 2,370 genes associated with selection signatures including 179 selective sweep regions distributed over 10 chromosomes. Co-localization of these regions undergoing selective pressure with previously reported QTLs and genes revealed that the signatures of selection could be related to the domestication of important agronomic traits such as biomass and plant height. The developed k-mer signatures will be useful in the future to identify the sorghum race and for trait and SNP markers for assisting in plant breeding programs.
Collapse
Affiliation(s)
- Pradeep Ruperao
- Center of Excellence in Genomics and Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
| | - Prasad Gandham
- School of Plant, Environmental and Soil Sciences, Louisiana State University Agricultural Center, LA, United States
| | - Damaris A. Odeny
- Center of Excellence in Genomics and Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
| | - Sean Mayes
- Center of Excellence in Genomics and Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
| | | | - Nepolean Thirunavukkarasu
- Genomics and Molecular Breeding Lab, Indian Council of Agricultural Research (ICAR) - Indian Institute of Millets Research, Hyderabad, India
| | - Roma R. Das
- International Crops Research Institute for the Semi-Arid Tropics, Hyderabad, India
| | - Manasa Srikanda
- Department of Statistics, Osmania University, Hyderabad, India
| | - Harish Gandhi
- International Maize and Wheat Improvement Center (CIMMYT), Nairobi, Kenya
| | - Ephrem Habyarimana
- International Crops Research Institute for the Semi-Arid Tropics, Hyderabad, India
| | - Eric Manyasa
- Sorghum Breeding Program, International Crops Research Institute for the Semi-Arid Tropics, Nairobi, Kenya
| | - Baloua Nebie
- International Maize and Wheat Improvement Center (CIMMYT), Dakar, Senegal
| | | | - Abhishek Rathore
- Excellence in Breeding, International Maize and Wheat Improvement Center (CIMMYT), Hyderabad, India
| |
Collapse
|
4
|
Manconi A, Gnocchi M, Milanesi L, Marullo O, Armano G. Framing Apache Spark in life sciences. Heliyon 2023; 9:e13368. [PMID: 36852030 PMCID: PMC9958288 DOI: 10.1016/j.heliyon.2023.e13368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2022] [Revised: 01/19/2023] [Accepted: 01/29/2023] [Indexed: 02/11/2023] Open
Abstract
Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the drift to big data led researchers to face technical and infrastructural challenges for storing, sharing, and analysing them. In fact, this kind of tasks requires distributed computing systems and algorithms able to ensure efficient processing. Cutting edge distributed programming frameworks allow to implement flexible algorithms able to adapt the computation to the data over on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a very powerful HPC engine for large-scale data processing on clusters. Also thanks to specialised libraries for working with structured and relational data, it allows to support machine learning, graph-based computation, and stream processing. This review article is aimed at helping life sciences researchers to ascertain the features of Apache Spark and to assess whether it can be successfully used in their research activities.
Collapse
Affiliation(s)
- Andrea Manconi
- Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy
| | - Matteo Gnocchi
- Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy
| | - Luciano Milanesi
- Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy
| | - Osvaldo Marullo
- Department of Mathematics and Computer science - University of Cagliari, Cagliari, Italy
| | - Giuliano Armano
- Department of Mathematics and Computer science - University of Cagliari, Cagliari, Italy
| |
Collapse
|
5
|
Zheng Y, Shi J, Chen Q, Deng C, Yang F, Wang Y. Identifying individual-specific microbial DNA fingerprints from skin microbiomes. Front Microbiol 2022; 13:960043. [PMID: 36274714 PMCID: PMC9583911 DOI: 10.3389/fmicb.2022.960043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2022] [Accepted: 09/16/2022] [Indexed: 11/22/2022] Open
Abstract
Skin is an important ecosystem that links the human body and the external environment. Previous studies have shown that the skin microbial community could remain stable, even after long-term exposure to the external environment. In this study, we explore two questions: Do there exist strains or genetic variants in skin microorganisms that are individual-specific, temporally stable, and body site-independent? And if so, whether such microorganismal genetic variants could be used as markers, called “fingerprints” in our study, to identify donors? We proposed a framework to capture individual-specific DNA microbial fingerprints from skin metagenomic sequencing data. The fingerprints are identified on the frequency of 31-mers free from reference genomes and sequence alignments. The 616 metagenomic samples from 17 skin sites at 3-time points from 12 healthy individuals from Integrative Human Microbiome Project were adopted. Ultimately, one contig for each individual is assembled as a fingerprint. And results showed that 89.78% of the skin samples despite body sites could identify their donors correctly. It is observed that 10 out of 12 individual-specific fingerprints could be aligned to Cutibacterium acnes. Our study proves that the identified fingerprints are temporally stable, body site-independent, and individual-specific, and can identify their donors with enough accuracy. The source code of the genetic identification framework is freely available at https://github.com/Ying-Lab/skin_fingerprint.
Collapse
Affiliation(s)
- Yiluan Zheng
- Department of Automation, Xiamen University, Xiamen, China
- National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, China
| | - Jianlu Shi
- Stomatological Hospital of Xiamen Medical College, Xiamen, China
- Xiamen Key Laboratory of Stomatological Disease Diagnosis and Treatment, Xiamen, China
| | - Qi Chen
- Department of Automation, Xiamen University, Xiamen, China
| | - Chao Deng
- Department of Automation, Xiamen University, Xiamen, China
| | - Fan Yang
- Department of Automation, Xiamen University, Xiamen, China
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, China
- Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen, China
- Fujian Key Laboratory of Genetics and Breeding of Marine Organisms, Xiamen, China
- *Correspondence: Ying Wang
| |
Collapse
|
6
|
Sahlin K. Effective sequence similarity detection with strobemers. Genome Res 2021; 31:2080-2094. [PMID: 34667119 PMCID: PMC8559714 DOI: 10.1101/gr.275648.121] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2021] [Accepted: 08/20/2021] [Indexed: 01/08/2023]
Abstract
k-mer-based methods are widely used in bioinformatics for various types of sequence comparisons. However, a single mutation will mutate k consecutive k-mers and make most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, for example, spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step, a pairing or grouping of k-mers is performed. Such techniques produce many redundant k-mer matches owing to the size of k Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. We use simulated data to show that strobemers provide more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce higher match coverage across sequences. We further implement a proof-of-concept sequence-matching tool StrobeMap and use synthetic and biological Oxford Nanopore sequencing data to show the utility of using strobemers for sequence comparison in different contexts such as sequence clustering and alignment scenarios.
Collapse
Affiliation(s)
- Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, 10691 Stockholm, Sweden
| |
Collapse
|
7
|
Sun Q, Peng Y, Liu J. A reference-free approach for cell type classification with scRNA-seq. iScience 2021; 24:102855. [PMID: 34381979 PMCID: PMC8335627 DOI: 10.1016/j.isci.2021.102855] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2021] [Revised: 05/07/2021] [Accepted: 07/08/2021] [Indexed: 11/29/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a revolutionary technology to characterize cells under different biological conditions. Unlike bulk RNA-seq, gene expression from scRNA-seq is highly sparse due to limited sequencing depth per cell. This is worsened by tossing away a significant portion of reads that attribute to gene quantification. To overcome data sparsity and fully utilize original reads, we propose scSimClassify, a reference-free and alignment-free approach to classify cell types with k-mer level features. The compressed k-mer groups (CKGs), identified by the simhash method, contain k-mers with similar abundance profiles and serve as the cells’ features. Our experiments demonstrate that CKG features lend themselves to better performance than gene expression features in scRNA-seq classification accuracy in the majority of experimental cases. Because CKGs are derived from raw reads without alignment to reference genome, scSimClassify offers an effective alternative to existing methods especially when reference genome is incomplete or insufficient to represent subject genomes. Compressed k-mer groups (CKGs) are used to classify cell types without references CKGs are competitive to gene expression features for cell type classification CKGs are associated with genes sharing gene specific k-mers
Collapse
Affiliation(s)
- Qi Sun
- Department of Computer Science, University of Kentucky, Lexington, KY, 40508, USA
| | - Yifan Peng
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
| | - Jinze Liu
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA 23298, USA
| |
Collapse
|
8
|
Hou Y, Zhang X, Zhou Q, Hong W, Wang Y. Hierarchical Microbial Functions Prediction by Graph Aggregated Embedding. Front Genet 2021; 11:608512. [PMID: 33584804 PMCID: PMC7874084 DOI: 10.3389/fgene.2020.608512] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2020] [Accepted: 11/20/2020] [Indexed: 02/01/2023] Open
Abstract
Matching 16S rRNA gene sequencing data to a metabolic reference database is a meaningful way to predict the metabolic function of bacteria and archaea, bringing greater insight to the working of the microbial community. However, some operational taxonomy units (OTUs) cannot be functionally profiled, especially for microbial communities from non-human samples cultured in defective media. Therefore, we herein report the development of Hierarchical micrObial functions Prediction by graph aggregated Embedding (HOPE), which utilizes co-occurring patterns and nucleotide sequences to predict microbial functions. HOPE integrates topological structures of microbial co-occurrence networks with k-mer compositions of OTU sequences and embeds them into a lower-dimensional continuous latent space, while maximally preserving topological relationships among OTUs. The high imbalance among KEGG Orthology (KO) functions of microbes is recognized in our framework that usually yields poor performance. A hierarchical multitask learning module is used in HOPE to alleviate the challenge brought by the long-tailed distribution among classes. To test the performance of HOPE, we compare it with HOPE-one, HOPE-seq, and GraphSAGE, respectively, in three microbial metagenomic 16s rRNA sequencing datasets, including abalone gut, human gut, and gut of Penaeus monodon. Experiments demonstrate that HOPE outperforms baselines on almost all indexes in all experiments. Furthermore, HOPE reveals significant generalization ability. HOPE's basic idea is suitable for other related scenarios, such as the prediction of gene function based on gene co-expression networks. The source code of HOPE is freely available at https://github.com/adrift00/HOPE.
Collapse
Affiliation(s)
- Yujie Hou
- Department of Automation, Xiamen University, Xiamen, China.,Department of Automation, University of Science and Technology of China, Hefei, China
| | - Xiong Zhang
- Department of Automation, Xiamen University, Xiamen, China.,School of Automation Science and Engineering, South China University of Technology, Guangzhou, China
| | - Qinyan Zhou
- Department of Automation, Xiamen University, Xiamen, China.,Institute of AI and Robotics, Fudan University, Shanghai, China
| | - Wenxing Hong
- Department of Automation, Xiamen University, Xiamen, China.,Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen, China
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, China.,Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen, China.,Fujian Key Laboratory of Genetics and Breeding of Marine Organisms, Xiamen, China
| |
Collapse
|
9
|
Wang Y, Chen Q, Deng C, Zheng Y, Sun F. KmerGO: A Tool to Identify Group-Specific Sequences With k-mers. Front Microbiol 2020; 11:2067. [PMID: 32983048 PMCID: PMC7477287 DOI: 10.3389/fmicb.2020.02067] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2020] [Accepted: 08/06/2020] [Indexed: 01/24/2023] Open
Abstract
Capturing group-specific sequences between two groups of genomic/metagenomic sequences is critical for the follow-up identifications of singular nucleotide variants (SNVs), gene families, microbial species or other elements associated with each group. A sequence that is present, or rich, in one group, but absent, or scarce, in another group is considered a “group-specific” sequence in our study. We developed a user-friendly tool, KmerGO, to identify group-specific sequences between two groups of genomic/metagenomic long sequences or high-throughput sequencing datasets. Compared with other tools, KmerGO captures group-specific k-mers (k up to 40 bps) with much lower requirements for computing resources in much shorter running time. For a 1.05 TB dataset (.fasta), it takes KmerGO about 21.5 h (including k-mer counting) to return assembled group-specific sequences on a regular stand-alone workstation with no more than 1 GB memory. Furthermore, KmerGO can also be applied to capture trait-associated sequences for continuous-trait. Through multi-process parallel computing, KmerGO is implemented with both graphic user interface and command line on Linux and Windows free from any pre-installed supporting environments, packages, and complex configurations. The output group-specific k-mers or sequences from KmerGO could be the inputs of other tools for the downstream discovery of biomarkers, such as genetic variants, species, or genes. KmerGO is available at https://github.com/ChnMasterOG/KmerGO.
Collapse
Affiliation(s)
- Ying Wang
- Department of Automation, Xiamen University, Xiamen, China.,Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision-Making, Xiamen, China
| | - Qi Chen
- Department of Automation, Xiamen University, Xiamen, China
| | - Chao Deng
- Department of Automation, Xiamen University, Xiamen, China
| | - Yiluan Zheng
- Department of Automation, Xiamen University, Xiamen, China
| | - Fengzhu Sun
- Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, United States
| |
Collapse
|