1
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024]
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
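As a rough illustration of the absent-sequence idea mentioned in this abstract, the sketch below enumerates the k-mers over a DNA alphabet that never occur in a given sequence. It is a brute-force example under simplifying assumptions, not the method of any cited tool: the function name `nullomers` is hypothetical, only one strand is scanned, and the exhaustive enumeration is practical only for small k.

```python
from itertools import product


def nullomers(seq, k, alphabet="ACGT"):
    """Return, in sorted order, every k-mer over `alphabet` absent from `seq`.

    Assumption: single-strand scan; reverse complements are not considered.
    """
    # Collect every k-mer that does occur in the sequence.
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    # Enumerate all len(alphabet)**k candidates and keep the missing ones.
    candidates = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(kmer for kmer in candidates if kmer not in present)
```

For example, `nullomers("AACC", 1)` reports the two bases that never appear in the input. The candidate space grows as 4^k, so real analyses on genomes rely on far more compact data structures than a Python set.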
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
2
Park A, Koslicki D. Prokrustean Graph: A substring index for rapid k-mer size analysis. bioRxiv 2024:2023.11.21.568151. [PMID: 38853857 PMCID: PMC11160577 DOI: 10.1101/2023.11.21.568151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Despite the widespread adoption of k-mer-based methods in bioinformatics, understanding the influence of k-mer sizes remains a persistent challenge. Selecting an optimal k-mer size or employing multiple k-mer sizes is often arbitrary, application-specific, and fraught with computational complexities. Typically, the influence of k-mer size is obscured by the outputs of complex bioinformatics tasks, such as genome analysis, comparison, assembly, alignment, and error correction. However, it is frequently overlooked that every method is built above a well-defined k-mer-based object like Jaccard Similarity, de Bruijn graphs, k-mer spectra, and Bray-Curtis Dissimilarity. Despite these objects offering a clearer perspective on the role of k-mer sizes, the dynamics of k-mer-based objects with respect to k-mer sizes remain surprisingly elusive. This paper introduces a computational framework that generalizes the transition of k-mer-based objects across k-mer sizes, utilizing a novel substring index, the Prokrustean graph. The primary contribution of this framework is to compute quantities associated with k-mer-based objects for all k-mer sizes, where the computational complexity depends solely on the number of maximal repeats and is independent of the range of k-mer sizes. For example, counting vertices of compacted de Bruijn graphs for k = 1, …, 100 can be accomplished in mere seconds with our substring index constructed on a gigabase-sized read set. Additionally, we derive a space-efficient algorithm to extract the Prokrustean graph from the Burrows-Wheeler Transform. It becomes evident that modern substring indices, mostly based on longest common prefixes of suffix arrays, inherently face difficulties at exploring varying k-mer sizes due to their limitations at grouping co-occurring substrings. We have implemented four applications that utilize quantities critical in modern pangenomics and metagenomics.
The code for these applications and the construction algorithm is available at https://github.com/KoslickiLab/prokrustean .
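To make the notion of a k-mer-based object that changes with k concrete, the following sketch recomputes Jaccard similarity between two sequences independently for each k. It is a naive baseline written for illustration only; it is not the Prokrustean graph and does not use the linked repository. The function names (`kmers`, `jaccard`, `jaccard_profile`) are hypothetical.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in seq (empty if k > len(seq))."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Jaccard similarity between the k-mer sets of sequences a and b."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    # Two empty k-mer sets are treated as having similarity 0.0 here.
    return len(ka & kb) / len(union) if union else 0.0


def jaccard_profile(a, b, k_values):
    """Brute force: rebuild both k-mer sets from scratch for every k."""
    return {k: jaccard(a, b, k) for k in k_values}
```

Calling `jaccard_profile(a, b, range(1, 101))` rescans both inputs once per k, so the cost of this baseline grows with the number of k values examined. That dependence on the range of k is the overhead the abstract says the proposed index avoids.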
3
Chen Z, Zhang L, Lv Y, Qu S, Liu W, Wang K, Gao S, Zhu F, Cao B, Xu K. A genome assembly of ginger (Zingiber officinale Roscoe) provides insights into genome evolution and 6-gingerol biosynthesis. The Plant Journal 2024; 118:682-695. [PMID: 38251816 DOI: 10.1111/tpj.16625] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2022] [Revised: 12/12/2023] [Accepted: 12/22/2023] [Indexed: 01/23/2024]
Abstract
Ginger is cultivated in tropical and subtropical regions and is one of the most crucial spices worldwide owing to its special taste and scent. Here, we present a high-quality genome assembly for 'Small Laiwu Ginger', a famous cultivated ginger in northern China. The ginger genome was phased into two haplotypes, haplotype A (1.55Gb), and haplotype B (1.44Gb). Analysis of Ty1/Copia and Ty3/Gypsy LTR retrotransposon families revealed that both have undergone multiple retrotransposon bursts about 0-1 million years ago. In addition to a recent whole-genome duplication event, there has been a lineage-specific expansion of genes involved in stilbenoid, diarylheptanoid, and gingerol biosynthesis, thereby enhancing 6-gingerol biosynthesis. Furthermore, we focused on the biosynthesis of 6-gingerol, the most important gingerol, and screened key transcription factors ZoMYB106 and ZobHLH148 that regulate 6-gingerol synthesis by transcriptomic and metabolomic analysis in the ginger rhizome at four growth stages. The results of yeast one-hybrid, electrophoretic mobility shift, and dual-luciferase reporter gene assays showed that both ZoMYB106 and ZobHLH148 bind to the promoters of the key rate-limiting enzyme genes ZoCCOMT1 and ZoCCOMT2 in the 6-gingerol synthesis pathway and promote their transcriptional activities. The reference genome, transcriptome, and metabolome data pave the way for further research on the molecular mechanism underlying the biosynthesis of 6-gingerol. Furthermore, it provides precious new resources for the study on the biology and molecular breeding of ginger.
Affiliation(s)
- Zijing Chen
  - College of Horticulture Science and Engineering, Shandong Agricultural University, Tai'an, Shandong, P. R. China
  - Key Laboratory of Biology and Genetic Improvement of Horticultural Crops in Huanghuai Region, Ministry of Agriculture and Rural Affairs, Tai'an, P. R. China
- Ling Zhang
  - Laiwu Municipal Agriculture Bureau in Shandong, Jinan, P. R. China
- Yao Lv
  - College of Horticulture Science and Engineering, Shandong Agricultural University, Tai'an, Shandong, P. R. China
  - Key Laboratory of Biology and Genetic Improvement of Horticultural Crops in Huanghuai Region, Ministry of Agriculture and Rural Affairs, Tai'an, P. R. China
- Shenyang Qu
  - Agricultural Genomics Institute at Shenzhen Chinese Academy of Agricultural Sciences, Shenzhen, P. R. China
- Wenjun Liu
  - College of Horticulture Science and Engineering, Shandong Agricultural University, Tai'an, Shandong, P. R. China
  - Key Laboratory of Biology and Genetic Improvement of Horticultural Crops in Huanghuai Region, Ministry of Agriculture and Rural Affairs, Tai'an, P. R. China
- Kai Wang
  - College of Horticulture Science and Engineering, Shandong Agricultural University, Tai'an, Shandong, P. R. China
  - Key Laboratory of Biology and Genetic Improvement of Horticultural Crops in Huanghuai Region, Ministry of Agriculture and Rural Affairs, Tai'an, P. R. China
- Song Gao
  - College of Horticulture Science and Engineering, Shandong Agricultural University, Tai'an, Shandong, P. R. China
  - Key Laboratory of Biology and Genetic Improvement of Horticultural Crops in Huanghuai Region, Ministry of Agriculture and Rural Affairs, Tai'an, P. R. China
  - College of Horticulture and Landscape Architecture, Yangzhou University, Yangzhou, P. R. China
- Feng Zhu
  - Laiwu Municipal Agriculture Bureau in Shandong, Jinan, P. R. China
- Bili Cao
  - College of Horticulture Science and Engineering, Shandong Agricultural University, Tai'an, Shandong, P. R. China
  - Key Laboratory of Biology and Genetic Improvement of Horticultural Crops in Huanghuai Region, Ministry of Agriculture and Rural Affairs, Tai'an, P. R. China
- Kun Xu
  - College of Horticulture Science and Engineering, Shandong Agricultural University, Tai'an, Shandong, P. R. China
  - Key Laboratory of Biology and Genetic Improvement of Horticultural Crops in Huanghuai Region, Ministry of Agriculture and Rural Affairs, Tai'an, P. R. China
4
KMC3 and CHTKC: Best Scenarios, Deficiencies, and Challenges in High-Throughput Sequencing Data Analysis. Algorithms 2022. [DOI: 10.3390/a15040107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Background: K-mer frequency counting is an upstream process of many bioinformatics data analysis workflows. KMC3 and CHTKC are the representative partition-based k-mer counting and non-partition-based k-mer counting algorithms, respectively. This paper evaluates the two algorithms and presents their best applicable scenarios and potential improvements using multiple hardware contexts and datasets. Results: KMC3 uses less memory and runs faster than CHTKC on a regular configuration server. CHTKC is efficient on high-performance computing platforms with high available memory, multi-thread, and low IO bandwidth. When tested with various datasets, KMC3 is less sensitive to the number of distinct k-mers and is more efficient for tasks with relatively low sequencing quality and long k-mer. CHTKC performs better than KMC3 in counting assignments with large-scale datasets, high sequencing quality, and short k-mer. Both algorithms are affected by IO bandwidth, and decreasing the influence of the IO bottleneck is critical as our tests show improvement by filtering and compressing consecutive first-occurring k-mers in KMC3. Conclusions: KMC3 is more competitive for running counter on ordinary hardware resources, and CHTKC is more competitive for counting k-mers in super-scale datasets on higher-performance computing platforms. Reducing the influence of the IO bottleneck is essential for optimizing the k-mer counting algorithm, and filtering and compressing low-frequency k-mers is critical in relieving IO impact.
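For readers unfamiliar with the task being benchmarked, the sketch below shows k-mer frequency counting in its simplest form: a single in-memory hash table, followed by a filter that discards low-frequency k-mers. It is an illustrative assumption-laden example and does not reproduce the KMC3 or CHTKC algorithms, their partitioning or disk strategies, or the filtering scheme the paper tested. The names `count_kmers`, `drop_low_frequency`, and `min_count` are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads using one in-memory hash table."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k over the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_low_frequency(counts, min_count=2):
    """Keep only k-mers seen at least `min_count` times.

    Assumption: k-mers seen once are treated as likely sequencing errors.
    """
    return Counter({kmer: c for kmer, c in counts.items() if c >= min_count})
```

With `reads = ["ACGTAC", "CGTA"]` and `k = 3`, `count_kmers` records four distinct 3-mers, and `drop_low_frequency` retains only the two that occur in both reads. Because every distinct k-mer lives in memory at once, this approach scales poorly, which is why dedicated counters trade memory against disk IO in the ways the abstract compares.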