1
|
Premeaux TA, Bowler S, Friday CM, Moser CB, Hoenigl M, Lederman MM, Landay AL, Gianella S, Ndhlovu LC. Machine learning models based on fluid immunoproteins that predict non-AIDS adverse events in people with HIV. iScience 2024; 27:109945. [PMID: 38812553 PMCID: PMC11134891 DOI: 10.1016/j.isci.2024.109945] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2023] [Revised: 03/12/2024] [Accepted: 05/06/2024] [Indexed: 05/31/2024] Open
Abstract
Despite the success of antiretroviral therapy (ART), individuals with HIV remain at risk of non-AIDS adverse events (NAEs), including cardiovascular complications and malignancy. Several surrogate immune biomarkers in blood have shown value in predicting NAEs; however, composite panels generated using machine learning may monitor and discriminate NAEs more accurately. In a nested case-control study, we aimed to develop machine learning models to discriminate cases (individuals who experienced an event) from matched controls using demographic and clinical characteristics alongside 49 plasma immunoproteins measured before and after ART initiation. We generated support vector machine (SVM) classifier models, at both the pre-ART and one-year post-ART time points, that discriminated with high accuracy individuals aged 30-50 years who experienced non-fatal NAEs. Extreme gradient boosting generated a high-accuracy model at pre-ART, while K-nearest neighbors performed poorly throughout. SVM modeling may offer guidance to improve disease monitoring and elucidate potential therapeutic interventions.
Affiliation(s)
- Thomas A. Premeaux
- Division of Infectious Diseases, Department of Medicine, Weill Cornell Medicine, New York, NY, USA
| | - Scott Bowler
- Division of Infectious Diseases, Department of Medicine, Weill Cornell Medicine, New York, NY, USA
| | - Courtney M. Friday
- Division of Infectious Diseases, Department of Medicine, Weill Cornell Medicine, New York, NY, USA
| | - Carlee B. Moser
- Center for Biostatistics in AIDS Research in the Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Martin Hoenigl
- Division of Infectious Diseases, Department of Medicine, University of California San Diego, San Diego, CA, USA
- Division of Infectious Diseases, Department of Internal Medicine, Medical University of Graz, Graz, Austria
| | - Michael M. Lederman
- Department of Medicine, Division of Infectious Diseases and HIV Medicine, Case Western Reserve University, Cleveland, OH, USA
| | - Alan L. Landay
- Department of Internal Medicine, Rush University Medical Center, Chicago, IL, USA
| | - Sara Gianella
- Division of Infectious Diseases, Department of Medicine, University of California San Diego, San Diego, CA, USA
| | - Lishomwa C. Ndhlovu
- Division of Infectious Diseases, Department of Medicine, Weill Cornell Medicine, New York, NY, USA
| |
|
2
|
Karaglani M, Agorastos A, Panagopoulou M, Parlapani E, Athanasis P, Bitsios P, Tzitzikou K, Theodosiou T, Iliopoulos I, Bozikas VP, Chatzaki E. A novel blood-based epigenetic biosignature in first-episode schizophrenia patients through automated machine learning. Transl Psychiatry 2024; 14:257. [PMID: 38886359 PMCID: PMC11183091 DOI: 10.1038/s41398-024-02946-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2023] [Revised: 05/15/2024] [Accepted: 05/17/2024] [Indexed: 06/20/2024] Open
Abstract
Schizophrenia (SCZ) is a chronic, severe, and complex psychiatric disorder that affects all aspects of personal functioning. While SCZ has a very strong biological component, there are still no objective diagnostic tests. Lately, special attention has been given to epigenetic biomarkers in SCZ. In this study, we introduce a three-step, data-driven biomarker discovery pipeline based on automated machine learning (AutoML), using genome-wide DNA methylation datasets and laboratory validation, to deliver a high-performing, blood-based epigenetic biosignature of diagnostic clinical value in SCZ. Publicly available blood methylomes from SCZ patients and healthy individuals were analyzed via AutoML to identify SCZ-specific biomarkers. The methylation of the identified genes was then analyzed by targeted qMSP assays in blood gDNA of 30 first-episode drug-naïve SCZ patients and 30 healthy controls (CTRL). Finally, AutoML was used to produce an optimized disease-specific biosignature based on patient methylation data combined with demographics. AutoML identified an SCZ-specific set of novel gene methylation biomarkers including IGF2BP1, CENPI, and PSME4. Functional analysis investigated correlations with SCZ pathology. Methylation levels of IGF2BP1 and PSME4, but not CENPI, were found to differ between groups, with IGF2BP1 higher and PSME4 lower in the SCZ group than in the CTRL group. Additional AutoML classification analysis of our experimental patient data led to a five-feature biosignature including all three genes, as well as age and sex, that discriminated SCZ patients from healthy individuals [AUC 0.755 (0.636, 0.862) and average precision 0.758 (0.690, 0.825)]. In conclusion, this three-step pipeline enabled the discovery of three novel genes and an epigenetic biosignature with potential value as blood-based SCZ diagnostics.
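The abstract above reports two ranking metrics, AUC and average precision. As a reading aid, the following is a minimal sketch of how these two quantities can be computed from class labels and classifier scores. It is not the authors' pipeline or the AutoML tool's implementation; the function names and the toy labels and scores are hypothetical, and the sketch assumes at least one positive and one negative sample and no tied scores.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values taken at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` samples
    return total / hits


# Hypothetical toy data: 1 = patient, 0 = control.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
```

The pairwise form of AUC is quadratic in the sample count, which is acceptable for cohorts of this size (60 subjects); a rank-based formula would be preferable for large datasets.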
Affiliation(s)
- Makrina Karaglani
- Laboratory of Pharmacology, Department of Medicine, Democritus University of Thrace, GR-68132, Alexandroupolis, Greece
- Institute of Agri-food and Life Sciences, University Research & Innovation Center, H.M.U.R.I.C., Hellenic Mediterranean University, GR-71003, Crete, Greece
| | - Agorastos Agorastos
- Institute of Agri-food and Life Sciences, University Research & Innovation Center, H.M.U.R.I.C., Hellenic Mediterranean University, GR-71003, Crete, Greece
- II. Department of Psychiatry, Faculty of Health Sciences, School of Medicine, Aristotle University of Thessaloniki, GR-56430, Thessaloniki, Greece
| | - Maria Panagopoulou
- Laboratory of Pharmacology, Department of Medicine, Democritus University of Thrace, GR-68132, Alexandroupolis, Greece
- Institute of Agri-food and Life Sciences, University Research & Innovation Center, H.M.U.R.I.C., Hellenic Mediterranean University, GR-71003, Crete, Greece
| | - Eleni Parlapani
- I. Department of Psychiatry, Faculty of Health Sciences, School of Medicine, Aristotle University of Thessaloniki, GR-56429, Thessaloniki, Greece
| | - Panagiotis Athanasis
- II. Department of Psychiatry, Faculty of Health Sciences, School of Medicine, Aristotle University of Thessaloniki, GR-56430, Thessaloniki, Greece
| | - Panagiotis Bitsios
- Department of Psychiatry and Behavioral Sciences, Faculty of Medicine, University of Crete, GR-71500, Heraklion, Greece
| | - Konstantina Tzitzikou
- Laboratory of Pharmacology, Department of Medicine, Democritus University of Thrace, GR-68132, Alexandroupolis, Greece
| | - Theodosis Theodosiou
- Laboratory of Pharmacology, Department of Medicine, Democritus University of Thrace, GR-68132, Alexandroupolis, Greece
- ABCureD P.C, GR-68131, Alexandroupolis, Greece
| | - Ioannis Iliopoulos
- Division of Basic Sciences, School of Medicine, University of Crete, GR-71003, Heraklion, Greece
| | - Vasilios-Panteleimon Bozikas
- II. Department of Psychiatry, Faculty of Health Sciences, School of Medicine, Aristotle University of Thessaloniki, GR-56430, Thessaloniki, Greece
| | - Ekaterini Chatzaki
- Laboratory of Pharmacology, Department of Medicine, Democritus University of Thrace, GR-68132, Alexandroupolis, Greece.
- Institute of Agri-food and Life Sciences, University Research & Innovation Center, H.M.U.R.I.C., Hellenic Mediterranean University, GR-71003, Crete, Greece.
- ABCureD P.C, GR-68131, Alexandroupolis, Greece.
- Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology, 70013, Heraklion, Greece.
| |
|
3
|
Li YY, Yuan MM, Li YY, Li S, Wang JD, Wang YF, Li Q, Li J, Chen RR, Peng JM, Du B. Cell-free DNA methylation reveals cell-specific tissue injury and correlates with disease severity and patient outcomes in COVID-19. Clin Epigenetics 2024; 16:37. [PMID: 38429730 PMCID: PMC10908074 DOI: 10.1186/s13148-024-01645-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Accepted: 02/16/2024] [Indexed: 03/03/2024] Open
Abstract
BACKGROUND The recently identified methylation patterns specific to cell type allow the tracing of cell death dynamics at the cellular level in health and disease. This study used COVID-19 as a disease model to investigate the efficacy of cell-specific cell-free DNA (cfDNA) methylation markers in reflecting or predicting disease severity or outcome. METHODS Whole genome methylation sequencing of cfDNA was performed for 20 healthy individuals, 20 cases with non-hospitalized COVID-19 and 12 cases with severe COVID-19 admitted to the intensive care unit (ICU). Differentially methylated region (DMR) and gene ontology pathway enrichment analyses were performed to explore the locus-specific methylation differences between cohorts. The proportion of cfDNA in a given sample derived from lung and immune cells (i.e. tissue fraction) was estimated at cell-type resolution using a novel algorithm; this proportion reflects lung injury and immune response in COVID-19 patients and was further used to evaluate clinical severity and patient outcome. RESULTS COVID-19 patients had globally reduced cfDNA methylation levels compared with healthy controls. Compared with non-hospitalized COVID-19 patients, the cfDNA methylation pattern was significantly altered in severe patients, with the identification of 11,156 DMRs, which were mainly enriched in pathways related to immune response. Markedly elevated levels of cfDNA derived from lung, and more specifically from alveolar epithelial cells, bronchial epithelial cells, and lung endothelial cells, were observed in COVID-19 patients compared with healthy controls. Compared with non-hospitalized patients or healthy controls, severe COVID-19 patients had significantly higher cfDNA derived from B cells, T cells and granulocytes and lower cfDNA from natural killer cells. Moreover, cfDNA derived from alveolar epithelial cells performed best in differentiating COVID-19 severities, lung injury levels, SOFA scores and in-hospital deaths, with areas under the receiver operating characteristic curve of 0.958, 0.941, 0.919 and 0.955, respectively. CONCLUSION Severe COVID-19 has a distinct cfDNA methylation signature compared with non-hospitalized COVID-19 and healthy controls. The cell type-specific cfDNA methylation signature enables the tracing of COVID-19-related cell death in lung and immune cells at cell-type resolution, which is correlated with clinical severity and outcomes, and has extensive application prospects for evaluating tissue injury in diseases with multi-organ dysfunction.
Affiliation(s)
- Yuan-Yuan Li
- Medical ICU, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China
| | - Ming-Ming Yuan
- Geneplus-Beijing, Floor 9, Building 6, Medical Park Road, Zhongguancun Life Science Park, Changping District, Beijing, 102206, China
| | - Yuan-Yuan Li
- Medical ICU, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China
| | - Shan Li
- Medical ICU, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China
| | - Jing-Dong Wang
- Geneplus-Shenzhen, Building B, First Branch, Zhongcheng Life Science Park, Zhongxing Road, Kengzi Street, Pingshan District, Shenzhen, 518000, China
| | - Yu-Fei Wang
- Geneplus-Shenzhen, Building B, First Branch, Zhongcheng Life Science Park, Zhongxing Road, Kengzi Street, Pingshan District, Shenzhen, 518000, China
| | - Qian Li
- Geneplus-Beijing, Floor 9, Building 6, Medical Park Road, Zhongguancun Life Science Park, Changping District, Beijing, 102206, China
| | - Jun Li
- Geneplus-Shenzhen, Building B, First Branch, Zhongcheng Life Science Park, Zhongxing Road, Kengzi Street, Pingshan District, Shenzhen, 518000, China
| | - Rong-Rong Chen
- Geneplus-Beijing, Floor 9, Building 6, Medical Park Road, Zhongguancun Life Science Park, Changping District, Beijing, 102206, China
| | - Jin-Min Peng
- Medical ICU, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.
| | - Bin Du
- Medical ICU, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.
| |
|
4
|
Thomaidis GV, Papadimitriou K, Michos S, Chartampilas E, Tsamardinos I. A characteristic cerebellar biosignature for bipolar disorder, identified with fully automatic machine learning. IBRO Neurosci Rep 2023; 15:77-89. [PMID: 38025660 PMCID: PMC10668096 DOI: 10.1016/j.ibneur.2023.06.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2023] [Revised: 05/19/2023] [Accepted: 06/29/2023] [Indexed: 12/01/2023] Open
Abstract
Background Transcriptomic profile differences between patients with bipolar disorder and healthy controls can be identified using machine learning and can provide information about the potential role of the cerebellum in the pathogenesis of bipolar disorder. With this aim, user-friendly, fully automated machine learning algorithms can achieve extremely high classification scores and identify disease-related predictive biosignatures, in short time frames and on small datasets. Method A fully automated machine learning platform, which selects the most suitable algorithm and a relevant set of hyper-parameter values, was applied to a preprocessed transcriptomics dataset in order to produce a model for biosignature selection and to classify subjects into groups of patients and controls. The parent GEO datasets were originally produced from the cerebellar and parietal lobe tissue of deceased bipolar patients and healthy controls, using the Affymetrix Human Gene 1.0 ST Array. Results Patients and controls were classified into two separate groups, with no close-to-the-boundary cases, and this classification was based on a cerebellar transcriptomic biosignature of 25 features (genes), with Area Under Curve 0.929 and Average Precision 0.955. The biosignature includes both genes previously connected to bipolar disorder, depression, psychosis or epilepsy and genes not previously linked with any psychiatric disease. Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis revealed participation of 4 identified features in 6 pathways which have also been associated with bipolar disorder. Conclusion Automated machine learning (AutoML) accurately identified 25 genes that can jointly, in a multivariate fashion, separate bipolar patients from healthy controls with high predictive power. The discovered features lead to new biological insights. Machine Learning (ML) analysis considers the features in combination (in contrast to standard differential expression analysis), removing both irrelevant and redundant markers and thus focusing on biological interpretation.
Affiliation(s)
- Georgios V. Thomaidis
- Greek National Health System, Psychiatric Department, Katerini General Hospital, Katerini, Greece
| | - Konstantinos Papadimitriou
- Greek National Health System, G. Papanikolaou General Hospital, Organizational Unit - Psychiatric Hospital of Thessaloniki, Thessaloniki, Greece
| | | | - Evangelos Chartampilas
- Laboratory of Radiology, AHEPA General Hospital, University of Thessaloniki, Thessaloniki, Greece
| | | |
|
5
|
Dey A, Vaishak K, Deka D, Radhakrishnan AK, Paul S, Shanmugam P, Daniel AP, Pathak S, Duttaroy AK, Banerjee A. Epigenetic perspectives associated with COVID-19 infection and related cytokine storm: an updated review. Infection 2023; 51:1603-1618. [PMID: 36906872 PMCID: PMC10008189 DOI: 10.1007/s15010-023-02017-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Received: 06/29/2022] [Accepted: 02/27/2023] [Indexed: 03/13/2023]
Abstract
PURPOSE The COVID-19 pandemic caused by the novel Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has put the world in a medical crisis for the past three years; nearly 6.3 million lives have been lost due to the virus outbreak. This review aims to update the recent findings on COVID-19 infection from an epigenetic perspective and develop future perspectives on epi-drugs to treat the disease. METHODS Original research articles and review studies related to COVID-19, published mainly between 2019 and 2022, were searched in the Google Scholar/PubMed/Medline databases and analyzed to summarize recent work. RESULTS Numerous in-depth studies of the mechanisms used by SARS-CoV-2 are ongoing to minimize the consequences of the viral outbreak. Angiotensin-Converting Enzyme 2 receptors and Transmembrane serine protease 2 facilitate viral entry into host cells. Upon internalization, the virus uses the host machinery to replicate and alters the downstream regulation of normal cells, causing infection-related morbidity and mortality. In addition, several epigenetic mechanisms, such as DNA methylation, acetylation, histone modifications, and microRNA, together with other factors (age, sex, etc.), regulate viral entry, immune evasion, and cytokine responses and play a major modulatory role in COVID-19 severity; these are discussed in detail in this review. CONCLUSION Findings on the epigenetic regulation of viral pathogenicity open a new window for epi-drugs as a possible therapeutic approach against COVID-19.
Affiliation(s)
- Amit Dey
- Department of Medical Biotechnology, Faculty of Allied Health Sciences, Chettinad Academy of Research and Education (CARE), Chettinad Hospital and Research Institute (CHRI), Kelambakkam, Chennai, TN, 603103, India
| | - K Vaishak
- Department of Medical Biotechnology, Faculty of Allied Health Sciences, Chettinad Academy of Research and Education (CARE), Chettinad Hospital and Research Institute (CHRI), Kelambakkam, Chennai, TN, 603103, India
| | - Dikshita Deka
- Department of Medical Biotechnology, Faculty of Allied Health Sciences, Chettinad Academy of Research and Education (CARE), Chettinad Hospital and Research Institute (CHRI), Kelambakkam, Chennai, TN, 603103, India
| | - Arun Kumar Radhakrishnan
- Department of Pharmacology, Chettinad Hospital and Research Institute (CHRI), Chettinad Academy of Research and Education (CARE), Chennai, TN, India
| | - Sujay Paul
- Tecnologico de Monterrey, School of Engineering and Sciences, Campus Queretaro, Av. Epigmenio Gonzalez, No.500 Fracc., CP 76130, San Pablo, Querétaro, Mexico
| | - Priyadarshini Shanmugam
- Department of Microbiology, Chettinad Hospital and Research Institute (CHRI), Chettinad Academy of Research and Education (CARE), Chennai, TN, 603103, India
| | - Alice Peace Daniel
- Department of Microbiology, Chettinad Hospital and Research Institute (CHRI), Chettinad Academy of Research and Education (CARE), Chennai, TN, 603103, India
| | - Surajit Pathak
- Department of Medical Biotechnology, Faculty of Allied Health Sciences, Chettinad Academy of Research and Education (CARE), Chettinad Hospital and Research Institute (CHRI), Kelambakkam, Chennai, TN, 603103, India
| | - Asim K Duttaroy
- Department of Nutrition, Institute of Basic Medical Sciences, Faculty of Medicine, University of Oslo, Oslo, Norway.
| | - Antara Banerjee
- Department of Medical Biotechnology, Faculty of Allied Health Sciences, Chettinad Academy of Research and Education (CARE), Chettinad Hospital and Research Institute (CHRI), Kelambakkam, Chennai, TN, 603103, India.
| |
|
6
|
Lakiotaki K, Papadovasilakis Z, Lagani V, Fafalios S, Charonyktakis P, Tsagris M, Tsamardinos I. Automated machine learning for genome wide association studies. Bioinformatics 2023; 39:btad545. [PMID: 37672022 PMCID: PMC10562960 DOI: 10.1093/bioinformatics/btad545] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2023] [Revised: 06/29/2023] [Accepted: 09/05/2023] [Indexed: 09/07/2023] Open
Abstract
MOTIVATION Genome-wide association studies (GWAS) present several computational and statistical challenges for their data analysis, including knowledge discovery, interpretability, and translation to clinical practice. RESULTS We develop, apply, and comparatively evaluate an automated machine learning (AutoML) approach, customized for genomic data that delivers reliable predictive and diagnostic models, the set of genetic variants that are important for predictions (called a biosignature), and an estimate of the out-of-sample predictive power. This AutoML approach discovers variants with higher predictive performance compared to standard GWAS methods, computes an individual risk prediction score, generalizes to new, unseen data, is shown to better differentiate causal variants from other highly correlated variants, and enhances knowledge discovery and interpretability by reporting multiple equivalent biosignatures. AVAILABILITY AND IMPLEMENTATION Code for this study is available at: https://github.com/mensxmachina/autoML-GWAS. JADBio offers a free version at: https://jadbio.com/sign-up/. SNP data can be downloaded from the EGA repository (https://ega-archive.org/). PRS data are found at: https://www.aicrowd.com/challenges/opensnp-height-prediction. Simulation data to study population structure can be found at: https://easygwas.ethz.ch/data/public/dataset/view/1/.
Affiliation(s)
| | - Zaharias Papadovasilakis
- Department of Computer Science, University of Crete, Heraklion, Greece
- JADBio Gnosis DA S.A., Science and Technology Park of Crete, GR-70013 Heraklion, Greece
- Laboratory of Immune Regulation and Tolerance, School of Medicine, University of Crete, Heraklion, Greece
| | - Vincenzo Lagani
- Biological and Environmental Sciences and Engineering Division (BESE), King Abdullah University of Science and Technology KAUST, Thuwal 23952, Saudi Arabia
- SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence, Thuwal 23952, Saudi Arabia
- Institute of Chemical Biology, Ilia State University, Tbilisi, Georgia
| | - Stefanos Fafalios
- Department of Computer Science, University of Crete, Heraklion, Greece
- JADBio Gnosis DA S.A., Science and Technology Park of Crete, GR-70013 Heraklion, Greece
| | - Paulos Charonyktakis
- JADBio Gnosis DA S.A., Science and Technology Park of Crete, GR-70013 Heraklion, Greece
| | - Michail Tsagris
- Department of Computer Science, University of Crete, Heraklion, Greece
- Department of Economics, University of Crete, Heraklion, Greece
| | - Ioannis Tsamardinos
- Department of Computer Science, University of Crete, Heraklion, Greece
- JADBio Gnosis DA S.A., Science and Technology Park of Crete, GR-70013 Heraklion, Greece
| |
|
7
|
Yu Z, Yang Z, Lan Q, Wang Y, Huang F, Cai Y. Kmer-Node2Vec: a Fast and Efficient Method for Kmer Embedding from the Kmer Co-occurrence Graph, with Applications to DNA Sequences. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2023; 2023:1-4. [PMID: 38083774 DOI: 10.1109/embc40787.2023.10341090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/18/2023]
Abstract
Learning low-dimensional continuous vector representations for the short k-mers into which long DNA sequences are divided is key to DNA sequence modeling and can be utilized in many bioinformatics investigations, such as DNA sequence retrieval and classification. DNA2Vec is the most widely used method for DNA sequence embedding. However, it scales poorly to large data sets due to its extremely long training time in kmer embedding. In this paper, we propose a novel, efficient graph-based kmer embedding method, named Kmer-Node2Vec, to tackle this concern. Our method converts the large DNA corpus into one kmer co-occurrence graph and extracts kmer relations on the graph by random walks to learn fast and high-quality kmer embeddings. Extensive experiments show that our method is 29 times faster than DNA2Vec for training on a 4GB data set, and on par with DNA2Vec in terms of task-specific accuracy of sequence retrieval and classification.
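The two steps named in the abstract above, building a kmer co-occurrence graph and sampling random walks on it, can be illustrated with a short sketch. This is not the Kmer-Node2Vec implementation: it assumes that co-occurrence means one k-mer directly following another at stride 1, and it uses a plain weighted first-order walk rather than any biased walk the authors may use. All function names and the toy sequence are hypothetical.

```python
import random
from collections import defaultdict


def kmers(seq, k):
    """Return the overlapping k-mers of seq, in order (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def cooccurrence_graph(sequences, k):
    """Weighted directed graph: graph[a][b] counts how often k-mer b directly follows a."""
    graph = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        toks = kmers(seq, k)
        for a, b in zip(toks, toks[1:]):
            graph[a][b] += 1
    return graph


def random_walk(graph, start, length, rng):
    """Weighted first-order random walk; stops early at a node with no outgoing edge."""
    walk = [start]
    while len(walk) < length:
        nbrs = graph.get(walk[-1])
        if not nbrs:
            break
        nodes = list(nbrs)
        weights = [nbrs[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


# Hypothetical toy corpus of one sequence.
toy_graph = cooccurrence_graph(["ACGTACGT"], k=3)
toy_walk = random_walk(toy_graph, "ACG", 5, random.Random(0))
```

In a pipeline of this kind, the sampled walks would typically be treated as sentences and passed to a skip-gram model to obtain the embedding vectors; that training step is omitted here.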
|