1
Huckvale ED, Moseley HNB. A cautionary tale about properly vetting datasets used in supervised learning predicting metabolic pathway involvement. PLoS One 2024; 19:e0299583. [PMID: 38696410 PMCID: PMC11065254 DOI: 10.1371/journal.pone.0299583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 02/13/2024] [Indexed: 05/04/2024]
Abstract
The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
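The leakage described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy SMILES entries, the function names, and the fold-splitting scheme are all assumptions made for illustration. The sketch measures what fraction of test entries also appear verbatim in the matching training set, before and after de-duplication.

```python
import random

def deduplicate(entries):
    """Keep the first occurrence of each (smiles, label) entry, preserving order."""
    seen = set()
    unique = []
    for entry in entries:
        if entry not in seen:
            seen.add(entry)
            unique.append(entry)
    return unique

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def leaked_fraction(entries, k=5, seed=0):
    """Fraction of test entries that also appear in the matching training set."""
    folds = k_fold_indices(len(entries), k, seed)
    leaked = total = 0
    for test_idx in folds:
        test_set = set(test_idx)
        train_entries = {entries[i] for i in range(len(entries)) if i not in test_set}
        for i in test_idx:
            total += 1
            leaked += entries[i] in train_entries
    return leaked / total

# Hypothetical toy data: (SMILES string, pathway label), with two repeated entries.
toy = [
    ("CCO", 1), ("CC(=O)O", 0), ("CCO", 1), ("c1ccccc1", 0),
    ("CCN", 1), ("CC(=O)O", 0), ("O=C=O", 0), ("CCCC", 1),
]

unique_toy = deduplicate(toy)
leak_before = leaked_fraction(toy, k=len(toy))   # leave-one-out on the raw data
leak_after = leaked_fraction(unique_toy, k=5)    # any split of de-duplicated data
```

With duplicates present, a model can score well on a test entry simply by having memorized its copy in the training set, which is why de-duplication has to happen before the folds are drawn.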
Affiliation(s)
- Erik D. Huckvale
  - Markey Cancer Center, University of Kentucky, Lexington, Kentucky, United States of America
- Hunter N. B. Moseley
  - Markey Cancer Center, University of Kentucky, Lexington, Kentucky, United States of America
  - Superfund Research Center, University of Kentucky, Lexington, Kentucky, United States of America
  - Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, Kentucky, United States of America
  - Institute for Biomedical Informatics, University of Kentucky, Lexington, Kentucky, United States of America
2
Huckvale ED, Moseley HN. Predicting The Pathway Involvement Of Metabolites Based on Combined Metabolite and Pathway Features. bioRxiv 2024:2024.04.01.587582. [PMID: 38617261 PMCID: PMC11014601 DOI: 10.1101/2024.04.01.587582] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/16/2024]
Abstract
A major limitation of most metabolomics datasets is the sparsity of pathway annotations of detected metabolites. It is common for less than half of identified metabolites in these datasets to have known metabolic pathway involvement. Trying to address this limitation, machine learning models have been developed to predict the association of a metabolite with a "pathway category", as defined by one of the metabolic knowledgebases like the Kyoto Encyclopedia of Genes and Genomes. Most of these models are implemented as a single binary classifier specific to a single pathway category, requiring a set of binary classifiers for generating predictions for multiple pathway categories. This single binary classifier per pathway category approach both multiplies the computational resources necessary for training and dilutes the positive entries in gold standard datasets needed for training. To address the limitations of training separate classifiers, we propose a generalization of the metabolic pathway prediction problem using a single binary classifier that accepts both features representing a metabolite and features representing a generic pathway category and then predicts whether the given metabolite is involved in the corresponding pathway category. We demonstrate that this metabolite-pathway features-pair approach is not only competitive with the combined performance of training separate binary classifiers, but it outperforms the previous benchmark models.
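One way to picture the features-pair formulation is as a cross product of metabolites and pathway categories, where each row concatenates a metabolite feature vector with a pathway feature vector. The sketch below is only an illustration under assumptions: the feature vectors, identifiers, and the `build_pair_dataset` helper are hypothetical and do not reproduce the feature engineering used in the paper.

```python
def build_pair_dataset(metabolite_features, pathway_features, memberships):
    """Build one row per (metabolite, pathway) pair.

    Each row is the metabolite vector followed by the pathway vector.
    The label is 1 if the pair is a known membership, otherwise 0.
    """
    rows, labels = [], []
    for m_id, m_vec in metabolite_features.items():
        for p_id, p_vec in pathway_features.items():
            rows.append(list(m_vec) + list(p_vec))
            labels.append(1 if (m_id, p_id) in memberships else 0)
    return rows, labels

# Hypothetical toy inputs (not taken from KEGG).
metabolites = {"M1": [1, 0, 1], "M2": [0, 1, 1]}
pathways = {"P1": [1, 0], "P2": [0, 1], "P3": [1, 1]}
known = {("M1", "P1"), ("M2", "P3")}

rows, labels = build_pair_dataset(metabolites, pathways, known)
```

A single binary classifier trained on `rows` and `labels` then sees every positive entry from every pathway category, rather than each per-category classifier seeing only its own positives.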
Affiliation(s)
- Erik D. Huckvale
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA
- Hunter N.B. Moseley
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40506, USA
  - Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40536, USA
  - Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, KY 40506, USA
  - Institute for Biomedical Informatics, University of Kentucky, Lexington, KY 40506, USA
3
Huckvale ED, Powell CD, Jin H, Moseley HNB. Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites. Metabolites 2023; 13:1120. [PMID: 37999216 PMCID: PMC10673125 DOI: 10.3390/metabo13111120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Revised: 10/25/2023] [Accepted: 10/30/2023] [Indexed: 11/25/2023]
Abstract
Metabolic pathways are a human-defined grouping of life-sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.
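For readers unfamiliar with the two metrics reported in this abstract, the following sketch computes an F1 score and a Matthews correlation coefficient (MCC) from binary predictions using the standard confusion-matrix formulas. The toy labels are invented for illustration, and the sketch does not attempt to reproduce the weighted averaging across folds and pathway categories, whose exact scheme is not given in the abstract.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined here as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical predictions for one pathway category.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
f1 = f1_score(tp, fp, fn)
m = mcc(tp, tn, fp, fn)
```

Unlike F1, MCC also takes true negatives into account, which matters when positives are rare, as is typical for per-pathway membership labels.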
Affiliation(s)
- Erik D. Huckvale
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40506, USA
- Christian D. Powell
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40506, USA
  - Department of Computer Science (Data Science Program), University of Kentucky, Lexington, KY 40506, USA
- Huan Jin
  - Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40536, USA
- Hunter N. B. Moseley
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40506, USA
  - Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40536, USA
  - Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, KY 40506, USA
  - Institute for Biomedical Informatics, University of Kentucky, Lexington, KY 40506, USA
4
Huckvale ED, Powell CD, Jin H, Moseley HN. Benchmark dataset for training machine learning models to predict the pathway involvement of metabolites. bioRxiv 2023:2023.10.03.560715. [PMID: 37873272 PMCID: PMC10592640 DOI: 10.1101/2023.10.03.560715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
Metabolic pathways are a human-defined grouping of life-sustaining biochemical reactions, metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old metabolite KEGG-based datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models generated from our new benchmark dataset. Best overall weighted average performance across 1000 unique folds was an F1-score of 0.8180 and Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.
Affiliation(s)
- Erik D. Huckvale
  - Department of Computer Science (Data Science Program), University of Kentucky, Lexington, KY 40506, USA
- Christian D. Powell
  - Department of Computer Science (Data Science Program), University of Kentucky, Lexington, KY 40506, USA
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40506, USA
- Huan Jin
  - Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40536, USA
- Hunter N.B. Moseley
  - Markey Cancer Center, University of Kentucky, Lexington, KY 40506, USA
  - Superfund Research Center, University of Kentucky, Lexington, KY 40506, USA
  - Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, KY 40506, USA
  - Institute for Biomedical Informatics, University of Kentucky, Lexington, KY 40506, USA
5
Huckvale ED, Hodgman MW, Greenwood BB, Stucki DO, Ward KM, Ebbert MTW, Kauwe JSK, Miller JB. Pairwise Correlation Analysis of the Alzheimer's Disease Neuroimaging Initiative (ADNI) Dataset Reveals Significant Feature Correlation. Genes (Basel) 2021; 12:1661. [PMID: 34828267 PMCID: PMC8619902 DOI: 10.3390/genes12111661] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/27/2021] [Revised: 10/18/2021] [Accepted: 10/20/2021] [Indexed: 12/04/2022]
Abstract
The Alzheimer's Disease Neuroimaging Initiative (ADNI) contains extensive patient measurements (e.g., magnetic resonance imaging [MRI], biometrics, RNA expression, etc.) from Alzheimer's disease (AD) cases and controls that have recently been used by machine learning algorithms to evaluate AD onset and progression. While using a variety of biomarkers is essential to AD research, highly correlated input features can significantly decrease machine learning model generalizability and performance. Additionally, redundant features unnecessarily increase computational time and resources necessary to train predictive models. Therefore, we used 49,288 biomarkers and 793,600 extracted MRI features to assess feature correlation within the ADNI dataset to determine the extent to which this issue might impact large scale analyses using these data. We found that 93.457% of biomarkers, 92.549% of the gene expression values, and 100% of MRI features were strongly correlated with at least one other feature in ADNI based on our Bonferroni corrected α (p-value ≤ 1.40754 × 10⁻¹³). We provide a comprehensive mapping of all ADNI biomarkers to highly correlated features within the dataset. Additionally, we show that significant correlation within the ADNI dataset should be resolved before performing bulk data analyses, and we provide recommendations to address these issues. We anticipate that these recommendations and resources will help guide researchers utilizing the ADNI dataset to increase model performance and reduce the cost and complexity of their analyses.
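The Bonferroni-corrected threshold in this abstract can be sanity-checked with a short calculation. The sketch below assumes a starting α of 0.05 and one test per unordered pair among the 49,288 + 793,600 features. Both are inferences made here, not statements from the abstract, although the resulting value agrees with the reported threshold to the precision given. A plain Pearson correlation helper is included for completeness, and all function names are hypothetical.

```python
import math

def pair_count(n_features):
    """Number of unordered feature pairs: n * (n - 1) / 2."""
    return n_features * (n_features - 1) // 2

def bonferroni_threshold(alpha, n_tests):
    """Bonferroni correction: divide the family-wise alpha by the number of tests."""
    return alpha / n_tests

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Feature counts are from the abstract; alpha = 0.05 is an assumption.
n_features = 49_288 + 793_600
n_pairs = pair_count(n_features)
threshold = bonferroni_threshold(0.05, n_pairs)

# Toy check of the correlation helper on perfectly correlated data.
r = pearson_r([1, 2, 3], [2, 4, 6])
```

The very small threshold follows directly from the number of pairwise tests, which grows quadratically with the number of features.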
Affiliation(s)
- Erik D. Huckvale
  - Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY 40536, USA; (E.D.H.); (M.W.H.); (M.T.W.E.)
- Matthew W. Hodgman
  - Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY 40536, USA; (E.D.H.); (M.W.H.); (M.T.W.E.)
- Brianna B. Greenwood
  - Department of Biology, Brigham Young University, Provo, UT 84602, USA; (B.B.G.); (D.O.S.); (K.M.W.); (J.S.K.K.)
- Devorah O. Stucki
  - Department of Biology, Brigham Young University, Provo, UT 84602, USA; (B.B.G.); (D.O.S.); (K.M.W.); (J.S.K.K.)
- Katrisa M. Ward
  - Department of Biology, Brigham Young University, Provo, UT 84602, USA; (B.B.G.); (D.O.S.); (K.M.W.); (J.S.K.K.)
- Mark T. W. Ebbert
  - Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY 40536, USA; (E.D.H.); (M.W.H.); (M.T.W.E.)
- John S. K. Kauwe
  - Department of Biology, Brigham Young University, Provo, UT 84602, USA; (B.B.G.); (D.O.S.); (K.M.W.); (J.S.K.K.)
- Justin B. Miller
  - Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY 40536, USA; (E.D.H.); (M.W.H.); (M.T.W.E.)