1
Grady SK, Dojcsak L, Harville EW, Wallace ME, Vilda D, Donneyong MM, Hood DB, Valdez RB, Ramesh A, Im W, Matthews-Juarez P, Juarez PD, Langston MA. Seminar: Scalable Preprocessing Tools for Exposomic Data Analysis. Environmental Health Perspectives 2023; 131:124201. [PMID: 38109119 PMCID: PMC10727037 DOI: 10.1289/ehp12901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Revised: 11/22/2023] [Accepted: 11/28/2023] [Indexed: 12/19/2023]
Abstract
BACKGROUND The exposome serves as a popular framework in which to study exposures from chemical and nonchemical stressors across the life course and the differing roles that these exposures can play in human health. As a result, data relevant to the exposome have been used as a resource in the quest to untangle complicated health trajectories and help connect the dots from exposures to adverse outcome pathways. OBJECTIVES The primary aim of this methods seminar is to clarify and review preprocessing techniques critical for accurate and effective external exposomic data analysis. Scalability is emphasized through an application of highly innovative combinatorial techniques coupled with more traditional statistical strategies. The Public Health Exposome is used as an archetypical model. The novelty and innovation of this seminar's focus stem from its methodical, comprehensive treatment of preprocessing and its demonstration of the positive effects preprocessing can have on downstream analytics. DISCUSSION State-of-the-art technologies are described for data harmonization and to mitigate noise, which can stymie downstream interpretation, and to select key exposomic features, without which analytics may lose focus. A main task is the reduction of multicollinearity, a particularly formidable problem that frequently arises from repeated measurements of similar events taken at various times and from multiple sources. Empirical results highlight the effectiveness of a carefully planned preprocessing workflow as demonstrated in the context of more highly concentrated variable lists, improved correlational distributions, and enhanced downstream analytics for latent relationship discovery. The nascent field of exposome science can be characterized by the need to analyze and interpret a complex confluence of highly inhomogeneous spatial and temporal data, which may present formidable challenges to even the most powerful analytical tools. A systematic approach to preprocessing can therefore provide an essential first step in the application of modern computer and data science methods. https://doi.org/10.1289/EHP12901.
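Editorial illustration: the abstract names multicollinearity reduction as a main preprocessing task but does not spell out a procedure. The sketch below is a deliberately simple greedy correlation filter, not the combinatorial techniques the authors describe. The function names, the 0.9 threshold, and the variable names in the usage lines are all hypothetical, and every column is assumed to be numeric and non-constant.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_collinear(columns, threshold=0.9):
    """Keep a variable only if |r| with every already-kept variable is below threshold.

    columns: dict mapping variable name -> list of values (insertion order decides
    which member of a correlated group survives).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical repeated measurements of the same exposure in two years,
# plus one unrelated variable.
example = {
    "pm25_2019": [1.0, 2.0, 3.0, 4.0, 5.0],
    "pm25_2020": [1.1, 2.0, 3.2, 3.9, 5.1],
    "income": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = drop_collinear(example)
```

In this toy input the second yearly measurement is nearly a copy of the first, so only one of the two is retained alongside the unrelated variable.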
Affiliation(s)
- Stephen K. Grady
  - Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, Tennessee, USA
- Levente Dojcsak
  - Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, Tennessee, USA
- Emily W. Harville
  - Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine, New Orleans, Louisiana, USA
- Maeve E. Wallace
  - Department of Social, Behavioral, and Population Sciences, Tulane University School of Public Health and Tropical Medicine, New Orleans, Louisiana, USA
- Dovile Vilda
  - Department of Social, Behavioral, and Population Sciences, Tulane University School of Public Health and Tropical Medicine, New Orleans, Louisiana, USA
- Darryl B. Hood
  - Division of Environmental Health Sciences, College of Public Health, Ohio State University, Columbus, Ohio, USA
- R. Burciaga Valdez
  - Department of Economics, University of New Mexico, Albuquerque, New Mexico, USA
- Aramandla Ramesh
  - Department of Biochemistry, Cancer Biology, Neuroscience & Pharmacology, Meharry Medical College, Nashville, Tennessee, USA
- Wansoo Im
  - Department of Family and Community Medicine, Meharry Medical College, Nashville, Tennessee, USA
- Paul D. Juarez
  - Department of Family and Community Medicine, Meharry Medical College, Nashville, Tennessee, USA
  - Institute on Health Disparities, Equity, and the Exposome, Meharry Medical College, Nashville, Tennessee, USA
- Michael A. Langston
  - Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, Tennessee, USA
2
Machine learning algorithm-generated and multi-center validated melanoma prognostic signature with inspiration for treatment management. Cancer Immunol Immunother 2023; 72:599-615. [PMID: 35998003 DOI: 10.1007/s00262-022-03279-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 05/27/2022] [Accepted: 08/12/2022] [Indexed: 10/15/2022]
Abstract
BACKGROUND Although immunotherapy and targeted treatments have dramatically improved the survival of melanoma patients, the intra- or intertumoral heterogeneity and drug resistance have hindered the further expansion of clinical benefits. METHODS The 96 combination frames constructed by ten machine learning algorithms identified a prognostic consensus signature based on 1002 melanoma samples from nine independent cohorts. Clinical features and 26 published signatures were employed to compare the predictive performance of our model. RESULTS A machine learning-based prognostic signature (MLPS) with the highest average C-index was developed via 96 algorithm combinations. The MLPS has a stable and excellent predictive performance for overall survival, superior to common clinical traits and 26 collected signatures. The low MLPS group with a better prognosis had significantly enriched immune-related pathways, tending to be an immune-hot phenotype and possessing potential immunotherapeutic responses to anti-PD-1, anti-CTLA-4, and MAGE-A3. On the contrary, the high MLPS group with more complex genomic alterations and poorer prognoses is more sensitive to the BRAF inhibitor dabrafenib, confirmed in patients with BRAF mutations. CONCLUSION MLPS could independently and stably predict the prognosis of melanoma, considered a promising biomarker to identify patients suitable for immunotherapy and those with BRAF mutations who would benefit from dabrafenib.
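Editorial illustration: the abstract ranks models by average C-index without defining it. The sketch below computes Harrell's concordance index in plain Python for right-censored data. It is a simplified version that skips pairs with tied observation times, and the function name and toy inputs are hypothetical rather than taken from the study.

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed follow-up times
    events: 1 if the event (death) was observed, 0 if censored
    risks:  predicted risk scores (higher = worse expected prognosis)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the earlier subject had an event;
        # pairs with tied times are skipped in this simplified version.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

perfect = concordance_index([1, 2, 3, 4], [1, 1, 1, 1], [4, 3, 2, 1])
reversed_order = concordance_index([1, 2, 3, 4], [1, 1, 1, 1], [1, 2, 3, 4])
```

A value of 1.0 means the risk score orders every comparable pair correctly, 0.5 corresponds to random ordering, and 0.0 to a fully inverted ordering.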
3
Huang YH, Ku HM, Wang CA, Chen LY, He SS, Chen S, Liao PC, Juan PY, Kao CF. A multiple phenotype imputation method for genetic diversity and core collection in Taiwanese vegetable soybean. Frontiers in Plant Science 2022; 13:948349. [PMID: 36119593 PMCID: PMC9480828 DOI: 10.3389/fpls.2022.948349] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/19/2022] [Accepted: 07/25/2022] [Indexed: 06/15/2023]
Abstract
Establishment of vegetable soybean (edamame) [Glycine max (L.) Merr.] germplasms has been highly valued in Asia and the United States owing to the increasing market demand for edamame. The idea of core collection (CC) is to shorten the breeding program so as to improve the availability of germplasm resources. However, multidimensional phenotypes typically are highly correlated and have different missing rates, often failing to capture the underlying pattern of germplasms and select CC precisely. These are commonly observed on correlated samples. To address this scenario, we introduced the "multiple imputation" (MI) method to iteratively impute missing phenotypes for 46 morphological traits and jointly analyzed high-dimensional imputed missing phenotypes (EC_impu) to explore population structure and relatedness among 200 Taiwanese vegetable soybean accessions. An advanced maximization strategy with a heuristic algorithm and PowerCore was used to evaluate the morphological diversity among the EC_impu. In total, 36 accessions (denoted as CC_impu) were efficiently selected representing high diversity and the entire coverage of the EC_impu. Only 4 (8.7%) traits showed slightly significant differences between the CC_impu and EC_impu. Compared to the EC_impu, 96% of traits retained all characteristics or had a slight diversity loss in the CC_impu. The CC_impu exhibited a small percentage of significant mean difference (4.51%), and large coincidence rate (98.1%), variable rate (138.76%), and coverage (close to 100%), indicating the representativeness of the EC_impu. We noted that the CC_impu outperformed the CC_raw in evaluation properties, suggesting that the multiple phenotype imputation method has the potential to deal with missing phenotypes in correlated samples efficiently and reliably without re-phenotyping accessions. Our results illustrated a significant role of imputed missing phenotypes in support of the MI-based framework for plant-breeding programs.
Affiliation(s)
- Yen-Hsiang Huang
  - Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
- Hsin-Mei Ku
  - Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
- Chong-An Wang
  - Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
- Ling-Yu Chen
  - Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
- Shan-Syue He
  - Department of Agronomy, College of Bioresources and Agriculture, National Taiwan University, Taipei, Taiwan
- Shu Chen
  - Plant Germplasm Division, Taiwan Agricultural Research Institute, Taichung, Taiwan
- Po-Chun Liao
  - Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
- Pin-Yuan Juan
  - Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
- Chung-Feng Kao
  - Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
  - Advanced Plant Biotechnology Center, National Chung Hsing University, Taichung, Taiwan
4
Imputation of Missing Values for Multi-Biospecimen Metabolomics Studies: Bias and Effects on Statistical Validity. Metabolites 2022; 12:671. [PMID: 35888795 PMCID: PMC9317643 DOI: 10.3390/metabo12070671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2022] [Revised: 07/07/2022] [Accepted: 07/19/2022] [Indexed: 02/05/2023]
Abstract
The analysis of high-throughput metabolomics mass spectrometry data across multiple biological sample types (biospecimens) poses challenges due to missing data. During differential abundance analysis, dropping samples with missing values can lead to severe loss of data as well as biased results in group comparisons and effect size estimates. However, the imputation of missing data (the process of replacing missing data with estimated values such as a mean) may compromise the inherent intra-subject correlation of a metabolite across multiple biospecimens from the same subject, which in turn may compromise the efficacy of the statistical analysis of differential metabolites in biomarker discovery. We investigated imputation strategies when considering multiple biospecimens from the same subject. We compared a novel, but simple, approach that consists of combining the two biospecimen data matrices (rows and columns of subjects and metabolites) and imputes the two biospecimen data matrices together to an approach that imputes each biospecimen data matrix separately. We then compared the bias in the estimation of the intra-subject multi-specimen correlation and its effects on the validity of statistical significance tests between two approaches. The combined approach to multi-biospecimen studies has not been evaluated previously even though it is intuitive and easy to implement. We examine these two approaches for five imputation methods: random forest, k nearest neighbor, expectation-maximization with bootstrap, quantile regression, and half the minimum observed value. Combining the biospecimen data matrices for imputation did not greatly increase efficacy in conserving the correlation structure or improving accuracy in the statistical conclusions for most of the methods examined. Random forest tended to outperform the other methods in all performance metrics, except specificity.
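Editorial illustration: the abstract contrasts imputing two biospecimen matrices separately with imputing them after combining them. The sketch below shows that data layout together with the simplest of the five listed methods, half the minimum observed value. The function name, the use of `None` for missing values, and the toy plasma and urine matrices are hypothetical and are not taken from the study.

```python
def half_min_impute(matrix):
    """Replace each missing value (None) with half the minimum observed value
    of its column. Rows are subjects, columns are metabolites."""
    n_cols = len(matrix[0])
    fills = []
    for c in range(n_cols):
        observed = [row[c] for row in matrix if row[c] is not None]
        # A column with no observed values cannot be imputed and stays None.
        fills.append(min(observed) / 2 if observed else None)
    return [[fills[c] if v is None else v for c, v in enumerate(row)]
            for row in matrix]

# Two biospecimens measured on the same three subjects (same row order).
plasma = [[4.0, None], [None, 10.0], [2.0, 6.0]]
urine = [[None], [8.0], [3.0]]

# Separate approach: impute each matrix on its own, then place side by side.
separate = [p + u for p, u in zip(half_min_impute(plasma), half_min_impute(urine))]

# Combined approach: place the matrices side by side first, then impute once.
combined = half_min_impute([p + u for p, u in zip(plasma, urine)])
```

Half-minimum imputation works one column at a time, so both approaches return identical values here. The two approaches can only diverge for methods that borrow information across columns, such as random forest or k nearest neighbor.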
5
Ampong I, Zimmerman KD, Nathanielsz PW, Cox LA, Olivier M. Optimization of Imputation Strategies for High-Resolution Gas Chromatography-Mass Spectrometry (HR GC-MS) Metabolomics Data. Metabolites 2022; 12:429. [PMID: 35629933 PMCID: PMC9144635 DOI: 10.3390/metabo12050429] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/06/2022] [Revised: 05/07/2022] [Accepted: 05/09/2022] [Indexed: 12/17/2022]
Abstract
Gas chromatography-coupled mass spectrometry (GC-MS) has been used in biomedical research to analyze volatile, non-polar, and polar metabolites in a wide array of sample types. Despite advances in technology, missing values are still common in metabolomics datasets and must be properly handled. We evaluated the performance of ten commonly used missing value imputation methods with metabolites analyzed on an HR GC-MS instrument. By introducing missing values into the complete (i.e., data without any missing values) National Institute of Standards and Technology (NIST) plasma dataset, we demonstrate that random forest (RF), glmnet ridge regression (GRR), and Bayesian principal component analysis (BPCA) shared the lowest root mean squared error (RMSE) in technical replicate data. Further examination of these three methods in data from baboon plasma and liver samples demonstrated they all maintained high accuracy. Overall, our analysis suggests that any of the three imputation methods can be applied effectively to untargeted metabolomics datasets with high accuracy. However, it is important to note that imputation will alter the correlation structure of the dataset and bias downstream regression coefficients and p-values.
Affiliation(s)
- Isaac Ampong
  - Center for Precision Medicine, Department of Internal Medicine, Section on Molecular Medicine, Wake Forest University, Winston-Salem, NC 27157, USA
- Kip D. Zimmerman
  - Center for Precision Medicine, Department of Internal Medicine, Section on Molecular Medicine, Wake Forest University, Winston-Salem, NC 27157, USA
- Peter W. Nathanielsz
  - Center for the Study of Fetal Programming, University of Wyoming, Laramie, WY 82071, USA
  - Southwest National Primate Research Center, San Antonio, TX 78227, USA
- Laura A. Cox
  - Center for Precision Medicine, Department of Internal Medicine, Section on Molecular Medicine, Wake Forest University, Winston-Salem, NC 27157, USA
  - Southwest National Primate Research Center, San Antonio, TX 78227, USA
- Michael Olivier
  - Center for Precision Medicine, Department of Internal Medicine, Section on Molecular Medicine, Wake Forest University, Winston-Salem, NC 27157, USA
6
Muller J, Garrison L, Ulbrich P, Schreiber S, Bruckner S, Hauser H, Oeltze-Jafra S. Integrated Dual Analysis of Quantitative and Qualitative High-Dimensional Data. IEEE Transactions on Visualization and Computer Graphics 2021; 27:2953-2966. [PMID: 33534707 DOI: 10.1109/tvcg.2021.3056424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
The Dual Analysis framework is a powerful enabling technology for the exploration of high dimensional quantitative data by treating data dimensions as first-class objects that can be explored in tandem with data values. In this article, we extend the Dual Analysis framework through the joint treatment of quantitative (numerical) and qualitative (categorical) dimensions. Computing common measures for all dimensions allows us to visualize both quantitative and qualitative dimensions in the same view. This enables a natural joint treatment of mixed data during interactive visual exploration and analysis. Several measures of variation for nominal qualitative data can also be applied to ordinal qualitative and quantitative data. For example, instead of measuring variability from a mean or median, other measures assess inter-data variation or average variation from a mode. In this work, we demonstrate how these measures can be integrated into the Dual Analysis framework to explore and generate hypotheses about high-dimensional mixed data. A medical case study using clinical routine data of patients suffering from Cerebral Small Vessel Disease (CSVD), conducted with a senior neurologist and a medical student, shows that a joint Dual Analysis approach for quantitative and qualitative data can rapidly lead to new insights based on which new hypotheses may be generated.
7
Liu X, Liu P, Chernock RD, Yang Z, Lang Kuhs KA, Lewis JS, Luo J, Li H, Gay HA, Thorstad WL, Wang X. A MicroRNA Expression Signature as Prognostic Marker for Oropharyngeal Squamous Cell Carcinoma. J Natl Cancer Inst 2021; 113:752-759. [PMID: 33057626 PMCID: PMC8168274 DOI: 10.1093/jnci/djaa161] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 06/09/2020] [Revised: 08/05/2020] [Accepted: 09/28/2020] [Indexed: 12/19/2022]
Abstract
BACKGROUND Improved prognostication of oropharyngeal squamous cell carcinoma (OPSCC) may facilitate individualized patient management. The goal of this study was to develop and validate a prognostic signature based on microRNA sequencing (miRNA-seq) analysis. METHODS We collected tumor specimens for miRNA-seq analysis from OPSCC patients treated at Washington University in St Louis (n = 324) and Vanderbilt University (n = 130). OPSCC patients (n = 79) from The Cancer Genome Atlas Program were also included for independent validation. Univariate and multivariable Cox regression analyses were performed to identify miRNAs associated with disease outcomes. All statistical tests were 2-sided. RESULTS By miRNA-seq profiling analysis, we identified a 26-miRNA signature. Based on computed risk scores of the signature, we classified the patients into low- and high-risk groups. In the training cohort, the high-risk group had much shorter overall survival compared with the low-risk group (hazard ratio [HR] = 3.80, 95% confidence interval [CI] = 2.37 to 6.10, P < .001). Subgroup analysis further revealed that the signature was prognostic for HPV-positive OPSCCs (HR = 3.07, 95% CI = 1.65 to 5.71, P < .001). Multivariable analysis indicated that the signature was independent of common clinicopathologic factors for OPSCCs. Importantly, the miRNA signature was a statistically significant predictor of overall survival in independent validation cohorts (The Cancer Genome Atlas Program cohort: HR = 6.05, 95% CI = 2.10 to 17.37, P < .001; Vanderbilt cohort: HR = 7.98, 95% CI = 3.99 to 15.97, P < .001; Vanderbilt HPV-positive cohort: HR = 8.71, 95% CI = 2.70 to 28.14, P < .001). CONCLUSIONS The miRNA signature is a robust and independent prognostic tool for risk stratification of OPSCCs including HPV-positive OPSCCs.
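Editorial illustration: the abstract says patients were classified into low- and high-risk groups from computed risk scores, but it does not state the formula or the cutoff. The sketch below shows one common convention, assumed here rather than taken from the paper: a linear score weighted by Cox regression coefficients, split at the cohort median. The miRNA names, coefficients, and expression values are hypothetical.

```python
from statistics import median

def risk_scores(expression, coefficients):
    """Linear risk score per patient: sum of coefficient * expression over the
    signature. expression is a list of dicts mapping miRNA name -> value."""
    return [sum(coefficients[m] * row[m] for m in coefficients)
            for row in expression]

def split_by_median(scores):
    """Label each patient 'high' if the score exceeds the cohort median, else 'low'."""
    cutoff = median(scores)
    return ["high" if s > cutoff else "low" for s in scores]

# Hypothetical two-miRNA signature; positive weights raise risk, negative lower it.
coefficients = {"miR-a": 0.5, "miR-b": -1.0}
patients = [
    {"miR-a": 2.0, "miR-b": 1.0},
    {"miR-a": 4.0, "miR-b": 0.0},
    {"miR-a": 0.0, "miR-b": 2.0},
    {"miR-a": 4.0, "miR-b": 1.0},
]
scores = risk_scores(patients, coefficients)
groups = split_by_median(scores)
```

The resulting group labels are what a hazard-ratio comparison of the kind reported in the abstract would then be computed on.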
Affiliation(s)
- Xinyi Liu
  - Department of Radiation Oncology, Washington University School of Medicine, St Louis, MO, USA
- Ping Liu
  - Department of Radiation Oncology, Washington University School of Medicine, St Louis, MO, USA
- Rebecca D Chernock
  - Department of Pathology and Immunology, Washington University School of Medicine, St Louis, MO, USA
- Zhenming Yang
  - Department of Radiation Oncology, Washington University School of Medicine, St Louis, MO, USA
- Krystle A Lang Kuhs
  - Department of Otolaryngology, Vanderbilt University Medical Center, Nashville, TN, USA
- James S. Lewis
  - Department of Surgery, Washington University School of Medicine, St Louis, MO, USA
  - Department of Otolaryngology, Vanderbilt University Medical Center, Nashville, TN, USA
- Jingqin Luo
  - Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, TN, USA
- Hua Li
  - Department of Radiation Oncology, Washington University School of Medicine, St Louis, MO, USA
- Hiram A Gay
  - Department of Radiation Oncology, Washington University School of Medicine, St Louis, MO, USA
- Wade L Thorstad
  - Department of Radiation Oncology, Washington University School of Medicine, St Louis, MO, USA
- Xiaowei Wang
  - Department of Radiation Oncology, Washington University School of Medicine, St Louis, MO, USA
8
Wang C, Plusquin M, Ghantous A, Herceg Z, Alfano R, Cox B, Nawrot TS. DNA methylation of insulin-like growth factor 2 and H19 cluster in cord blood and prenatal air pollution exposure to fine particulate matter. Environ Health 2020; 19:129. [PMID: 33287817 PMCID: PMC7720562 DOI: 10.1186/s12940-020-00677-9] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 07/06/2020] [Accepted: 11/13/2020] [Indexed: 05/06/2023]
Abstract
BACKGROUND The IGF2 (insulin-like growth factor 2) and H19 gene cluster plays an important role during pregnancy as it promotes both foetal and placental growth. We investigated the association between cord blood DNA methylation status of the IGF2/H19 gene cluster and maternal fine particulate matter exposure during fetal life. To the best of our knowledge, this is the first study investigating the association between prenatal PM2.5 exposure and newborn DNA methylation of the IGF2/H19. METHODS Cord blood DNA methylation status of IGF2/H19 cluster was measured in 189 mother-newborn pairs from the ENVIRONAGE birth cohort (Flanders, Belgium). We assessed the sex-specific association between residential PM2.5 exposure during pregnancy and the methylation level of CpG loci mapping to the IGF2/H19 cluster, and identified prenatal vulnerability by investigating susceptible time windows of exposure. We also addressed the biological functionality of DNA methylation level in the gene cluster. RESULTS Prenatal PM2.5 exposure was found to have genetic region-specific significant association with IGF2 and H19 during specific gestational weeks. The association was found to be sex-specific in both gene regions. Functionality of the DNA methylation was annotated by the association to fetal growth and cellular pathways. CONCLUSIONS The results of our study provided evidence that prenatal PM2.5 exposure is associated with DNA methylation in newborns' IGF2/H19. The consequences within the context of fetal development of future phenotyping should be addressed.
Affiliation(s)
- Congrong Wang
  - Centre for Environmental Sciences, Hasselt University, Agoralaan gebouw D, 3590 Diepenbeek, Hasselt, Belgium
- Michelle Plusquin
  - Centre for Environmental Sciences, Hasselt University, Agoralaan gebouw D, 3590 Diepenbeek, Hasselt, Belgium
- Akram Ghantous
  - Epigenetics Group, International Agency for Research on Cancer (IARC), Lyon, France
- Zdenko Herceg
  - Epigenetics Group, International Agency for Research on Cancer (IARC), Lyon, France
- Rossella Alfano
  - Centre for Environmental Sciences, Hasselt University, Agoralaan gebouw D, 3590 Diepenbeek, Hasselt, Belgium
- Bianca Cox
  - Centre for Environmental Sciences, Hasselt University, Agoralaan gebouw D, 3590 Diepenbeek, Hasselt, Belgium
- Tim S. Nawrot
  - Centre for Environmental Sciences, Hasselt University, Agoralaan gebouw D, 3590 Diepenbeek, Hasselt, Belgium
  - Department of Public Health and Primary Care, Leuven University, Leuven, Belgium
9
Eicher T, Kinnebrew G, Patt A, Spencer K, Ying K, Ma Q, Machiraju R, Mathé EA. Metabolomics and Multi-Omics Integration: A Survey of Computational Methods and Resources. Metabolites 2020; 10:E202. [PMID: 32429287 PMCID: PMC7281435 DOI: 10.3390/metabo10050202] [Citation(s) in RCA: 61] [Impact Index Per Article: 15.3] [Received: 04/15/2020] [Revised: 05/07/2020] [Accepted: 05/13/2020] [Indexed: 02/06/2023]
Abstract
As researchers are increasingly able to collect data on a large scale from multiple clinical and omics modalities, multi-omics integration is becoming a critical component of metabolomics research. This introduces a need for increased understanding by the metabolomics researcher of computational and statistical analysis methods relevant to multi-omics studies. In this review, we discuss common types of analyses performed in multi-omics studies and the computational and statistical methods that can be used for each type of analysis. We pinpoint the caveats and considerations for analysis methods, including required parameters, sample size and data distribution requirements, sources of a priori knowledge, and techniques for the evaluation of model accuracy. Finally, for the types of analyses discussed, we provide examples of the applications of corresponding methods to clinical and basic research. We intend that our review may be used as a guide for metabolomics researchers to choose effective techniques for multi-omics analyses relevant to their field of study.
Affiliation(s)
- Tara Eicher
  - Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
  - Computer Science and Engineering Department, The Ohio State University College of Engineering, Columbus, OH 43210, USA
- Garrett Kinnebrew
  - Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
  - Comprehensive Cancer Center, The Ohio State University and James Cancer Hospital, Columbus, OH 43210, USA
  - Bioinformatics Shared Resource Group, The Ohio State University, Columbus, OH 43210, USA
- Andrew Patt
  - Division of Preclinical Innovation, National Center for Advancing Translational Sciences, NIH, 9800 Medical Center Dr., Rockville, MD, 20892, USA
  - Biomedical Sciences Graduate Program, The Ohio State University, Columbus, OH 43210, USA
- Kyle Spencer
  - Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
  - Biomedical Sciences Graduate Program, The Ohio State University, Columbus, OH 43210, USA
  - Nationwide Children’s Research Hospital, Columbus, OH 43210, USA
- Kevin Ying
  - Comprehensive Cancer Center, The Ohio State University and James Cancer Hospital, Columbus, OH 43210, USA
  - Molecular, Cellular and Developmental Biology Program, The Ohio State University, Columbus, OH 43210, USA
- Qin Ma
  - Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
- Raghu Machiraju
  - Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
  - Computer Science and Engineering Department, The Ohio State University College of Engineering, Columbus, OH 43210, USA
  - Department of Pathology, Wexner Medical Center, The Ohio State University, Columbus, OH 43210, USA
  - Translational Data Analytics Institute, The Ohio State University, Columbus, OH 43210, USA
- Ewy A. Mathé
  - Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
  - Division of Preclinical Innovation, National Center for Advancing Translational Sciences, NIH, 9800 Medical Center Dr., Rockville, MD, 20892, USA
10
Long NP, Nghi TD, Kang YP, Anh NH, Kim HM, Park SK, Kwon SW. Toward a Standardized Strategy of Clinical Metabolomics for the Advancement of Precision Medicine. Metabolites 2020; 10:E51. [PMID: 32013105 PMCID: PMC7074059 DOI: 10.3390/metabo10020051] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 12/04/2019] [Revised: 01/17/2020] [Accepted: 01/21/2020] [Indexed: 12/18/2022]
Abstract
Despite the tremendous success, pitfalls have been observed in every step of a clinical metabolomics workflow, which impedes the internal validity of the study. Furthermore, the demand for logistics, instrumentations, and computational resources for metabolic phenotyping studies has far exceeded our expectations. In this conceptual review, we will cover inclusive barriers of a metabolomics-based clinical study and suggest potential solutions in the hope of enhancing study robustness, usability, and transferability. The importance of quality assurance and quality control procedures is discussed, followed by a practical rule containing five phases, including two additional "pre-pre-" and "post-post-" analytical steps. Besides, we will elucidate the potential involvement of machine learning and demonstrate that the need for automated data mining algorithms to improve the quality of future research is undeniable. Consequently, we propose a comprehensive metabolomics framework, along with an appropriate checklist refined from current guidelines and our previously published assessment, in the attempt to accurately translate achievements in metabolomics into clinical and epidemiological research. Furthermore, the integration of multifaceted multi-omics approaches with metabolomics as the pillar member is in urgent need. When combining with other social or nutritional factors, we can gather complete omics profiles for a particular disease. Our discussion reflects the current obstacles and potential solutions toward the progressing trend of utilizing metabolomics in clinical research to create the next-generation healthcare system.
Affiliation(s)
- Nguyen Phuoc Long
  - College of Pharmacy, Seoul National University, Seoul 08826, Korea
- Tran Diem Nghi
  - Department of Life Sciences, Pohang University of Science and Technology, Pohang 790-784, Korea
- Yun Pyo Kang
  - Department of Cancer Physiology, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
- Nguyen Hoang Anh
  - College of Pharmacy, Seoul National University, Seoul 08826, Korea
- Hyung Min Kim
  - College of Pharmacy, Seoul National University, Seoul 08826, Korea
- Sang Ki Park
  - Department of Life Sciences, Pohang University of Science and Technology, Pohang 790-784, Korea
- Sung Won Kwon
  - College of Pharmacy, Seoul National University, Seoul 08826, Korea
11
Kokla M, Virtanen J, Kolehmainen M, Paananen J, Hanhineva K. Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study. BMC Bioinformatics 2019; 20:492. [PMID: 31601178 PMCID: PMC6788053 DOI: 10.1186/s12859-019-3110-0] [Citation(s) in RCA: 86] [Impact Index Per Article: 17.2] [Received: 11/15/2018] [Accepted: 09/20/2019] [Indexed: 01/18/2023]
Abstract
BACKGROUND LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, non-targeted metabolite profiling approaches in particular generate vast arrays of data that are prone to aberrations such as missing values. No matter the reason for the missing values in the data, a coherent and complete data matrix is always a prerequisite for accurate and reliable statistical analysis. Therefore, there is a need for proper imputation strategies that account for the missingness and reduce the bias in the statistical analysis. RESULTS Here we present our results after evaluating nine imputation methods at four different percentages of missing values of different origins. The performance of each imputation method was analyzed by Normalized Root Mean Squared Error (NRMSE). We demonstrated that random forest (RF) had the lowest NRMSE in the estimation of missing values for Missing at Random (MAR) and Missing Completely at Random (MCAR). In the case of absent values due to Missing Not at Random (MNAR), the left-truncated data were best imputed with minimum value imputation. We also tested the different imputation methods for datasets containing missing data of various origins, and RF was the most accurate method in all cases. The results were obtained by repeating the evaluation process 100 times with the use of metabolomics datasets where the missing values were introduced to represent absent data of different origins. CONCLUSION The type and rate of missingness affect the performance and suitability of imputation methods. The RF-based imputation method performs best in most of the tested scenarios, including combinations of different types and rates of missingness. Therefore, we recommend using random forest-based imputation for imputing missing metabolomics data, especially in situations where the types of missingness are not known in advance.
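The NRMSE-based evaluation described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the NRMSE normalization (by the variance of the true values), the log-normal toy data, the 20% truncation rate, and the helper names `nrmse`, `mask_below_threshold`, and `impute_constant` are all assumptions made for illustration. The sketch hides the lowest values of a simulated feature (left-truncated, MNAR-like missingness) and compares minimum-value imputation with mean imputation on the hidden entries only.

```python
import math
import random


def nrmse(true_vals, imputed_vals):
    """RMSE of the imputed entries, normalized by the variance of the true values (assumed definition)."""
    n = len(true_vals)
    mean_true = sum(true_vals) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    mse = sum((i - t) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    return math.sqrt(mse / var_true)


def mask_below_threshold(values, fraction):
    """Simulate left-truncated missingness: hide the lowest `fraction` of values."""
    cutoff = sorted(values)[int(len(values) * fraction)]
    return [v if v >= cutoff else None for v in values]


def impute_constant(masked, fill):
    """Replace every missing entry (None) with a single constant."""
    return [fill if v is None else v for v in masked]


random.seed(0)
truth = [random.lognormvariate(2.0, 0.5) for _ in range(200)]  # toy intensities
masked = mask_below_threshold(truth, 0.2)
observed = [v for v in masked if v is not None]

# Score only the entries that were hidden, since those are the ones being estimated.
hidden_idx = [i for i, v in enumerate(masked) if v is None]
hidden_truth = [truth[i] for i in hidden_idx]

min_filled = impute_constant(masked, min(observed))
mean_filled = impute_constant(masked, sum(observed) / len(observed))

nrmse_min = nrmse(hidden_truth, [min_filled[i] for i in hidden_idx])
nrmse_mean = nrmse(hidden_truth, [mean_filled[i] for i in hidden_idx])
print(f"minimum-value NRMSE: {nrmse_min:.3f}, mean NRMSE: {nrmse_mean:.3f}")
```

In this toy setting every hidden value lies below the smallest observed value, so the minimum-value fill is always closer to the truth than the mean fill. That matches the direction of the MNAR finding reported above, but it does not reproduce the study's comparison of nine methods.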
Affiliation(s)
- Marietta Kokla
  - Institute of Public Health and Clinical Nutrition, University of Eastern Finland, Kuopio Campus, P.O. Box 1627, FI-70211 Kuopio, Finland
- Jyrki Virtanen
  - Institute of Public Health and Clinical Nutrition, University of Eastern Finland, Kuopio Campus, P.O. Box 1627, FI-70211 Kuopio, Finland
- Marjukka Kolehmainen
  - Institute of Public Health and Clinical Nutrition, University of Eastern Finland, Kuopio Campus, P.O. Box 1627, FI-70211 Kuopio, Finland
  - VTT Technical Research Centre of Finland Ltd, P.O. Box 1000, FI-02044 VTT Espoo, Finland
- Jussi Paananen
  - Institute of Biomedicine, University of Eastern Finland, Kuopio Campus, P.O. Box 1627, FI-70211 Kuopio, Finland
- Kati Hanhineva
  - Institute of Public Health and Clinical Nutrition, University of Eastern Finland, Kuopio Campus, P.O. Box 1627, FI-70211 Kuopio, Finland
|
12
|
Considine EC. The Search for Clinically Useful Biomarkers of Complex Disease: A Data Analysis Perspective. Metabolites 2019; 9:E126. [PMID: 31269649 PMCID: PMC6680669 DOI: 10.3390/metabo9070126] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 05/19/2019] [Revised: 06/20/2019] [Accepted: 06/28/2019] [Indexed: 12/25/2022]
Abstract
Unmet clinical diagnostic needs exist for many complex diseases, which (it is hoped) will be solved by the discovery of metabolomics biomarkers. However, at present, no diagnostic tests based on metabolomics have yet been introduced to the clinic. This review is presented as a research perspective on how data analysis methods in metabolomics biomarker discovery may contribute to the failure of biomarker studies and suggests how such failures might be mitigated. The study design and data pretreatment steps are reviewed briefly in this context, and the actual data analysis step is examined more closely.
Affiliation(s)
- Elizabeth C Considine
  - The Irish Centre for Fetal and Neonatal Translational Research (INFANT), Department of Obstetrics and Gynaecology, University College Cork, T12 YE02 Cork, Ireland
|
13
|
Jin Z, Kang J, Yu T. Missing value imputation for LC-MS metabolomics data by incorporating metabolic network and adduct ion relations. Bioinformatics 2019; 34:1555-1561. [PMID: 29272352 DOI: 10.1093/bioinformatics/btx816] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 04/05/2017] [Accepted: 12/19/2017] [Indexed: 12/20/2022]
Abstract
Motivation Metabolomics data generated from liquid chromatography-mass spectrometry platforms often contain missing values. Existing imputation methods do not consider underlying feature relations and the metabolic network information. As a result, the imputation results may not be optimal. Results We proposed an imputation algorithm that incorporates the existing metabolic network, adduct ion relations even for unknown compounds, as well as linear and nonlinear associations between feature intensities to build a feature-level network. The algorithm uses support vector regression for missing value imputation based on features in the neighborhood on the network. We compared our proposed method with methods being widely used. As judged by the normalized root mean squared error in real data-based simulations, our proposed methods can achieve better accuracy. Availability and implementation The R package is available at http://web1.sph.emory.edu/users/tyu8/MINMA. Contact jiankang@umich.edu or tianwei.yu@emory.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zhuxuan Jin
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA
- Jian Kang
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Tianwei Yu
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA
|
14
|
Do KT, Wahl S, Raffler J, Molnos S, Laimighofer M, Adamski J, Suhre K, Strauch K, Peters A, Gieger C, Langenberg C, Stewart ID, Theis FJ, Grallert H, Kastenmüller G, Krumsiek J. Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies. Metabolomics 2018; 14:128. [PMID: 30830398 PMCID: PMC6153696 DOI: 10.1007/s11306-018-1420-2] [Citation(s) in RCA: 121] [Impact Index Per Article: 20.2] [Received: 04/11/2018] [Accepted: 08/24/2018] [Indexed: 12/12/2022]
Abstract
BACKGROUND Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in biomedical studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation. METHODS We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established metabolic quantitative trait loci. RESULTS Run day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable. CONCLUSION Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend that KNN-based imputation is performed on observations with variable pre-selection since it showed robust results in all evaluation schemes.
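To make the recommended approach concrete, the following is a minimal sketch of KNN imputation on observations (samples). It is an illustration under stated assumptions, not the evaluated implementation: the function name `knn_impute`, the distance (root mean squared difference over features observed in both samples), the unweighted neighbor average, and the toy matrix are all hypothetical, and the variable pre-selection step mentioned in the abstract is omitted.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest samples.

    rows: list of samples, each a list of feature values (None marks a missing value).
    Returns a new matrix; the input is left unchanged.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbor must be a different sample that has feature j observed.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Average over shared features so distances stay comparable
                # when samples share different numbers of observed features.
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            if candidates:
                candidates.sort(key=lambda c: c[0])
                nearest = [v for _, v in candidates[:k]]
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Toy matrix: 4 samples x 3 features, one missing value in the second sample.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.9],
    [5.0, 6.0, 7.0],
]
filled = knn_impute(data, k=2)
print(filled[1])
```

Here the two nearest samples to the second row are the first and third rows, so the missing value is filled with the average of their third feature. An entry with no usable neighbor is left as `None` rather than guessed.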
Affiliation(s)
- Kieu Trinh Do
  - Institute of Computational Biology, Helmholtz-Zentrum München, Neuherberg, Germany
- Simone Wahl
  - Institute of Epidemiology II, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany
  - Research Unit of Molecular Epidemiology, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany
  - German Center for Diabetes Research (DZD e.V.), Neuherberg, Germany
- Johannes Raffler
  - Institute of Bioinformatics and Systems Biology, Helmholtz-Zentrum München, Neuherberg, Germany
- Sophie Molnos
  - Institute of Epidemiology II, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany
  - Research Unit of Molecular Epidemiology, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany
  - German Center for Diabetes Research (DZD e.V.), Neuherberg, Germany
- Michael Laimighofer
  - Institute of Computational Biology, Helmholtz-Zentrum München, Neuherberg, Germany
- Jerzy Adamski
  - Institute of Experimental Genetics, Genome Analysis Center, Helmholtz Zentrum München, Neuherberg, Germany
  - Lehrstuhl für Experimentelle Genetik, Technische Universität München, Freising, Germany
  - German Center for Cardiovascular Disease Research (DZHK e.V.), Munich, Germany
- Karsten Suhre
  - Department of Physiology and Biophysics, Weill Cornell Medical College in Qatar, Education City, Doha, Qatar
- Konstantin Strauch
  - Institute of Genetic Epidemiology, Helmholtz Zentrum München-German Research Center for Environmental Health, Neuherberg, Germany
  - Chair of Genetic Epidemiology, Institute of Medical Informatics, Biometry and Epidemiology, Ludwig-Maximilians-University, Munich, Germany
- Annette Peters
  - Institute of Epidemiology II, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany
  - Research Unit of Molecular Epidemiology, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany
- Christian Gieger
  - Institute of Epidemiology II, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany
  - Research Unit of Molecular Epidemiology, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany
- Fabian J Theis
  - Institute of Computational Biology, Helmholtz-Zentrum München, Neuherberg, Germany
  - Department of Mathematics, Technische Universität München, Garching, Germany
- Harald Grallert
  - Institute of Epidemiology II, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany
  - Research Unit of Molecular Epidemiology, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany
  - German Center for Diabetes Research (DZD e.V.), Neuherberg, Germany
- Gabi Kastenmüller
  - German Center for Diabetes Research (DZD e.V.), Neuherberg, Germany
  - Institute of Bioinformatics and Systems Biology, Helmholtz-Zentrum München, Neuherberg, Germany
- Jan Krumsiek
  - Institute of Computational Biology, Helmholtz-Zentrum München, Neuherberg, Germany
  - German Center for Diabetes Research (DZD e.V.), Neuherberg, Germany
  - Institute for Computational Biomedicine, Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, New York, USA
|
15
|
Ladva CN, Golan R, Greenwald R, Yu T, Sarnat SE, Flanders WD, Uppal K, Walker DI, Tran V, Liang D, Jones DP, Sarnat JA. Metabolomic profiles of plasma, exhaled breath condensate, and saliva are correlated with potential for air toxics detection. J Breath Res 2017; 12:016008. [PMID: 28808178 DOI: 10.1088/1752-7163/aa863c] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Indexed: 01/30/2023]
Abstract
INTRODUCTION Advances in the development of high-resolution metabolomics (HRM) have provided new opportunities for its use in characterizing exposures to environmental air pollutants and air pollution-related disease etiologies. Exposure assessment studies have considered blood, breath, and saliva as biological matrices suitable for measuring responses to air pollution exposures. The current study examines comparability among these three matrices using HRM and explores their potential for measuring mobile-source air toxics. METHODS Four participants provided saliva, exhaled breath condensate (EBC), and plasma before and after a 2 h road traffic exposure. Samples were analyzed on a Thermo Scientific QExactive MS system in positive electrospray ionization mode at a resolution of 70 000 full-width at half-maximum with C18 chromatography. Data were processed using apLCMS and xMSanalyzer on the R statistical platform. RESULTS The analysis yielded 7110, 6019, and 7747 reproducible features in plasma, EBC, and saliva, respectively. Correlations were moderate-to-strong (R = 0.41-0.80) across all pairwise comparisons of feature intensity within profiles, with the strongest between EBC and saliva. The associations of mean intensities between matrix pairs were positive and significant, controlling for subject and sampling time effects. Six out of 20 features shared in all three matrices putatively matched a list of known mobile-source air toxics. CONCLUSIONS Plasma, saliva, and EBC have largely comparable metabolic profiles measurable through HRM. These matrices have the potential to be used in identification and measurement of exposures to mobile-source air toxics, though further, targeted study is needed.
Affiliation(s)
- Chandresh Nanji Ladva
  - Department of Environmental Health, Rollins School of Public Health, Emory University, 1518 Clifton Road, Atlanta, GA 30322, United States of America
|
16
|
Guasch-Ferré M, Bhupathiraju SN, Hu FB. Use of Metabolomics in Improving Assessment of Dietary Intake. Clin Chem 2017; 64:82-98. [PMID: 29038146 DOI: 10.1373/clinchem.2017.272344] [Citation(s) in RCA: 172] [Impact Index Per Article: 24.6] [Received: 06/13/2017] [Accepted: 09/07/2017] [Indexed: 01/23/2023]
Abstract
BACKGROUND Nutritional metabolomics is rapidly evolving to integrate nutrition with complex metabolomics data to discover new biomarkers of nutritional exposure and status. CONTENT The purpose of this review is to provide a broad overview of the measurement techniques, study designs, and statistical approaches used in nutrition metabolomics, as well as to describe the current knowledge from epidemiologic studies identifying metabolite profiles associated with the intake of individual nutrients, foods, and dietary patterns. SUMMARY A wide range of technologies, databases, and computational tools are available to integrate nutritional metabolomics with dietary and phenotypic information. Biomarkers identified with the use of high-throughput metabolomics techniques include amino acids, acylcarnitines, carbohydrates, bile acids, purine and pyrimidine metabolites, and lipid classes. The most extensively studied food groups include fruits, vegetables, meat, fish, bread, whole grain cereals, nuts, wine, coffee, tea, cocoa, and chocolate. We identified 16 studies that evaluated metabolite signatures associated with dietary patterns. Dietary patterns examined included vegetarian and lactovegetarian diets, omnivorous diet, Western dietary patterns, prudent dietary patterns, Nordic diet, and Mediterranean diet. Although many metabolite biomarkers of individual foods and dietary patterns have been identified, those biomarkers may not be sensitive or specific to dietary intakes. Some biomarkers represent short-term intakes rather than long-term dietary habits. Nonetheless, nutritional metabolomics holds promise for the development of a robust and unbiased strategy for measuring diet. Still, this technology is intended to be complementary, rather than a replacement, to traditional well-validated dietary assessment methods such as food frequency questionnaires that can measure usual diet, the most relevant exposure in nutritional epidemiologic studies.
Affiliation(s)
- Marta Guasch-Ferré
  - Department of Nutrition, Harvard TH Chan School of Public Health, Boston, MA
- Shilpa N Bhupathiraju
  - Department of Nutrition, Harvard TH Chan School of Public Health, Boston, MA
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA
- Frank B Hu
  - Department of Nutrition, Harvard TH Chan School of Public Health, Boston, MA
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA
  - Department of Epidemiology, Harvard TH Chan School of Public Health, Boston, MA
|
17
|
Shah JS, Rai SN, DeFilippis AP, Hill BG, Bhatnagar A, Brock GN. Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies. BMC Bioinformatics 2017; 18:114. [PMID: 28219348 PMCID: PMC5319174 DOI: 10.1186/s12859-017-1547-6] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 09/15/2016] [Accepted: 02/13/2017] [Indexed: 12/22/2022]
Abstract
BACKGROUND High-throughput metabolomics makes it possible to measure the relative abundances of numerous metabolites in biological samples, which is useful to many areas of biomedical research. However, missing values (MVs) in metabolomics datasets are common and can arise for both technical and biological reasons. Typically, such MVs are substituted by a minimum value, which may lead to different results in downstream analyses. RESULTS Here we present a modified version of the K-nearest neighbor (KNN) approach which accounts for truncation at the minimum value, i.e., KNN truncation (KNN-TN). We compare imputation results based on KNN-TN with results from other KNN approaches such as KNN based on correlation (KNN-CR) and KNN based on Euclidean distance (KNN-EU). Our approach assumes that the data follow a truncated normal distribution with the truncation point at the limit of detection (LOD). The effectiveness of each approach was analyzed by the root mean square error (RMSE) measure as well as the metabolite list concordance index (MLCI) for influence on downstream statistical testing. Through extensive simulation studies and application to three real data sets, we show that KNN-TN has lower RMSE values compared to the other two KNN procedures as well as simpler imputation methods based on substituting missing values with the metabolite mean, zero values, or the LOD. MLCI values between KNN-TN and KNN-EU were roughly equivalent, and superior to the other four methods in most cases. CONCLUSION Our findings demonstrate that KNN-TN generally has improved performance in imputing the missing values of the different datasets compared to KNN-CR and KNN-EU when missingness arises from missing at random combined with an LOD. The results shown in this study are in the field of metabolomics, but this method could be applicable to any high-throughput technology in which values are missing due to an LOD.
Affiliation(s)
- Jasmit S Shah
  - Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY, 40202, USA
  - Department of Medicine, Division of Cardiovascular Medicine, Diabetes and Obesity Center, University of Louisville, Louisville, KY, 40202, USA
- Shesh N Rai
  - Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY, 40202, USA
- Andrew P DeFilippis
  - Department of Medicine, Division of Cardiovascular Medicine, Diabetes and Obesity Center, University of Louisville, Louisville, KY, 40202, USA
- Bradford G Hill
  - Department of Medicine, Division of Cardiovascular Medicine, Diabetes and Obesity Center, University of Louisville, Louisville, KY, 40202, USA
- Aruni Bhatnagar
  - Department of Medicine, Division of Cardiovascular Medicine, Diabetes and Obesity Center, University of Louisville, Louisville, KY, 40202, USA
- Guy N Brock
  - Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY, 40202, USA
  - Present Affiliation: Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA
|
18
|
Metabolomic Profiling of the Malaria Box Reveals Antimalarial Target Pathways. Antimicrob Agents Chemother 2016; 60:6635-6649. [PMID: 27572391 DOI: 10.1128/aac.01224-16] [Citation(s) in RCA: 100] [Impact Index Per Article: 12.5] [Received: 06/10/2016] [Accepted: 08/16/2016] [Indexed: 12/11/2022]
Abstract
The threat of widespread drug resistance to frontline antimalarials has renewed the urgency for identifying inexpensive chemotherapeutic compounds that are effective against Plasmodium falciparum, the parasite species responsible for the greatest number of malaria-related deaths worldwide. To aid in the fight against malaria, a recent extensive screening campaign has generated thousands of lead compounds with low micromolar activity against blood stage parasites. A subset of these leads has been compiled by the Medicines for Malaria Venture (MMV) into a collection of structurally diverse compounds known as the MMV Malaria Box. Currently, little is known regarding the activity of these Malaria Box compounds on parasite metabolism during intraerythrocytic development, and a majority of the targets for these drugs have yet to be defined. Here we interrogated the in vitro metabolic effects of 189 drugs (including 169 of the drug-like compounds from the Malaria Box) using ultra-high-performance liquid chromatography-mass spectrometry (UHPLC-MS). The resulting metabolic fingerprints provide information on the parasite biochemical pathways affected by pharmacologic intervention and offer a critical blueprint for selecting and advancing lead compounds as next-generation antimalarial drugs. Our results reveal several major classes of metabolic disruption, which allow us to predict the mode of action (MoA) for many of the Malaria Box compounds. We anticipate that future combination therapies will be greatly informed by these results, allowing for the selection of appropriate drug combinations that simultaneously target multiple metabolic pathways, with the aim of eliminating malaria and forestalling the expansion of drug-resistant parasites in the field.
|
19
|
Taylor SL, Ruhaak LR, Weiss RH, Kelly K, Kim K. Multivariate two-part statistics for analysis of correlated mass spectrometry data from multiple biological specimens. Bioinformatics 2016; 33:17-25. [PMID: 27592710 DOI: 10.1093/bioinformatics/btw578] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/19/2016] [Revised: 08/30/2016] [Accepted: 08/31/2016] [Indexed: 11/14/2022]
Abstract
MOTIVATION High-throughput mass spectrometry (MS) is now being used to profile small molecular compounds across multiple biological sample types from the same subjects with the goal of leveraging information across biospecimens. Multivariate statistical methods that combine information from all biospecimens could be more powerful than the usual univariate analyses. However, missing values are common in MS data, and imputation can impact between-biospecimen correlation and multivariate analysis results. RESULTS We propose two multivariate two-part statistics that accommodate missing values and combine data from all biospecimens to identify differentially regulated compounds. Statistical significance is determined using a multivariate permutation null distribution. Relative to univariate tests, the multivariate procedures detected more significant compounds in three biological datasets. In a simulation study, we showed that multi-biospecimen testing procedures were more powerful than single-biospecimen methods when compounds are differentially regulated in multiple biospecimens, but univariate methods can be more powerful if compounds are differentially regulated in only one biospecimen. AVAILABILITY AND IMPLEMENTATION We provide R functions to implement and illustrate our method as supplementary information. CONTACT sltaylor@ucdavis.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
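As background for the two-part idea, the sketch below shows a simplified univariate two-part statistic with a permutation null for a single compound in a single biospecimen. It is not the authors' multivariate statistic or their R code: the function names `two_part_statistic` and `permutation_p_value`, the choice of a pooled proportion z-score plus a Welch-type t-statistic, the number of permutations, and the toy data are all assumptions for illustration. Both groups are assumed to be non-empty.

```python
import math
import random


def two_part_statistic(group_a, group_b):
    """Squared proportion z-score plus squared t-statistic; None marks a missing value."""
    obs_a = [v for v in group_a if v is not None]
    obs_b = [v for v in group_b if v is not None]
    n_a, n_b = len(group_a), len(group_b)
    # Part 1: do the groups differ in the proportion of observed values?
    p_a, p_b = len(obs_a) / n_a, len(obs_b) / n_b
    p = (len(obs_a) + len(obs_b)) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se if se > 0 else 0.0
    # Part 2: do the observed intensities differ in mean (Welch-type t)?
    t = 0.0
    if len(obs_a) > 1 and len(obs_b) > 1:
        m_a, m_b = sum(obs_a) / len(obs_a), sum(obs_b) / len(obs_b)
        v_a = sum((x - m_a) ** 2 for x in obs_a) / (len(obs_a) - 1)
        v_b = sum((x - m_b) ** 2 for x in obs_b) / (len(obs_b) - 1)
        denom = math.sqrt(v_a / len(obs_a) + v_b / len(obs_b))
        t = (m_a - m_b) / denom if denom > 0 else 0.0
    return z ** 2 + t ** 2


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Permutation p-value: shuffle group labels and recompute the statistic."""
    rng = random.Random(seed)
    observed = two_part_statistic(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_stat = two_part_statistic(pooled[:len(group_a)], pooled[len(group_a):])
        if perm_stat >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data: cases have more missing values and lower observed intensities.
cases = [None, None, None, 1.2, 1.0, 1.4, None, 1.1]
controls = [2.0, 2.4, 1.9, None, 2.2, 2.1, 2.3, 2.5]
p_value = permutation_p_value(cases, controls)
print(f"two-part permutation p-value: {p_value:.4f}")
```

The statistic uses missing values as information (through the proportion part) instead of imputing them, which is the motivation stated in the abstract. Extending it across biospecimens, as the authors do, would require combining such statistics while respecting between-biospecimen correlation.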
Affiliation(s)
- Sandra L Taylor
  - Division of Biostatistics, Department of Public Health Sciences, University of California Davis, CA, 95616, USA
- L Renee Ruhaak
  - Department of Clinical Chemistry and Laboratory Medicine, Leiden University Medical Center, Leiden, The Netherlands
- Karen Kelly
  - Division of Hematology and Oncology, Department of Internal Medicine School of Medicine, University of California, Davis, CA 95616, USA
- Kyoungmi Kim
  - Division of Biostatistics, Department of Public Health Sciences, University of California Davis, CA, 95616, USA
|