1
Kumler W, Hazelton BJ, Ingalls AE. Picky with peakpicking: assessing chromatographic peak quality with simple metrics in metabolomics. BMC Bioinformatics 2023; 24:404. [PMID: 37891484 PMCID: PMC10612323 DOI: 10.1186/s12859-023-05533-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Accepted: 10/16/2023] [Indexed: 10/29/2023] Open
Abstract
BACKGROUND: Chromatographic peakpicking continues to represent a significant bottleneck in automated LC-MS workflows. Uncontrolled false discovery rates and the lack of manually-calibrated quality metrics require researchers to visually evaluate individual peaks, requiring large amounts of time and breaking replicability. This problem is exacerbated in noisy environmental datasets and for novel separation methods such as hydrophilic interaction columns in metabolomics, creating a demand for a simple, intuitive, and robust metric of peak quality.
RESULTS: Here, we manually labeled four HILIC oceanographic particulate metabolite datasets to assess the performance of individual peak quality metrics. We used these datasets to construct a predictive model calibrated to the likelihood that visual inspection by an MS expert would include a given mass feature in the downstream analysis. We implemented two novel peak quality metrics, a custom signal-to-noise metric and a test of similarity to a bell curve, both calculated from the raw data in the extracted ion chromatogram, and found that these outperformed existing measurements of peak quality. A simple logistic regression model built on two metrics reduced the fraction of false positives in the analysis from 70-80% down to 1-5% and showed minimal overfitting when applied to novel datasets. We then explored the implications of this quality thresholding on the conclusions obtained by the downstream analysis and found that while only 10% of the variance in the dataset could be explained by depth in the default output from the peakpicker, approximately 40% of the variance was explained when restricted to high-quality peaks alone.
CONCLUSIONS: We conclude that the poor performance of peakpicking algorithms significantly reduces the power of both univariate and multivariate statistical analyses to detect environmental differences. We demonstrate that simple models built on intuitive metrics and derived from the raw data are more robust and can outperform more complex models when applied to new data. Finally, we show that in properly curated datasets, depth is a major driver of variability in the marine microbial metabolome and identify several interesting metabolite trends for future investigation.
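As a rough illustration of the two metrics named in this abstract, the sketch below scores an extracted ion chromatogram and feeds the scores to a logistic function. The definitions are assumptions made for illustration, not the authors' implementation: bell-curve similarity is taken as the Pearson correlation between the intensities and an idealized Gaussian spanning the same scans, signal-to-noise is the intensity range divided by the standard deviation of the residuals after scaling that Gaussian to the data, and the logistic coefficients are untrained placeholders. All function names are hypothetical.

```python
import math
import statistics


def _ideal_bell(n):
    # Assumption: the peak window spans roughly +/- 3 standard deviations.
    center = (n - 1) / 2
    width = n / 6
    return [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(n)]


def _fit(xs, ys):
    """Return (pearson_r, residuals) for the least-squares line ys ~ a + b * xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    residuals = [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]
    r = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0
    return r, residuals


def peak_metrics(intensities):
    """Return (bell-curve similarity, signal-to-noise) for one EIC trace."""
    if len(intensities) < 3:
        raise ValueError("need at least three scans")
    similarity, residuals = _fit(_ideal_bell(len(intensities)), intensities)
    noise = statistics.pstdev(residuals)
    signal = max(intensities) - min(intensities)
    if noise > 0:
        snr = signal / noise
    else:
        snr = math.inf if signal > 0 else 0.0
    return similarity, snr


def peak_probability(similarity, snr, coefs=(-4.0, 6.0, 1.0)):
    """Logistic model on the two metrics; coefs are illustrative placeholders."""
    b0, b1, b2 = coefs
    z = b0 + b1 * similarity + b2 * math.log1p(snr)
    return 1.0 / (1.0 + math.exp(-z))
```

In practice the coefficients would be fitted to manually labeled peaks, which is the calibration step the abstract describes; the placeholder values here only show how the two metrics combine into a single probability.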
Affiliation(s)
- William Kumler
  - School of Oceanography, University of Washington, Seattle, WA, 98195, USA
- Bryna J Hazelton
  - eScience Institute, University of Washington, Seattle, WA, 98195, USA
  - Department of Physics, University of Washington, Seattle, WA, 98195, USA
- Anitra E Ingalls
  - School of Oceanography, University of Washington, Seattle, WA, 98195, USA
2
Guo J, Huan T. Mechanistic Understanding of the Discrepancies between Common Peak Picking Algorithms in Liquid Chromatography–Mass Spectrometry-Based Metabolomics. Anal Chem 2023; 95:5894-5902. [PMID: 36972195 DOI: 10.1021/acs.analchem.2c04887] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 03/29/2023]
Abstract
Inconsistent peak picking outcomes are a critical concern in processing liquid chromatography-mass spectrometry (LC-MS)-based untargeted metabolomics data. This work systematically studied the mechanisms behind the discrepancies among five commonly used peak picking algorithms, including CentWave in XCMS, linear-weighted moving average in MS-DIAL, automated data analysis pipeline (ADAP) in MZmine 2, Savitzky-Golay in El-MAVEN, and FeatureFinderMetabo in OpenMS. We first collected 10 public metabolomics datasets representing various LC-MS analytical conditions. We then incorporated several novel strategies to (i) acquire the optimal peak picking parameters of each algorithm for a fair comparison, (ii) automatically recognize false metabolic features with poor chromatographic peak shapes, and (iii) evaluate the real metabolic features that are missed by the algorithms. By applying these strategies, we compared the true, false, and undetected metabolic features in each data processing outcome. Our results show that linear-weighted moving average consistently outperforms the other peak picking algorithms. To facilitate a mechanistic understanding of the differences, we proposed six peak attributes: ideal slope, sharpness, peak height, mass deviation, peak width, and scan number. We also developed an R program to automatically measure these attributes for detected and undetected true metabolic features. From the results of the 10 datasets, we concluded that four peak attributes, including ideal slope, scan number, peak width, and mass deviation, are critical for the detectability of a peak. For instance, the focus on ideal slope critically hinders the extraction of true metabolic features with low ideal slope scores in linear-weighted moving average, Savitzky-Golay, and ADAP. The relationships between peak picking algorithms and peak attributes were also visualized in a principal component analysis biplot. 
Overall, the clear comparison and explanation of the differences between peak picking algorithms can lead to the design of better peak picking strategies in the future.
3
Houriet J, Vidar WS, Manwill PK, Todd DA, Cech NB. How Low Can You Go? Selecting Intensity Thresholds for Untargeted Metabolomics Data Preprocessing. Anal Chem 2022; 94:17964-17971. [PMID: 36516972 DOI: 10.1021/acs.analchem.2c04088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
Abstract
Untargeted mass spectrometry (MS) metabolomics is an increasingly popular approach for characterizing complex mixtures. Recent studies have highlighted the impact of data preprocessing for determining the quality of metabolomics data analysis. The first step in data processing with untargeted metabolomics requires that signal thresholds be selected for which features (detected ions) are included in the dataset. Analysts face the challenge of knowing where to set these thresholds; setting them too high could mean missing relevant features, but setting them too low could result in a complex and unwieldy dataset. This study compared data interpretation for an example metabolomics dataset when intensity thresholds were set at a range of feature heights. The main observations were that low signal thresholds (1) improved the limit of detection, (2) increased the number of features detected with an associated isotope pattern and/or an MS-MS fragmentation spectrum, and (3) increased the number of in-source clusters and fragments detected for known analytes of interest. When the settings of parameters differing in intensities were applied on a set of 39 samples to discriminate the samples through principal component analyses (PCA), similar results were obtained with both low- and high-intensity thresholds. We conclude that the most information-rich datasets can be obtained by setting low-intensity thresholds. However, in the cases where only a qualitative comparison of samples with PCA is to be performed, it may be sufficient to set high thresholds and thereby reduce the complexity of the data processing and amount of computational time required.
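The trade-off described in this abstract, where raising the intensity threshold shrinks the feature list, can be sketched in a few lines. This is a minimal illustration only: the function name and the feature heights are hypothetical and are not data or code from the study.

```python
def retained_counts(feature_heights, thresholds):
    """Map each intensity threshold to the number of features at or above it."""
    return {t: sum(1 for h in feature_heights if h >= t) for t in thresholds}


# Hypothetical feature heights in arbitrary intensity units.
heights = [5e3, 2e4, 8e4, 3e5, 1e6, 4e6]
counts = retained_counts(heights, thresholds=[1e4, 1e5, 1e6])
```

Comparing the counts across thresholds gives a quick sense of how much of a dataset a given cutoff would discard before any downstream analysis is run.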
Affiliation(s)
- Joelle Houriet
  - Department of Chemistry & Biochemistry, University of North Carolina at Greensboro, Greensboro, North Carolina 27402, United States
- Warren S Vidar
  - Department of Chemistry & Biochemistry, University of North Carolina at Greensboro, Greensboro, North Carolina 27402, United States
- Preston K Manwill
  - Department of Chemistry & Biochemistry, University of North Carolina at Greensboro, Greensboro, North Carolina 27402, United States
- Daniel A Todd
  - Department of Chemistry & Biochemistry, University of North Carolina at Greensboro, Greensboro, North Carolina 27402, United States
- Nadja B Cech
  - Department of Chemistry & Biochemistry, University of North Carolina at Greensboro, Greensboro, North Carolina 27402, United States
4
Barupal DK. Response: Commentary: Data processing thresholds for abundance and sparsity and missed biological insights in an untargeted chemical analysis of blood specimens for exposomics. Front Public Health 2022; 10:1003148. [PMID: 36330107 PMCID: PMC9622927 DOI: 10.3389/fpubh.2022.1003148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2022] [Accepted: 09/28/2022] [Indexed: 01/27/2023] Open
5
Petrick LM, Shomron N. AI/ML-driven advances in untargeted metabolomics and exposomics for biomedical applications. Cell Reports Physical Science 2022; 3:100978. [PMID: 35936554 PMCID: PMC9354369 DOI: 10.1016/j.xcrp.2022.100978] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 06/15/2023]
Abstract
Metabolomics describes a high-throughput approach for measuring a repertoire of metabolites and small molecules in biological samples. One utility of untargeted metabolomics, unbiased global analysis of the metabolome, is to detect key metabolites as contributors to, or readouts of, human health and disease. In this perspective, we discuss how artificial intelligence (AI) and machine learning (ML) have promoted major advances in untargeted metabolomics workflows and facilitated pivotal findings in the areas of disease screening and diagnosis. We contextualize applications of AI and ML to the emerging field of high-resolution mass spectrometry (HRMS) exposomics, which unbiasedly detects endogenous metabolites and exogenous chemicals in human tissue to characterize exposure linked with disease outcomes. We discuss the state of the science and suggest potential opportunities for using AI and ML to improve data quality, rigor, detection, and chemical identification in untargeted metabolomics and exposomics studies.
Affiliation(s)
- Lauren M. Petrick
  - The Bert Strassburger Metabolic Center, Sheba Medical Center, Tel-Hashomer, Israel
  - Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Institute for Exposomics Research, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Noam Shomron
  - Faculty of Medicine, Edmond J. Safra Center for Bioinformatics, Sagol School of Neuroscience, Center for Nanoscience and Nanotechnology, Center for Innovation Laboratories (TILabs), Tel Aviv University, Tel Aviv, Israel
6
Barupal DK, Mahajan P, Fakouri-Baygi S, Wright RO, Arora M, Teitelbaum SL. CCDB: A database for exploring inter-chemical correlations in metabolomics and exposomics datasets. Environment International 2022; 164:107240. [PMID: 35461097 PMCID: PMC9195052 DOI: 10.1016/j.envint.2022.107240] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/01/2022] [Revised: 04/01/2022] [Accepted: 04/08/2022] [Indexed: 05/18/2023]
Abstract
Inter-chemical correlations in metabolomics and exposomics datasets provide valuable information for studying relationships among chemicals reported for human specimens. With an increase in the number of compounds for these datasets, a network graph analysis and visualization of the correlation structure is difficult to interpret. We have developed the Chemical Correlation Database (CCDB), as a systematic catalogue of inter-chemical correlation in publicly available metabolomics and exposomics studies. The database has been provided via an online interface to create single compound-centric views. We have demonstrated various applications of the database to explore: 1) the chemicals from a chemical class such as Per- and Polyfluoroalkyl Substances (PFAS), polycyclic aromatic hydrocarbons (PAHs), polychlorinated biphenyls (PCBs), phthalates and tobacco smoke related metabolites; 2) xenobiotic metabolites such as caffeine and acetaminophen; 3) endogenous metabolites (acyl-carnitines); and 4) unannotated peaks for PFAS. The database has a rich collection of 35 human studies, including the National Health and Nutrition Examination Survey (NHANES) and high-quality untargeted metabolomics datasets. CCDB is supported by a simple, interactive and user-friendly web-interface to retrieve and visualize the inter-chemical correlation data. The CCDB has the potential to be a key computational resource in metabolomics and exposomics facilitating the expansion of our understanding about biological and chemical relationships among metabolites and chemical exposures in the human body. The database is available at www.ccdb.idsl.me site.
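A single compound-centric view of the kind this abstract mentions can be approximated as a ranked list of correlations between one chemical and all others across the same samples. The sketch below is an assumption-laden illustration, not the CCDB implementation: it uses Pearson correlation (the abstract does not state which coefficient the database stores), and the function names and table layout are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no correlation information
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def compound_view(table, compound):
    """Correlate one compound against every other one, strongest first.

    `table` maps a compound name to its measurements across the same samples.
    """
    target = table[compound]
    pairs = [
        (name, pearson(target, values))
        for name, values in table.items()
        if name != compound
    ]
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)
```

Sorting by absolute value keeps strong negative correlations near the top of the view alongside strong positive ones.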
Affiliation(s)
- Dinesh Kumar Barupal
  - Department of Environmental Medicine and Public Health, Institute for Exposomic Research, Icahn School of Medicine at Mount Sinai, 17 E 102nd St, CAM Building, New York 10029, USA
- Priyanka Mahajan
  - Department of Environmental Medicine and Public Health, Institute for Exposomic Research, Icahn School of Medicine at Mount Sinai, 17 E 102nd St, CAM Building, New York 10029, USA
- Sadjad Fakouri-Baygi
  - Department of Environmental Medicine and Public Health, Institute for Exposomic Research, Icahn School of Medicine at Mount Sinai, 17 E 102nd St, CAM Building, New York 10029, USA
- Robert O Wright
  - Department of Environmental Medicine and Public Health, Institute for Exposomic Research, Icahn School of Medicine at Mount Sinai, 17 E 102nd St, CAM Building, New York 10029, USA
- Manish Arora
  - Department of Environmental Medicine and Public Health, Institute for Exposomic Research, Icahn School of Medicine at Mount Sinai, 17 E 102nd St, CAM Building, New York 10029, USA
- Susan L Teitelbaum
  - Department of Environmental Medicine and Public Health, Institute for Exposomic Research, Icahn School of Medicine at Mount Sinai, 17 E 102nd St, CAM Building, New York 10029, USA
7
Fakouri Baygi S, Kumar Y, Barupal DK. IDSL.IPA Characterizes the Organic Chemical Space in Untargeted LC/HRMS Data Sets. J Proteome Res 2022; 21:1485-1494. [PMID: 35579321 DOI: 10.1021/acs.jproteome.2c00120] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/26/2022]
Abstract
Generating comprehensive and high-fidelity metabolomics data matrices from LC/HRMS data remains to be extremely challenging for population-scale large studies (n > 200). Here, we present a new data processing pipeline, the Intrinsic Peak Analysis (IDSL.IPA) R package (https://ipa.idsl.me), to generate such data matrices specifically for organic compounds. The IDSL.IPA pipeline incorporates (1) identifying potential 12C and 13C ion pairs in individual mass spectra; (2) detecting and characterizing chromatographic peaks using a new sensitive and versatile approach to perform mass correction, peak smoothing, baseline development for local noise measurement, and peak quality determination; (3) correcting retention time and cross-referencing peaks from multiple samples by a dynamic retention index marker approach; (4) annotating peaks using a reference database of m/z and retention time; and (5) accelerating data processing using a parallel computation of the peak detection and alignment steps for larger studies. This pipeline has been successfully evaluated for studies ranging from 200 to 1600 samples. By specifically isolating high quality and reliable signals pertaining to carbon-containing compounds in untargeted LC/HRMS data sets from larger studies, IDSL.IPA opens new opportunities for discovering new biological insights in the population-scale metabolomics and exposomics projects. The package is available in the R CRAN repository at https://cran.r-project.org/package=IDSL.IPA.
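Step (1) of the pipeline, finding candidate 12C and 13C ion pairs in a single mass spectrum, can be sketched as a search for peaks separated by the 13C to 12C mass difference (about 1.0033548 Da for a singly charged ion). This is a simplified illustration and not the IDSL.IPA code: the function name, the m/z tolerance, and the rule that the 13C peak must be less intense than its 12C partner are assumptions chosen for the example.

```python
C13_C12_DELTA = 1.0033548  # mass difference between 13C and 12C, in Da


def find_isotope_pairs(peaks, tol=0.002, charge=1):
    """Return (mz_12c, mz_13c) pairs from a list of (mz, intensity) peaks.

    Assumptions for this sketch: a fixed absolute m/z tolerance, a known
    charge state, and a 13C peak that is weaker than its 12C partner.
    """
    ordered = sorted(peaks)
    spacing = C13_C12_DELTA / charge
    pairs = []
    for mz, intensity in ordered:
        target = mz + spacing
        for other_mz, other_intensity in ordered:
            if abs(other_mz - target) <= tol and other_intensity < intensity:
                pairs.append((mz, other_mz))
    return pairs
```

The nested loop is quadratic in the number of peaks, which is acceptable for a sketch; a production pipeline would more likely use a sorted search to stay fast on dense spectra.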
Affiliation(s)
- Sadjad Fakouri Baygi
  - Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, New York 10029, United States
- Yashwant Kumar
  - Non-communicable Diseases Division, Translational Health Science and Technology Institute, Faridabad, Haryana 121001, India
- Dinesh Kumar Barupal
  - Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, New York 10029, United States
8
Keski-Rahkonen P, Robinson O, Alfano R, Plusquin M, Scalbert A. Commentary: Data Processing Thresholds for Abundance and Sparsity and Missed Biological Insights in an Untargeted Chemical Analysis of Blood Specimens for Exposomics. Front Public Health 2022; 9:755837. [PMID: 35111711 PMCID: PMC8801530 DOI: 10.3389/fpubh.2021.755837] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/09/2021] [Accepted: 12/06/2021] [Indexed: 11/13/2022] Open
Affiliation(s)
- Pekka Keski-Rahkonen
  - Nutrition and Metabolism Branch, International Agency for Research on Cancer (IARC/WHO), Lyon, France
- Oliver Robinson
  - Medical Research Council Centre for Environment and Health, School of Public Health, Imperial College London, London, United Kingdom
- Rossella Alfano
  - Medical Research Council Centre for Environment and Health, School of Public Health, Imperial College London, London, United Kingdom
  - Centre for Environmental Sciences, Hasselt University, Hasselt, Belgium
- Michelle Plusquin
  - Centre for Environmental Sciences, Hasselt University, Hasselt, Belgium
- Augustin Scalbert
  - Nutrition and Metabolism Branch, International Agency for Research on Cancer (IARC/WHO), Lyon, France