1
Bramer LM, Nakayasu ES, Flores JE, Van Eyk JE, MacCoss MJ, Parikh HM, Metz TO, Webb-Robertson BJM. Data from a multi-year targeted proteomics study of a longitudinal birth cohort of type 1 diabetes. Sci Data 2025; 12:112. [PMID: 39833216 PMCID: PMC11747092 DOI: 10.1038/s41597-024-04249-1] [Received: 08/09/2024] [Accepted: 12/05/2024] [Indexed: 01/22/2025] [Open Access]
Abstract
The deployment of liquid chromatography-mass spectrometry-based plasma proteomics experiments in a large cohort is sparse, leading to a lack of data available for benchmarking, method development or validation. Comprised of 6,426 plasma analyses, The Environmental Determinants of Diabetes in the Young (TEDDY) proteomics validation study constitutes one of the largest targeted proteomics experiments in the literature to date. The proteomics data from this study were generated over the course of 2.5 years from over 900 study subjects, each providing up to 29 longitudinal samples. The data also includes 916 quality control samples. The targeted mass spectrometry assay was comprised of 694 peptides mapping to 167 proteins and the panel was measured in each subject and QC sample. The targeted proteomic dataset presented here can be used as a resource for new computational methods development, such as for batch correction, as well as for benchmarking and comparing the performance of different methods/tools.
Grants
- R01 DK138335 NIDDK NIH HHS
- U01 KD127786-S1 U.S. Department of Health & Human Services | NIH | National Institute of Diabetes and Digestive and Kidney Diseases (National Institute of Diabetes & Digestive & Kidney Diseases)
- U01 DK127786 NIDDK NIH HHS
- U.S. Department of Health & Human Services | NIH | National Institute of Diabetes and Digestive and Kidney Diseases (National Institute of Diabetes & Digestive & Kidney Diseases)
- U.S. Department of Health & Human Services | NIH | Office of Extramural Research, National Institutes of Health (OER)
- National Institutes of Health: U01 DK63829, U01 DK63861, U01 DK63821, U01 DK63865, U01 DK63863, U01 DK63836, U01 DK63790, UC4 DK63829, UC4 DK63861, UC4 DK63821, UC4 DK63865, UC4 DK63863, UC4 DK63836, UC4 DK95300, UC4 DK100238, UC4 DK106955, UC4 DK112243, UC4 DK117483, U01 DK124166, U01 DK128847, and Contract No. HHSN267200700014C from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of Allergy and Infectious Diseases (NIAID), Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), National Institute of Environmental Health Sciences (NIEHS), Centers for Disease Control and Prevention (CDC), and Breakthrough T1D (formerly JDRF). This work is supported in part by the NIH/NCATS Clinical and Translational Science Awards to the University of Florida (UL1 TR000064) and the University of Colorado (UL1 TR002535).
Affiliation(s)
- Lisa M Bramer
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA
- Ernesto S Nakayasu
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA
- Javier E Flores
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA
- Jennifer E Van Eyk
  - Department of Cardiology, Advanced Clinical Biosystem Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Michael J MacCoss
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Hemang M Parikh
  - Health Informatics Institute, University of South Florida, Tampa, FL, USA
- Thomas O Metz
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA
2
Schumann Y, Gocke A, Neumann JE. Computational Methods for Data Integration and Imputation of Missing Values in Omics Datasets. Proteomics 2025; 25:e202400100. [PMID: 39740174 DOI: 10.1002/pmic.202400100] [Received: 06/28/2024] [Revised: 11/08/2024] [Accepted: 11/26/2024] [Indexed: 01/02/2025]
Abstract
Molecular profiling of different omic-modalities (e.g., DNA methylomics, transcriptomics, proteomics) in biological systems represents the basis for research and clinical decision-making. Measurement-specific biases, so-called batch effects, often hinder the integration of independently acquired datasets, and missing values further hamper the applicability of typical data processing algorithms. In addition to careful experimental design, well-defined standards in data acquisition and data exchange, the alleviation of these phenomena particularly requires a dedicated data integration and preprocessing pipeline. This review aims to give a comprehensive overview of computational methods for data integration and missing value imputation for omic data analyses. We provide formal definitions for missing value mechanisms and propose a novel statistical taxonomy for batch effects, especially in the presence of missing data. Based on an automated document search and systematic literature review, we describe 32 distinct data integration methods from five main methodological categories, as well as 37 algorithms for missing value imputation from five separate categories. Additionally, this review highlights multiple quantitative evaluation methods to aid researchers in selecting a suitable set of methods for their work. Finally, this work provides an integrated discussion of the relevance of batch effects and missing values in omics with corresponding method recommendations. We then propose a comprehensive three-step workflow from the study conception to final data analysis and deduce perspectives for future research. Eventually, we present a comprehensive flow chart as well as exemplary decision trees to aid practitioners in the selection of specific approaches for imputation and data integration in their studies.
Affiliation(s)
- Yannis Schumann
  - IT-Department, Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany
- Antonia Gocke
  - Center for Molecular Neurobiology (ZMNH), University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
  - Core Facility Mass Spectrometric Proteomics, University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
- Julia E Neumann
  - Center for Molecular Neurobiology (ZMNH), University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
  - Institute of Neuropathology, University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
3
Marković S, Jadranin M, Miladinović Z, Gavrilović A, Avramović N, Takić M, Tasic L, Tešević V, Mandić B. LC-HRMS Lipidomic Fingerprints in Serbian Cohort of Schizophrenia Patients. Int J Mol Sci 2024; 25:10266. [PMID: 39408605 PMCID: PMC11476971 DOI: 10.3390/ijms251910266] [Received: 08/06/2024] [Revised: 09/12/2024] [Accepted: 09/15/2024] [Indexed: 10/20/2024] [Open Access]
Abstract
Schizophrenia (SCH) is a major mental illness that causes impaired cognitive function and long-term disability, so the requirements for reliable biomarkers for early diagnosis and therapy of SCH are essential. The objective of this work was an untargeted lipidomic study of serum samples from a Serbian cohort including 30 schizophrenia (SCH) patients and 31 non-psychiatric control (C) individuals by applying liquid chromatography (LC) coupled with high-resolution mass spectrometry (HRMS) and chemometric analyses. Principal component analysis (PCA) of all samples indicated no clear separation between SCH and C groups but indicated clear gender separation in the C group. Multivariate statistical analyses (PCA and orthogonal partial least squares discriminant analysis (OPLS-DA)) of gender-differentiated SCH and C groups established forty-nine differential lipids in the differentiation of male SCH (SCH-M) patients and male controls (C-M), while sixty putative biomarkers were identified in the differentiation of female SCH patients (SCH-F) and female controls (C-F). Lipidomic study of gender-differentiated groups, between SCH-M and C-M and between SCH-F and C-F groups, confirmed that lipids metabolism was altered and the content of the majority of the most affected lipid classes, glycerophospholipids (GP), sphingolipids (SP), glycerolipids (GL) and fatty acids (FA), was decreased compared to controls. From differential lipid metabolites with higher content in both SCH-M and SCH-F patients groups compared to their non-psychiatric controls, there were four common lipid molecules: ceramides Cer 34:2, and Cer 34:1, lysophosphatidylcholine LPC 16:0 and triacylglycerol TG 48:2. Significant alteration of lipids metabolism confirmed the importance of metabolic pathways in the pathogenesis of schizophrenia.
Affiliation(s)
- Suzana Marković
  - University of Belgrade—Faculty of Chemistry, Studentski trg 12–16, 11000 Belgrade, Serbia
  - University of Belgrade—Faculty of Medicine, Institute of Forensic Medicine, Deligradska 31a, 11000 Belgrade, Serbia
- Milka Jadranin
  - University of Belgrade—Institute of Chemistry, Technology and Metallurgy, Department of Chemistry, Njegoševa 12, 11000 Belgrade, Serbia
- Zoran Miladinović
  - Institute of General and Physical Chemistry, Studentski trg 12–16, 11158 Belgrade, Serbia
- Aleksandra Gavrilović
  - Special Hospital for Psychiatric Diseases “Kovin”, Cara Lazara 253, 26220 Kovin, Serbia
- Nataša Avramović
  - University of Belgrade—Faculty of Medicine, Institute of Medical Chemistry, Višegradska 26, 11000 Belgrade, Serbia
- Marija Takić
  - University of Belgrade—Institute for Medical Research, National Institute of Republic of Serbia, Center of Research Excellence for Nutrition and Metabolism, Group for Nutrition and Metabolism, Tadeuša Košćuška 1, 11000 Belgrade, Serbia
- Ljubica Tasic
  - Institute of Chemistry, Organic Chemistry Department, Universidade Estadual de Campinas, UNICAMP, Campinas 13083-970, SP, Brazil
- Vele Tešević
  - University of Belgrade—Faculty of Chemistry, Studentski trg 12–16, 11000 Belgrade, Serbia
- Boris Mandić
  - University of Belgrade—Faculty of Chemistry, Studentski trg 12–16, 11000 Belgrade, Serbia
4
Gong Y, Ding W, Wang P, Wu Q, Yao X, Yang Q. Evaluating Machine Learning Methods of Analyzing Multiclass Metabolomics. J Chem Inf Model 2023; 63:7628-7641. [PMID: 38079572 DOI: 10.1021/acs.jcim.3c01525] [Indexed: 12/26/2023]
Abstract
Multiclass metabolomic studies have become popular for revealing the differences in multiple stages of complex diseases, various lifestyles, or the effects of specific treatments. In multiclass metabolomics, there are multiple data manipulation steps for analyzing raw data, which consist of data filtering, the imputation of missing values, data normalization, marker identification, sample separation, classification, and so on. In each step, several to dozens of machine learning methods can be chosen for the given data set, with potentially hundreds or thousands of method combinations in the whole data processing chain. Therefore, a clear understanding of these machine learning methods is helpful for selecting an appropriate method combination for obtaining stable and reliable analytical results of specific data. However, there has rarely been an overall introduction or evaluation of these methods based on multiclass metabolomic data. Herein, detailed descriptions of these machine learning methods in multiple data manipulation steps are reviewed. Moreover, an assessment of these methods was performed using a benchmark data set for multiclass metabolomics. First, 12 imputation methods for imputing missing values were evaluated based on the PSS (Procrustes statistical shape analysis) and NRMSE (normalized root-mean-square error) values. Second, 17 normalization methods for processing multiclass metabolomic data were evaluated by applying the PMAD (pooled median absolute deviation) value. Third, different methods of identifying markers of multiclass metabolomics were evaluated based on the CWrel (relative weighted consistency) value. Fourth, nine classification methods for constructing multiclass models were assessed using the AUC (area under the curve) value. Performance evaluations of machine learning methods are highly recommended to select the most appropriate method combination before performing the final analysis of the given data. 
Overall, detailed descriptions and evaluation of various machine learning methods are expected to improve analyses of multiclass metabolomic data.
Affiliation(s)
- Yaguo Gong
  - State Key Laboratory of Quality Research in Chinese Medicine, School of Pharmacy, Macau University of Science and Technology, Macau 999078, China
- Wei Ding
  - State Key Laboratory of Quality Research in Chinese Medicine, School of Pharmacy, Macau University of Science and Technology, Macau 999078, China
- Panpan Wang
  - College of Chemistry and Pharmaceutical Engineering, Huanghuai University, Zhumadian 463000, China
- Qibiao Wu
  - State Key Laboratory of Quality Research in Chinese Medicine, School of Pharmacy, Macau University of Science and Technology, Macau 999078, China
- Xiaojun Yao
  - Centre for Artificial Intelligence Driven Drug Discovery, Faculty of Applied Sciences, Macao Polytechnic University, Macao 999078, China
- Qingxia Yang
  - Zhejiang Provincial Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
  - Department of Bioinformatics, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China