1
Gu Z, El Bouhaddani S, Pei J, Houwing-Duistermaat J, Uh HW. Statistical integration of two omics datasets using GO2PLS. BMC Bioinformatics 2021; 22:131. [PMID: 33736604 PMCID: PMC7977326 DOI: 10.1186/s12859-021-03958-3] [Received: 07/29/2020] [Accepted: 01/06/2021]
Abstract
BACKGROUND Nowadays, multiple omics data are measured on the same samples in the belief that these different omics datasets represent various aspects of the underlying biological systems. Integrating these omics datasets will facilitate the understanding of the systems. For this purpose, various methods have been proposed, such as Partial Least Squares (PLS), decomposing two datasets into joint and residual subspaces. Since omics data are heterogeneous, the joint components in PLS will contain variation specific to each dataset. To account for this, Two-way Orthogonal Partial Least Squares (O2PLS) captures the heterogeneity by introducing orthogonal subspaces and better estimates the joint subspaces. However, the latent components spanning the joint subspaces in O2PLS are linear combinations of all variables, while it might be of interest to identify a small subset relevant to the research question. To obtain sparsity, we extend O2PLS to Group Sparse O2PLS (GO2PLS) that utilizes biological information on group structures among variables and performs group selection in the joint subspace.
RESULTS The simulation study showed that introducing sparsity improved the feature selection performance. Furthermore, incorporating group structures increased robustness of the feature selection procedure. GO2PLS performed optimally in terms of accuracy of joint score estimation, joint loading estimation, and feature selection. We applied GO2PLS to datasets from two studies: TwinsUK (a population study) and CVON-DOSIS (a small case-control study). In the first, we incorporated biological information on the group structures of the methylation CpG sites when integrating the methylation dataset with the IgG glycomics data. The targeted genes of the selected methylation groups turned out to be relevant to the immune system, in which the IgG glycans play important roles. In the second, we selected regulatory regions and transcripts that explained the covariance between regulomics and transcriptomics data. The corresponding genes of the selected features appeared to be relevant to heart muscle disease.
CONCLUSIONS GO2PLS integrates two omics datasets to help understand the underlying system that involves both omics levels. It incorporates external group information and performs group selection, resulting in a small subset of features that best explain the relationship between two omics datasets for better interpretability.
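As an illustration of the group selection idea mentioned in this abstract, the following is a minimal Python sketch of a group soft-thresholding operator of the kind used with group-lasso style penalties. It is not taken from the GO2PLS paper or its software: the function name `group_soft_threshold`, the penalty value `lam`, the group labels, and the toy weights are all assumptions made for illustration, and the actual GO2PLS penalty (for example, any weighting by group size) may differ.

```python
import math


def group_soft_threshold(weights, groups, lam):
    """Shrink each group's weights toward zero by `lam` in Euclidean norm.

    Groups whose norm is at most `lam` are set to zero, i.e. deselected.
    `groups[i]` is the (hypothetical) group label of variable i.
    """
    # Accumulate the squared norm of each group's sub-vector.
    sq_norm = {}
    for w, g in zip(weights, groups):
        sq_norm[g] = sq_norm.get(g, 0.0) + w * w

    # Shrinkage factor per group: max(0, 1 - lam / ||w_g||).
    scale = {}
    for g, s in sq_norm.items():
        norm = math.sqrt(s)
        scale[g] = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0

    return [w * scale[g] for w, g in zip(weights, groups)]


# Toy loading vector: two variables per group (labels are made up).
weights = [3.0, 4.0, 0.1, -0.1]
groups = ["geneA", "geneA", "geneB", "geneB"]
sparse = group_soft_threshold(weights, groups, lam=1.0)
selected = sorted({g for w, g in zip(sparse, groups) if w != 0.0})
```

In this toy input the first group has norm 5 and is shrunk by a factor of 0.8, while the second group has norm below `lam` and is removed as a whole, so variables are kept or dropped per group rather than one by one.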
Affiliation(s)
- Zhujie Gu
  - Department of Data Science and Biostatistics, UMC Utrecht, div. Julius Centre, Huispost Str. 6.131, 3508 GA, Utrecht, The Netherlands
- Said El Bouhaddani
  - Department of Data Science and Biostatistics, UMC Utrecht, div. Julius Centre, Huispost Str. 6.131, 3508 GA, Utrecht, The Netherlands
- Jiayi Pei
  - Department of Cardiology, UMC Utrecht, Huispost Str. 6.131, 3508 GA, Utrecht, The Netherlands
- Jeanine Houwing-Duistermaat
  - Department of Data Science and Biostatistics, UMC Utrecht, div. Julius Centre, Huispost Str. 6.131, 3508 GA, Utrecht, The Netherlands; Department of Statistics, University of Leeds, LS2 9JT, Leeds, UK; Department of Statistical Sciences, University of Bologna, Bologna, Italy
- Hae-Won Uh
  - Department of Data Science and Biostatistics, UMC Utrecht, div. Julius Centre, Huispost Str. 6.131, 3508 GA, Utrecht, The Netherlands
2
Fuady AM, El Bouhaddani S, Uh HW, Houwing-Duistermaat J. Estimation of the effect of surrogate multi-omic biomarkers. Theor Biol Forum 2021; 114:59-73. [PMID: 35502731 DOI: 10.19272/202111402006]
Abstract
Multiple technologies exist that measure the same omics data set but are based on different aspects of the molecules. In practice, studies use different technologies and therefore have different biomarkers. An example is the glycan age index, which is constructed from three different ultra-performance liquid chromatography (UPLC) IgG glycans and is a biomarker for biological age. A second technology is liquid chromatography-mass spectrometry (LCMS). To estimate the effect of a biomarker on an outcome variable, two issues need to be addressed. Firstly, a measurement error model is needed to map one technology to the other using a calibration study. Here, we consider two approaches: one based on the chemical properties of the two technologies and one based on estimating this relationship using O2PLS. Secondly, the use of an approximation of the biomarker in the main study needs to be taken into account by means of a regression calibration method. The performance of the two approaches is studied via simulations. The methods are used to estimate the relationship between glycan age and menopause. We have data from two cohorts, namely Korcula and Vis. In conclusion, (1) both measurement error models give similar results and suggest that there is an association between the glycan age index and menopause status; (2) the chemical mapping approach outperforms O2PLS when the measurement error variance is low, while O2PLS works better when the measurement error variance is larger; (3) statistical efficiency is lost due to the increased noise level caused by adding irrelevant information.
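The two-step logic described in this abstract (map one technology to the other in a calibration study, then use the mapped biomarker in the main study) can be sketched with a basic regression calibration in Python. This is a minimal sketch under strong assumptions, not the authors' procedure: a single biomarker, a linear calibration model, noise-free toy numbers, and hypothetical names (`fit_line`, `calib_surrogate`, `calib_reference`, `main_surrogate`, `main_outcome`). It also omits the uncertainty correction that a full regression calibration analysis would need.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Step 1: calibration study, where both technologies are measured
# on the same samples (toy numbers, made up for illustration).
calib_surrogate = [0.0, 1.0, 2.0, 3.0]
calib_reference = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(calib_surrogate, calib_reference)

# Step 2: main study, where only the surrogate technology is available.
main_surrogate = [0.5, 1.5, 2.5, 4.0]
main_outcome = [6.0, 12.0, 18.0, 27.0]

# Replace the unobserved reference biomarker by its calibrated prediction,
# then regress the outcome on that prediction.
predicted_reference = [a + b * w for w in main_surrogate]
intercept, effect = fit_line(predicted_reference, main_outcome)
```

Because the toy outcome was generated as three times the calibrated biomarker without noise, `effect` recovers a slope of 3 here. With real data, the calibration error propagates into the effect estimate and its standard error.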
Affiliation(s)
- Angga M Fuady (corresponding author)
  - Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
- Said El Bouhaddani
  - Department of Data Science and Biostatistics, UMC Utrecht, div. Julius Centre, Utrecht, The Netherlands
- Hae-Won Uh
  - Department of Data Science and Biostatistics, UMC Utrecht, div. Julius Centre, Utrecht, The Netherlands
- Jeanine Houwing-Duistermaat
  - Department of Data Science and Biostatistics, UMC Utrecht, div. Julius Centre, Utrecht, The Netherlands; Department of Statistics and Alan Turing Institute, University of Leeds, Leeds, United Kingdom
3
Reiding KR, Ruhaak LR, Uh HW, El Bouhaddani S, van den Akker EB, Plomp R, McDonnell LA, Houwing-Duistermaat JJ, Slagboom PE, Beekman M, Wuhrer M. Human Plasma N-glycosylation as Analyzed by Matrix-Assisted Laser Desorption/Ionization-Fourier Transform Ion Cyclotron Resonance-MS Associates with Markers of Inflammation and Metabolic Health. Mol Cell Proteomics 2016; 16:228-242. [PMID: 27932526 DOI: 10.1074/mcp.m116.065250] [Received: 11/02/2016] [Revised: 12/01/2016]
Abstract
Glycosylation is an abundant co- and post-translational protein modification of importance to protein processing and activity. Although not template-defined, glycosylation does reflect the biological state of an organism and is a high-potential biomarker for disease and patient stratification. However, to interpret a complex but informative sample like the total plasma N-glycome, it is important to establish its baseline association with plasma protein levels and systemic processes. Thus far, large-scale studies (n > 200) of the total plasma N-glycome have been performed with methods of chromatographic and electrophoretic separation, which, although being informative, are limited in resolving the structural complexity of plasma N-glycans. MS has the opportunity to contribute additional information on, among others, antennarity, sialylation, and the identity of high-mannose type species.
Here, we have used matrix-assisted laser desorption/ionization (MALDI)-Fourier transform ion cyclotron resonance (FTICR)-MS to study the total plasma N-glycome of 2144 healthy middle-aged individuals from the Leiden Longevity Study, to allow association analysis with markers of metabolic health and inflammation. To achieve this, N-glycans were enzymatically released from their protein backbones, labeled at the reducing end with 2-aminobenzoic acid, and following purification analyzed by negative ion mode intermediate pressure MALDI-FTICR-MS. In doing so, we achieved the relative quantification of 61 glycan compositions, ranging from Hex4HexNAc2 to Hex7HexNAc6dHex1Neu5Ac4, as well as that of 39 glycosylation traits derived thereof. Next to confirming known associations of glycosylation with age and sex by MALDI-FTICR-MS, we report novel associations with C-reactive protein (CRP), interleukin 6 (IL-6), body mass index (BMI), leptin, adiponectin, HDL cholesterol, triglycerides (TG), insulin, gamma-glutamyl transferase (GGT), alanine aminotransferase (ALT), and smoking. Overall, the bisection, galactosylation, and sialylation of diantennary species, the sialylation of tetraantennary species, and the size of high-mannose species proved to be important plasma characteristics associated with inflammation and metabolic health.
Affiliation(s)
- Karli R Reiding
  - Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
- L Renee Ruhaak
  - Department of Clinical Chemistry and Laboratory Medicine, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
- Hae-Won Uh
  - Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
- Said El Bouhaddani
  - Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
- Erik B van den Akker
  - Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands; Pattern Recognition & Bioinformatics, Delft University of Technology, 2600 GA Delft, The Netherlands
- Rosina Plomp
  - Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
- Liam A McDonnell
  - Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
- Jeanine J Houwing-Duistermaat
  - Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands; Department of Statistics, University of Leeds, LS2 9JT Leeds, United Kingdom
- P Eline Slagboom
  - Department of Molecular Epidemiology, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
- Marian Beekman
  - Department of Molecular Epidemiology, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
- Manfred Wuhrer
  - Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
4
Abstract
Background Rapid computational and technological developments have made large amounts of omics data available at different biological levels. It is becoming clear that simultaneous data analysis methods are needed for better interpretation and understanding of the underlying systems biology. Different methods have been proposed for this task, among them Partial Least Squares (PLS) related methods. To also deal with orthogonal variation (systematic variation in each dataset that is unrelated to the other dataset), we consider the Two-way Orthogonal PLS (O2PLS): an integrative data analysis method which is capable of modeling systematic variation, while providing more parsimonious models aiding interpretation.
Results A simulation study to assess the performance of O2PLS showed positive results in both low and higher dimensions. More noise (50% of the data) only affected the systematic part estimates. A data analysis was conducted using data on metabolomics and transcriptomics from a large Finnish cohort (DILGOM). A previous sequential study, using the same data, showed significant correlations between the Lipo-Leukocyte (LL) module and lipoprotein metabolites. The O2PLS results were in agreement with these findings, identifying almost the same set of co-varying variables. Moreover, our integrative approach identified other associative genes and metabolites, while taking into account systematic variation in the data. Including orthogonal components enhanced overall fit, but the orthogonal variation was difficult to interpret.
Conclusions Simulations showed that the O2PLS estimates were close to the true parameters in both low and higher dimensions. In the presence of more noise (50%), the orthogonal part estimates could not distinguish well between joint and unique variation. The joint estimates were not systematically affected. Simultaneous analysis with O2PLS on metabolome and transcriptome data showed that the LL module, together with VLDL and HDL metabolites, was important for the relation between the metabolome and the transcriptome. This is in agreement with an earlier study. In addition, more genes and metabolites were identified as important for the joint covariation.
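To make the notion of joint covariation in this abstract more concrete, here is a minimal Python sketch of the first step commonly described for PLS-type methods such as O2PLS: taking the leading singular vector pair of the cross-product matrix X^T Y as joint weights. It is only a sketch under stated assumptions. The power iteration stands in for a full singular value decomposition, the helper names (`transpose`, `matvec`, `unit`, `joint_weights`) and the noise-free toy data are made up, and the orthogonal (dataset-specific) filtering step that distinguishes O2PLS from PLS is omitted.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def unit(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def joint_weights(x, y, n_iter=100):
    """Leading singular vector pair of X^T Y, found by power iteration."""
    x_cols = transpose(x)
    y_cols = transpose(y)
    # Cross-product matrix M = X^T Y with shape (p, q).
    m = [[sum(a * b for a, b in zip(xc, yc)) for yc in y_cols] for xc in x_cols]
    mt = transpose(m)
    # Arbitrary start vector, assumed not orthogonal to the solution.
    v = unit([1.0] * len(y_cols))
    for _ in range(n_iter):
        u = unit(matvec(m, v))
        v = unit(matvec(mt, u))
    return u, v


# Toy data: both datasets are driven by the same latent scores.
scores = [1.0, -1.0, 2.0, 0.0]
x = [[t * w for w in (1.0, 2.0, 2.0)] for t in scores]  # 4 samples, 3 variables
y = [[t * c for c in (3.0, 4.0)] for t in scores]       # 4 samples, 2 variables
w_x, w_y = joint_weights(x, y)
```

For this rank-one toy example the weights recover the normalized generating loadings, roughly (1/3, 2/3, 2/3) for `x` and (0.6, 0.8) for `y`. With real omics data, noise and dataset-specific variation would distort these weights, which is the situation the orthogonal components are meant to address.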
Affiliation(s)
- Said El Bouhaddani
  - Department of Medical Statistics and Bioinformatics, LUMC, Albinusdreef 2, 2300 RC Leiden, The Netherlands
- Jeanine Houwing-Duistermaat
  - Department of Medical Statistics and Bioinformatics, LUMC, Albinusdreef 2, 2300 RC Leiden, The Netherlands
- Perttu Salo
  - National Institute for Health and Welfare (THL), Mannerheimintie 166, FI-00271 Helsinki, Finland
- Markus Perola
  - National Institute for Health and Welfare (THL), Mannerheimintie 166, FI-00271 Helsinki, Finland
- Geurt Jongbloed
  - Department of Statistics, EEMCS, TU Delft, Mekelweg 4, 2628 CD Delft, The Netherlands
- Hae-Won Uh
  - Department of Medical Statistics and Bioinformatics, LUMC, Albinusdreef 2, 2300 RC Leiden, The Netherlands