1. Wang H, Lu H, Sun J, Safo SE. Interpretable deep learning methods for multiview learning. BMC Bioinformatics 2024; 25:69. [PMID: 38350879 DOI: 10.1186/s12859-024-05679-9] [Received: 05/22/2023] [Accepted: 01/29/2024] [Indexed: 02/15/2024] Open
Abstract
BACKGROUND Technological advances have enabled the generation of unique and complementary types of data or views (e.g. genomics, proteomics, metabolomics) and opened up a new era in multiview learning research with the potential to lead to new biomedical discoveries. RESULTS We propose iDeepViewLearn (Interpretable Deep Learning Method for Multiview Learning) to learn nonlinear relationships in data from multiple views while achieving feature selection. iDeepViewLearn combines deep learning flexibility with the statistical benefits of data and knowledge-driven feature selection, giving interpretable results. Deep neural networks are used to learn a view-independent low-dimensional embedding through an optimization problem that minimizes the difference between observed and reconstructed data, while imposing a regularization penalty on the reconstructed data. The normalized Laplacian of a graph is used to model bilateral relationships between variables in each view, thereby encouraging selection of related variables. iDeepViewLearn is tested on simulated data and three real-world datasets for classification, clustering, and reconstruction tasks. For the classification tasks, iDeepViewLearn had classification results competitive with state-of-the-art methods in various settings. For the clustering task, we detected molecular clusters that differed in their 10-year survival rates for breast cancer. For the reconstruction task, we were able to reconstruct handwritten images using a few pixels while achieving competitive classification accuracy. The results of our real data application and simulations with small to moderate sample sizes suggest that iDeepViewLearn may be a useful method for small-sample-size problems compared to other deep learning methods for multiview learning. CONCLUSION iDeepViewLearn is an innovative deep learning model capable of capturing nonlinear relationships between data from multiple views while achieving feature selection. It is fully open source and is freely available at https://github.com/lasandrall/iDeepViewLearn.
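As a rough illustration of the normalized graph Laplacian mentioned in the abstract above, the sketch below builds L = I - D^{-1/2} A D^{-1/2} for a small, made-up variable graph and evaluates the quadratic form w^T L w. The abstract does not state how the Laplacian enters the iDeepViewLearn objective, so the function names, the toy graph, and the use of a quadratic penalty are all assumptions for illustration, not the authors' implementation.

```python
import math


def normalized_laplacian(adj):
    """Return L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix.

    Isolated nodes (degree 0) get an all-zero row and column, a common convention.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            identity = 1.0 if (i == j and deg[i] > 0) else 0.0
            lap[i][j] = identity - inv_sqrt[i] * adj[i][j] * inv_sqrt[j]
    return lap


def laplacian_penalty(lap, w):
    """Quadratic form w^T L w: small when connected variables carry similar scaled weights."""
    n = len(w)
    return sum(w[i] * lap[i][j] * w[j] for i in range(n) for j in range(n))


# Hypothetical graph on three variables: 0 and 1 are connected, 2 is isolated.
adj = [[0, 1, 0],
       [1, 0, 0],
       [0, 0, 0]]
lap = normalized_laplacian(adj)

print(laplacian_penalty(lap, [1.0, 1.0, 0.0]))   # connected variables agree
print(laplacian_penalty(lap, [1.0, -1.0, 0.0]))  # connected variables disagree
```

In this toy case the penalty vanishes when the two connected variables have equal weights and grows when they differ, which is the general sense in which a Laplacian term can encourage related variables to be selected together.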
Affiliation(s)
- Hengkang Wang
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, 55455, USA
- Han Lu
  - Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, 55414, USA
- Ju Sun
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, 55455, USA
- Sandra E Safo
  - Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, 55414, USA.
2. Kunz M, Rott KW, Hurwitz E, Kunisaki K, Sun J, Wilkins KJ, Islam JY, Patel R, Safo SE. The Intersections of COVID-19, HIV, and Race/Ethnicity: Machine Learning Methods to Identify and Model Risk Factors for Severe COVID-19 in a Large U.S. National Dataset. AIDS Behav 2024:10.1007/s10461-024-04266-6. [PMID: 38326668 DOI: 10.1007/s10461-024-04266-6] [Accepted: 01/03/2024] [Indexed: 02/09/2024]
Abstract
We investigate risk factors for severe COVID-19 in persons living with HIV (PWH), including among racialized PWH, using the U.S. population-sampled National COVID Cohort Collaborative (N3C) data released from January 1, 2020 to October 10, 2022. We defined severe COVID-19 as hospitalization with invasive mechanical ventilation, extracorporeal membrane oxygenation, discharge to hospice or death. We used machine learning methods to identify highly ranked, uncorrelated factors predicting severe COVID-19, and used multivariable logistic regression models to assess the associations of these variables with severe COVID-19 in several models, including race-stratified models. There were 3 241 627 individuals with incident COVID-19 cases and 81 549 (2.5%) with severe COVID-19, of which 17 445 incident COVID-19 and 1 020 (5.8%) severe cases were among PWH. The top highly ranked factors of severe COVID-19 were age, congestive heart failure (CHF), dementia, renal disease, sodium concentration, smoking status, and sex. Among PWH, age and sodium concentration were important predictors of COVID-19 severity, and the effect of sodium concentration was more pronounced in Hispanics (aOR 4.11 compared to aOR range: 1.47-1.88 for Black, White, and Other non-Hispanics). Dementia, CHF, and renal disease were associated with higher odds of severe COVID-19 among Black, Hispanic, and Other non-Hispanic PWH, respectively. Our findings suggest that the impact of factors, especially clinical comorbidities, predictive of severe COVID-19 among PWH varies by racialized groups, highlighting a need to account for race and comorbidity burden when assessing the risk of PWH developing severe COVID-19.
Affiliation(s)
- Miranda Kunz
  - Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN, USA
- Kollin W Rott
  - Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN, USA
- Eric Hurwitz
  - Institute of Molecular Medicine, Virginia Commonwealth University, Richmond, VA, USA
- Ken Kunisaki
  - Minneapolis Veterans Affairs Health Care System, Minneapolis, MN, USA
  - Medical School, University of Minnesota, Minneapolis, MN, USA
- Jing Sun
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Kenneth J Wilkins
  - Biostatistics Program, Office of the Director, National Institute of Diabetes & Digestive & Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
- Jessica Y Islam
  - Cancer Epidemiology Program, Center for Immunization and Infection Research in Cancer, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Rena Patel
  - Division of Infectious Diseases, Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Sandra E Safo
  - Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN, USA.
  - Division of Biostatistics and Health Data Science, School of Public Health, University of Minnesota, 2221 University Avenue SE, Suite 200, Minneapolis, MN, USA.
3. Palzer EF, Safo SE. mvlearnR and Shiny App for multiview learning. Bioinform Adv 2024; 4:vbae005. [PMID: 38304121 PMCID: PMC10833139 DOI: 10.1093/bioadv/vbae005] [Received: 07/06/2023] [Revised: 11/20/2023] [Accepted: 01/14/2024] [Indexed: 02/03/2024]
Abstract
Summary The package mvlearnR and accompanying Shiny App are intended for integrating data from multiple sources, views, or modalities (e.g. genomics, proteomics, clinical, and demographic data). Most existing software packages for multiview learning are decentralized and offer limited capabilities, making it difficult for users to perform comprehensive integrative analysis. The new package wraps statistical and machine learning methods and graphical tools, providing a convenient and easy data integration workflow. For users with limited programming experience, we provide a Shiny Application to facilitate data integration anywhere and on any device. The methods have potential to offer deeper insights into complex disease mechanisms. Availability and implementation mvlearnR is available from the following GitHub repository: https://github.com/lasandrall/mvlearnR. The web application is hosted on shinyapps.io and available at: https://multi-viewlearn.shinyapps.io/MultiView_Modeling/.
Affiliation(s)
- Elise F Palzer
  - Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, Minnesota 55414, United States
- Sandra E Safo
  - Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, Minnesota 55414, United States
4. Castro-Pearson S, Samorodnitsky S, Yang K, Lotfi-Emran S, Ingraham NE, Bramante C, Jones EK, Greising S, Yu M, Steffen B, Svensson J, Åhlberg E, Österberg B, Wacker D, Guan W, Puskarich M, Smed-Sörensen A, Lusczek E, Safo SE, Tignanelli CJ. Development of a proteomic signature associated with severe disease for patients with COVID-19 using data from 5 multicenter, randomized, controlled, and prospective studies. Sci Rep 2023; 13:20315. [PMID: 37985892 PMCID: PMC10661735 DOI: 10.1038/s41598-023-46343-1] [Received: 02/28/2023] [Accepted: 10/31/2023] [Indexed: 11/22/2023] Open
Abstract
Significant progress has been made in preventing severe COVID-19 disease through the development of vaccines. However, we still lack a validated baseline predictive biologic signature for the development of more severe disease in both outpatients and inpatients infected with SARS-CoV-2. The objective of this study was to develop and externally validate, via 5 international outpatient and inpatient trials and/or prospective cohort studies, a novel baseline proteomic signature, which predicts the development of moderate or severe (vs mild) disease in patients with COVID-19 from a proteomic analysis of 7000+ proteins. The secondary objective was exploratory, to identify (1) individual baseline protein levels and/or (2) protein level changes within the first 2 weeks of acute infection that are associated with the development of moderate/severe (vs mild) disease. For model development, samples collected from 2 randomized controlled trials were used. Plasma was isolated and the SomaLogic SomaScan platform was used to characterize protein levels for 7301 proteins of interest for all studies. We dichotomized 113 patients as having mild or moderate/severe COVID-19 disease. An elastic net approach was used to develop a predictive proteomic signature. For validation, we applied our signature to data from three independent prospective biomarker studies. We found 4110 proteins measured at baseline that significantly differed between patients with mild COVID-19 and those with moderate/severe COVID-19 after adjusting for multiple hypothesis testing. Baseline protein expression was associated with predicted disease severity with an error rate of 4.7% (AUC = 0.964). We also found that five proteins (Afamin, I-309, NKG2A, PRS57, LIPK) and patient age serve as a signature that separates patients with mild COVID-19 and patients with moderate/severe COVID-19 with an error rate of 1.77% (AUC = 0.9804). This panel was validated using data from 3 external studies with AUCs of 0.764 (Harvard University), 0.696 (University of Colorado), and 0.893 (Karolinska Institutet). In this study we developed and externally validated a baseline COVID-19 proteomic signature associated with disease severity for potential use in both outpatients and inpatients with COVID-19.
Affiliation(s)
- Sandra Castro-Pearson
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Sarah Samorodnitsky
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Kaifeng Yang
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Sahar Lotfi-Emran
  - Department of Medicine, University of Minnesota, Minneapolis, MN, USA
- Carolyn Bramante
  - Department of Medicine, University of Minnesota, Minneapolis, MN, USA
- Emma K Jones
  - Department of Surgery, University of Minnesota, 420 Delaware St SE, Minneapolis, MN, 55455, USA
- Sarah Greising
  - School of Kinesiology, University of Minnesota, Minneapolis, MN, USA
- Meng Yu
  - Division of Immunology and Allergy, Department of Medicine Solna, Center for Molecular Medicine, Karolinska Institutet and Karolinska University Hospital, Stockholm, Sweden
- Brian Steffen
  - Department of Surgery, University of Minnesota, 420 Delaware St SE, Minneapolis, MN, 55455, USA
- Julia Svensson
  - Division of Immunology and Allergy, Department of Medicine Solna, Center for Molecular Medicine, Karolinska Institutet and Karolinska University Hospital, Stockholm, Sweden
- Eric Åhlberg
  - Division of Immunology and Allergy, Department of Medicine Solna, Center for Molecular Medicine, Karolinska Institutet and Karolinska University Hospital, Stockholm, Sweden
- Björn Österberg
  - Division of Immunology and Allergy, Department of Medicine Solna, Center for Molecular Medicine, Karolinska Institutet and Karolinska University Hospital, Stockholm, Sweden
- David Wacker
  - Department of Medicine, University of Minnesota, Minneapolis, MN, USA
- Weihua Guan
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Michael Puskarich
  - Department of Emergency Medicine, University of Minnesota, Minneapolis, MN, USA
  - Department of Emergency Medicine, Hennepin County Medical Center, Minneapolis, MN, USA
- Anna Smed-Sörensen
  - Division of Immunology and Allergy, Department of Medicine Solna, Center for Molecular Medicine, Karolinska Institutet and Karolinska University Hospital, Stockholm, Sweden
- Elizabeth Lusczek
  - Department of Surgery, University of Minnesota, 420 Delaware St SE, Minneapolis, MN, 55455, USA
- Sandra E Safo
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Christopher J Tignanelli
  - Department of Surgery, University of Minnesota, 420 Delaware St SE, Minneapolis, MN, 55455, USA.
  - Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA.
5. Yang K, Kang Z, Guan W, Lotfi-Emran S, Mayer ZJ, Guerrero CR, Steffen BT, Puskarich MA, Tignanelli CJ, Lusczek E, Safo SE. Developing A Baseline Metabolomic Signature Associated with COVID-19 Severity: Insights from Prospective Trials Encompassing 13 U.S. Centers. Metabolites 2023; 13:1107. [PMID: 37999202 PMCID: PMC10672920 DOI: 10.3390/metabo13111107] [Received: 09/26/2023] [Revised: 10/14/2023] [Accepted: 10/16/2023] [Indexed: 11/25/2023] Open
Abstract
Metabolic disease is a significant risk factor for severe COVID-19 infection, but the contributing pathways are not yet fully elucidated. Using data from two randomized controlled trials across 13 U.S. academic centers, our goal was to characterize metabolic features that predict severe COVID-19 and define a novel baseline metabolomic signature. Individuals (n = 133) were dichotomized as having mild or moderate/severe COVID-19 disease based on the WHO ordinal scale. Blood samples were analyzed using the Biocrates platform, providing 630 targeted metabolites for analysis. Resampling techniques and machine learning models were used to determine metabolomic features associated with severe disease. Ingenuity Pathway Analysis (IPA) was used for functional enrichment analysis. To aid in clinical decision making, we created baseline metabolomics signatures of low-correlated molecules. Multivariable logistic regression models were fit to associate these signatures with severe disease on training data. A three-metabolite signature, lysophosphatidylcholine a C17:0, dihydroceramide (d18:0/24:1), and triacylglyceride (20:4_36:4), resulted in the best discrimination performance with an average test AUROC of 0.978 and F1 score of 0.942. Pathways related to amino acids were significantly enriched from the IPA analyses, and the mitogen-activated protein kinase kinase 5 (MAP2K5) was differentially activated between groups. In conclusion, metabolites related to lipid metabolism efficiently discriminated between mild vs. moderate/severe disease. SDMA and GABA demonstrated the potential to discriminate between these two groups as well. The mitogen-activated protein kinase kinase 5 (MAP2K5) regulator is differentially activated between groups, suggesting further investigation as a potential therapeutic pathway.
Affiliation(s)
- Kaifeng Yang
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
- Zhiyu Kang
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
- Weihua Guan
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
- Sahar Lotfi-Emran
  - Department of Medicine, University of Minnesota, Minneapolis, MN 55455, USA
- Zachary J. Mayer
  - Center for Metabolomics and Proteomics, University of Minnesota, Minneapolis, MN 55455, USA
- Candace R. Guerrero
  - Center for Metabolomics and Proteomics, University of Minnesota, Minneapolis, MN 55455, USA
- Brian T. Steffen
  - Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Michael A. Puskarich
  - Department of Emergency Medicine, University of Minnesota, Minneapolis, MN 55455, USA
  - Department of Emergency Medicine, Hennepin County Medical Center, Minneapolis, MN 55455, USA
- Christopher J. Tignanelli
  - Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
  - Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, USA
- Elizabeth Lusczek
  - Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Sandra E. Safo
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
6. Safo SE, Haine L, Baker J, Reilly C, Duprez D, Neaton JD, Jain MK, Arenas‐Pinto A, Polizzotto M, Staub T. Derivation of a Protein Risk Score for Cardiovascular Disease Among a Multiracial and Multiethnic HIV+ Cohort. J Am Heart Assoc 2023; 12:e027273. [PMID: 37345752 PMCID: PMC10356060 DOI: 10.1161/jaha.122.027273] [Received: 06/22/2022] [Accepted: 02/28/2023] [Indexed: 06/23/2023]
Abstract
Background Cardiovascular disease risk prediction models underestimate CVD risk in people living with HIV (PLWH). Our goal is to derive a risk score based on protein biomarkers that could be used to predict CVD in PLWH. Methods and Results In a matched case-control study, we analyzed normalized protein expression data for participants enrolled in 1 of 4 trials conducted by INSIGHT (International Network for Strategic Initiatives in Global HIV Trials). We used dimension reduction, variable selection and resampling methods, and multivariable conditional logistic regression models to determine candidate protein biomarkers and to generate a protein score for predicting CVD in PLWH. We internally validated our findings using bootstrap. A protein score that was derived from 8 proteins (including HGF [hepatocyte growth factor] and interleukin-6) was found to be associated with an increased risk of CVD after adjustment for CVD and HIV factors (odds ratio: 2.17 [95% CI: 1.58-2.99]). The protein score improved CVD prediction when compared with predicting CVD risk using the individual proteins that comprised the protein score. Individuals with a protein score above the median score were 3.10 (95% CI, 1.83-5.41) times more likely to develop CVD than those with a protein score below the median score. Conclusions A panel of blood biomarkers may help identify PLWH at a high risk for developing CVD. If validated, such a score could be used in conjunction with established factors to identify CVD at-risk individuals who might benefit from aggressive risk reduction, ultimately shedding light on CVD pathogenesis in PLWH.
Affiliation(s)
- Jason Baker
  - Hennepin County Medical Center, Minneapolis, MN, USA
- Alejandro Arenas‐Pinto
  - MRC Clinical Trials Unit at University College London Institute of Clinical Trials & Methodology, London, UK
7. Lipman D, Safo SE, Chekouo T. Integrative multi-omics approach for identifying molecular signatures and pathways and deriving and validating molecular scores for COVID-19 severity and status. BMC Genomics 2023; 24:319. [PMID: 37308820 PMCID: PMC10259816 DOI: 10.1186/s12864-023-09410-5] [Received: 02/04/2023] [Accepted: 05/25/2023] [Indexed: 06/14/2023] Open
Abstract
BACKGROUND There is still more to learn about the pathobiology of COVID-19. A multi-omic approach offers a holistic view to better understand the mechanisms of COVID-19. We used state-of-the-art statistical learning methods to integrate genomics, metabolomics, proteomics, and lipidomics data obtained from 123 patients experiencing COVID-19 or COVID-19-like symptoms for the purpose of identifying molecular signatures and corresponding pathways associated with the disease. RESULTS We constructed and validated molecular scores and evaluated their utility beyond clinical factors known to impact disease status and severity. We identified inflammation- and immune response-related pathways, and other pathways, providing insights into possible consequences of the disease. CONCLUSIONS The molecular scores we derived were strongly associated with disease status and severity and can be used to identify individuals at a higher risk for developing severe disease. These findings have the potential to provide further, and needed, insights into why certain individuals develop worse outcomes.
Affiliation(s)
- Danika Lipman
  - Department of Mathematics and Statistics, University of Calgary, Calgary, Canada
- Sandra E Safo
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minnesota, USA.
- Thierry Chekouo
  - Department of Mathematics and Statistics, University of Calgary, Calgary, Canada.
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minnesota, USA.
8. Reilly CS, Borges ÁH, Baker JV, Safo SE, Sharma S, Polizzotto MN, Pankow JS, Hu X, Sherman BT, Babiker AG, Lundgren JD, Lane HC. Investigation of Causal Effects of Protein Biomarkers on Cardiovascular Disease in Persons With HIV. J Infect Dis 2023; 227:951-960. [PMID: 36580481 PMCID: PMC10319949 DOI: 10.1093/infdis/jiac496] [Received: 08/19/2022] [Revised: 12/19/2022] [Accepted: 12/28/2022] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND There is an incompletely understood increased risk for cardiovascular disease (CVD) among people with HIV (PWH). We investigated if a collection of biomarkers were associated with CVD among PWH. Mendelian randomization (MR) was used to identify potentially causal associations. METHODS Data from follow-up in 4 large trials among PWH were used to identify 131 incident CVD cases and they were matched to 259 participants without incident CVD (controls). Tests of associations between 460 baseline protein levels and case status were conducted. RESULTS Univariate analysis found CLEC6A, HGF, IL-6, IL-10RB, and IGFBP7 as being associated with case status and a multivariate model identified 3 of these: CLEC6A (odds ratio [OR] = 1.48, P = .037), HGF (OR = 1.83, P = .012), and IL-6 (OR = 1.45, P = .016). MR methods identified 5 significantly associated proteins: AXL, CHI3L1, GAS6, IL-6RA, and SCGB3A2. CONCLUSIONS These results implicate inflammatory and fibrotic processes as contributing to CVD. While some of these biomarkers are well established in the general population and in PWH (IL-6 and its receptor), some are novel to PWH (HGF, AXL, and GAS6) and some are novel overall (CLEC6A). Further investigation into the uniqueness of these biomarkers in PWH and the role of these biomarkers as targets among PWH is warranted.
Affiliation(s)
- Cavan S Reilly
  - Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA
- Jason V Baker
  - HIV Medicine, Infectious Diseases, Hennepin County Medical Center, Minneapolis, Minnesota, USA
- Sandra E Safo
  - Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA
- Shweta Sharma
  - Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA
- Mark N Polizzotto
  - Department of Medicine, Australian National University, Canberra, Australia
- James S Pankow
  - Division of Epidemiology and Community Health, University of Minnesota, Minneapolis, Minnesota, USA
- Xiaojun Hu
  - Animal and Plant Inspection Service, US Department of Agriculture, Beltsville, Maryland, USA
- Brad T Sherman
  - Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratories, Frederick, Maryland, USA
- Abdel G Babiker
  - Epidemiology and Medical Statistics, University College London, London, United Kingdom
- Jens D Lundgren
  - Department of Infectious Diseases, University of Copenhagen, Copenhagen, Denmark
- H Clifford Lane
  - Division of Clinical Research, National Institutes of Allergy and Infectious Diseases, Bethesda, Maryland, USA
9. Zhang W, Wendt C, Bowler R, Hersh CP, Safo SE. Robust integrative biclustering for multi-view data. Stat Methods Med Res 2022; 31:2201-2216. [PMID: 36113157 PMCID: PMC10153449 DOI: 10.1177/09622802221122427] [Indexed: 11/16/2022]
Abstract
In many biomedical studies, multiple views of data (e.g. genomics, proteomics) are available, and a particular interest might be the detection of sample subgroups characterized by specific groups of variables. Biclustering methods are well-suited for this problem as they assume that specific groups of variables might be relevant only to specific groups of samples. Many biclustering methods exist for detecting row-column clusters in a view but few methods exist for data from multiple views. The few existing algorithms are heavily dependent on regularization parameters for getting row-column clusters, and they impose an unnecessary burden on users, thus limiting their use in practice. We extend an existing biclustering method based on sparse singular value decomposition for single-view data to data from multiple views. Our method, integrative sparse singular value decomposition (iSSVD), incorporates stability selection to control Type I error rates, estimates the probability of samples and variables to belong to a bicluster, finds stable biclusters, and results in interpretable row-column associations. Simulations and real data analyses show that integrative sparse singular value decomposition outperforms several other single- and multi-view biclustering methods and is able to detect meaningful biclusters. iSSVD is a user-friendly, computationally efficient algorithm that will be useful in many disease subtyping applications.
Affiliation(s)
- Weijie Zhang
  - Division of Biostatistics, University of Minnesota, MN, USA
- Christine Wendt
  - Division of Pulmonary, Allergy and Critical Care, University of Minnesota, MN, USA
- Russel Bowler
  - Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, National Jewish Health, Denver, USA
- Craig P Hersh
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, USA
- Sandra E Safo
  - Division of Biostatistics, University of Minnesota, MN, USA
10. Palzer EF, Wendt CH, Bowler RP, Hersh CP, Safo SE, Lock EF. sJIVE: Supervised Joint and Individual Variation Explained. Comput Stat Data Anal 2022; 175:107547. [PMID: 36119152 PMCID: PMC9481062 DOI: 10.1016/j.csda.2022.107547] [Indexed: 11/03/2022]
Abstract
Analyzing multi-source data, which are multiple views of data on the same subjects, has become increasingly common in molecular biomedical research. Recent methods have sought to uncover underlying structure and relationships within and/or between the data sources, and other methods have sought to build a predictive model for an outcome using all sources. However, existing methods that do both are presently limited because they either (1) only consider data structure shared by all datasets while ignoring structures unique to each source, or (2) they extract underlying structures first without consideration to the outcome. The proposed method, supervised joint and individual variation explained (sJIVE), can simultaneously (1) identify shared (joint) and source-specific (individual) underlying structure and (2) build a linear prediction model for an outcome using these structures. These two components are weighted to compromise between explaining variation in the multi-source data and in the outcome. Simulations show sJIVE to outperform existing methods when large amounts of noise are present in the multi-source data. An application to data from the COPDGene study explores gene expression and proteomic patterns associated with lung function.
Affiliation(s)
- Elise F. Palzer
  - Division of Biostatistics, University of Minnesota, Minneapolis, 55455, USA
- Christine H. Wendt
  - Division of Pulmonary, Allergy and Critical Care, University of Minnesota, Minneapolis, 55455, USA
- Russell P. Bowler
  - Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, National Jewish Health, Denver, CO, USA
- Craig P. Hersh
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
- Sandra E. Safo
  - Division of Biostatistics, University of Minnesota, Minneapolis, 55455, USA
- Eric F. Lock
  - Division of Biostatistics, University of Minnesota, Minneapolis, 55455, USA
11. Hilafu H, Safo SE. Sparse sliced inverse regression for high dimensional data analysis. BMC Bioinformatics 2022; 23:168. [PMID: 35525975 PMCID: PMC9080177 DOI: 10.1186/s12859-022-04700-3] [Received: 11/29/2021] [Accepted: 04/21/2022] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Dimension reduction and variable selection play a critical role in the analysis of contemporary high-dimensional data. The semi-parametric multi-index model often serves as a reasonable model for analysis of such high-dimensional data. The sliced inverse regression (SIR) method, which can be formulated as a generalized eigenvalue decomposition problem, offers a model-free estimation approach for the indices in the semi-parametric multi-index model. Obtaining sparse estimates of the eigenvectors that constitute the basis matrix that is used to construct the indices is desirable to facilitate variable selection, which in turn facilitates interpretability and model parsimony. RESULTS To this end, we propose a group-Dantzig selector type formulation that induces row-sparsity to the sliced inverse regression dimension reduction vectors. Extensive simulation studies are carried out to assess the performance of the proposed method, and compare it with other state of the art methods in the literature. CONCLUSION The proposed method is shown to yield competitive estimation, prediction, and variable selection performance. Three real data applications, including a metabolomics depression study, are presented to demonstrate the method's effectiveness in practice.
Affiliation(s)
- Haileab Hilafu: Department of Business Analytics and Statistics, University of Tennessee, Knoxville, TN 37996 USA
- Sandra E. Safo: Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455 USA
|
12
|
Lipman D, Safo SE, Chekouo T. Multi-omic analysis reveals enriched pathways associated with COVID-19 and COVID-19 severity. PLoS One 2022; 17:e0267047. [PMID: 35468151 PMCID: PMC9038205 DOI: 10.1371/journal.pone.0267047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2021] [Accepted: 03/31/2022] [Indexed: 11/23/2022] Open
Abstract
COVID-19 is a disease characterized by its seemingly unpredictable clinical outcomes. To better understand the molecular signature of the disease, a recent multi-omics study examined correlations between biomolecules and used a tree-based machine learning approach to predict clinical outcomes. That study specifically looked at patients admitted to the hospital with COVID-19 or COVID-19-like symptoms. In this paper we examine the same multi-omics data but take a different approach: we identify stable molecules of interest for further pathway analysis. We used stability selection, regularized regression models, enrichment analysis, and principal components analysis on proteomics, metabolomics, lipidomics, and RNA sequencing data to determine key molecules and biological pathways in disease severity and disease status. In addition to the individual omics analyses, we apply the integrative method Sparse Multiple Canonical Correlation Analysis to analyse relationships among the different views of the data. Our findings suggest that COVID-19 status is associated with the cell cycle and death, as well as the inflammatory response. This relationship is reflected in all four sets of molecules analyzed. We further observe that metabolic processes, particularly those related to vitamin absorption and cholesterol, are implicated in COVID-19 status and severity.
Affiliation(s)
- Danika Lipman: Department of Mathematics and Statistics, University of Calgary, Calgary, Alberta, Canada
- Sandra E. Safo: Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, United States of America
- Thierry Chekouo: Department of Mathematics and Statistics, University of Calgary, Calgary, Alberta, Canada; Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, AB, Canada
|
13
|
Wang J, Safo SE. Deep IDA: A Deep Learning Method for Integrative Discriminant Analysis of Multi-View Data with Feature Ranking-An Application to COVID-19 severity. ArXiv 2021:arXiv:2111.09964v2. [PMID: 34815984 PMCID: PMC8609900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Revised: 11/24/2021] [Indexed: 12/27/2022]
Abstract
COVID-19 severity is due to complications from SARS-CoV-2, but the clinical course of the infection varies for individuals, emphasizing the need to better understand the disease at the molecular level. We use clinical and multiple molecular data (or views) obtained from patients with and without COVID-19 who were (or were not) admitted to the intensive care unit to shed light on COVID-19 severity. Methods for jointly associating the views and separating the COVID-19 groups (i.e., one-step methods) have focused on linear relationships. The relationships between the views and COVID-19 patient groups, however, are too complex to be understood solely by linear methods. Existing nonlinear one-step methods cannot be used to identify signatures to aid in our understanding of the complexity of the disease. We propose Deep IDA (Integrative Discriminant Analysis) to address analytical challenges in our problem of interest. Deep IDA learns nonlinear projections of two or more views that maximally associate the views and separate the classes in each view, and permits feature ranking for interpretable findings. Our applications demonstrate that Deep IDA has competitive classification rates compared to other state-of-the-art methods and is able to identify molecular signatures that facilitate an understanding of COVID-19 severity.
Affiliation(s)
- Jiuzhou Wang: Division of Biostatistics, University of Minnesota, MN
- Sandra E Safo: Division of Biostatistics, University of Minnesota, MN
|
14
|
Chekouo T, Safo SE. Bayesian integrative analysis and prediction with application to atherosclerosis cardiovascular disease. Biostatistics 2021; 24:124-139. [PMID: 33969382 PMCID: PMC9960952 DOI: 10.1093/biostatistics/kxab016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/23/2020] [Revised: 03/31/2021] [Accepted: 04/12/2021] [Indexed: 12/16/2022] Open
Abstract
The problem of associating data from multiple sources and predicting an outcome simultaneously is an important one in modern biomedical research. It has the potential to identify a multidimensional array of variables predictive of a clinical outcome and to enhance our understanding of the pathobiology of complex diseases. Incorporating functional knowledge in association and prediction models can reveal pathways contributing to disease risk. We propose Bayesian hierarchical integrative analysis models that associate multiple omics data, predict a clinical outcome, allow for prior functional information, and can accommodate clinical covariates. The models, motivated by available data and the need for exploring other risk factors of atherosclerotic cardiovascular disease (ASCVD), are used for integrative analysis of clinical, demographic, and genomics data to identify genetic variants, genes, and gene pathways likely contributing to 10-year ASCVD risk in healthy adults. Our findings revealed several genetic variants, genes, and gene pathways that are highly associated with ASCVD risk, with some already implicated in cardiovascular disease (CVD) risk. Extensive simulations demonstrate the merit of joint association and prediction models over two-stage methods: association followed by prediction.
|
15
|
Safo SE, Min EJ, Haine L. Sparse linear discriminant analysis for multiview structured data. Biometrics 2021; 78:612-623. [PMID: 33739448 DOI: 10.1111/biom.13458] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/05/2020] [Revised: 02/15/2021] [Accepted: 03/04/2021] [Indexed: 11/28/2022]
Abstract
Classification methods that leverage the strengths of data from multiple sources (multiview data) simultaneously have enormous potential to yield more powerful findings than two-step methods: association followed by classification. We propose two methods, sparse integrative discriminant analysis (SIDA), and SIDA with incorporation of network information (SIDANet), for joint association and classification studies. The methods consider the overall association between multiview data, and the separation within each view in choosing discriminant vectors that are associated and optimally separate subjects into different classes. SIDANet is among the first methods to incorporate prior structural information in joint association and classification studies. It uses the normalized Laplacian of a graph to smooth coefficients of predictor variables, thus encouraging selection of predictors that are connected. We demonstrate the effectiveness of our methods on a set of synthetic datasets and explore their use in identifying potential nontraditional risk factors that discriminate healthy patients at low versus high risk for developing atherosclerosis cardiovascular disease in 10 years. Our findings underscore the benefit of joint association and classification methods if the goal is to correlate multiview data and to perform classification.
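The graph-smoothing idea mentioned in this abstract can be illustrated with a toy calculation. The sketch below is not the SIDANet implementation: it assumes an unweighted, symmetric adjacency matrix, and the function names `normalized_laplacian` and `smoothness_penalty` are hypothetical. It builds L = I - D^{-1/2} A D^{-1/2} and evaluates the quadratic form b'Lb, which is small when connected predictors have similar degree-scaled coefficients.

```python
import math


def normalized_laplacian(adj):
    """Return L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix.

    Isolated nodes (degree 0) get an all-zero row and column.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                lap[i][j] = 1.0 if deg[i] > 0 else 0.0
            elif adj[i][j] and deg[i] > 0 and deg[j] > 0:
                lap[i][j] = -adj[i][j] / math.sqrt(deg[i] * deg[j])
    return lap


def smoothness_penalty(beta, lap):
    """Quadratic form beta' L beta used as a graph-smoothness penalty."""
    n = len(beta)
    return sum(beta[i] * lap[i][j] * beta[j] for i in range(n) for j in range(n))


# Toy path graph 0 - 1 - 2 (illustrative data, not from the paper).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
lap = normalized_laplacian(adj)

# Coefficients proportional to sqrt(degree) are "smooth" on this graph.
smooth = smoothness_penalty([1.0, math.sqrt(2.0), 1.0], lap)
# Flipping the sign of the middle coefficient breaks the smoothness.
rough = smoothness_penalty([1.0, -math.sqrt(2.0), 1.0], lap)
```

On this toy graph the smooth vector gives a penalty of approximately 0 and the sign-flipped vector gives approximately 8, which shows why such a penalty favors selecting connected predictors together.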
Affiliation(s)
- Sandra E Safo: Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA
- Eun Jeong Min: Department of Medical Life Sciences, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
- Lillian Haine: Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA
|
16
|
Groene EA, Valeris-Chacin RJ, Stadelman AM, Safo SE, Cusick SE. Maternal HIV and child anthropometric outcomes over time: an analysis of Zimbabwe demographic health surveys. AIDS 2021; 35:477-484. [PMID: 33252491 PMCID: PMC7855570 DOI: 10.1097/qad.0000000000002772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
Abstract
OBJECTIVES To understand the association between children's anthropometric measures and maternal HIV status in Zimbabwe and to determine whether these relationships changed over time. DESIGN Data from Demographic Health Surveys in Zimbabwe rounds 2005, 2010, and 2015 were used to conduct cross-sectional analyses of child anthropometric measures (stunting, underweight, and wasting). METHODS Using separate logistic regression models for each of the anthropometric measures, we estimated the adjusted prevalence odds ratio (OR) of stunting, underweight, and wasting in children according to maternal HIV status. Moreover, we evaluated an interaction by survey year to evaluate change over time. RESULTS Children of mothers with HIV had 32% greater odds [OR = 1.32, 95% confidence interval (CI) 1.16-1.5] of stunting, 27% greater odds (OR = 1.27, 95% CI 1.1-1.48) of underweight status and 7% greater odds (OR = 1.07, 95% CI 0.81-1.42) of wasting status, than children of mothers without HIV. These associations between maternal HIV status and child undernutrition did not differ by year (P > 0.05 for all interaction terms). CONCLUSION In Zimbabwe, having a mother who tested positive for HIV at the time of the survey has been associated with greater child undernutrition over the last two decades with no significant change by survey round. This emphasizes the need for continued programming to address nutritional deficiencies, sanitation, and infectious disease prevention in this high-risk population. The greatest impact of maternal HIV status has been on child stunting and underweight, associated with poor long-term child development.
Affiliation(s)
- Emily A Groene: Division of Epidemiology and Community Health, University of Minnesota School of Public Health, Minneapolis
- Anna M Stadelman: Division of Epidemiology and Community Health, University of Minnesota School of Public Health, Minneapolis
- Sandra E Safo: Division of Biostatistics, University of Minnesota School of Public Health
- Sarah E Cusick: University of Minnesota School of Medicine, Minneapolis, Minnesota, USA
|
17
|
Abstract
BACKGROUND The problem of assessing associations between multiple omics data, including genomics and metabolomics data, to identify biomarkers potentially predictive of complex diseases has garnered considerable research interest. A popular epidemiology approach is to consider the association of each predictor with each response using a univariate linear regression model, and to select predictors that meet an a priori specified significance level. Although this approach is simple and intuitive, it tends to require a larger sample size, which is costly. It also assumes variables for each data type are independent, and thus ignores correlations that exist between variables both within each data type and across the data types. RESULTS We consider a multivariate linear regression model that relates multiple predictors to multiple responses, with the aim of identifying multiple relevant predictors that are simultaneously associated with the responses. We assume the coefficient matrix of the responses on the predictors is both row-sparse and of low rank, and propose a group Dantzig type formulation to estimate the coefficient matrix. CONCLUSION Extensive simulations demonstrate the competitive performance of our proposed method when compared to existing methods in terms of estimation, prediction, and variable selection. We use the proposed method to integrate genomics and metabolomics data to identify genetic variants that are potentially predictive of atherosclerosis cardiovascular disease (ASCVD) beyond well-established risk factors. Our analysis shows some genetic variants that improve prediction of ASCVD beyond some well-established factors of ASCVD, and also suggests a potential utility of the identified genetic variants in explaining possible association between certain metabolites and ASCVD.
Affiliation(s)
- Haileab Hilafu: Department of Business Analytics and Statistics, University of Tennessee, Knoxville, 37996 TN USA
- Sandra E. Safo: Division of Biostatistics, University of Minnesota, Minneapolis, 55455 MN USA
- Lillian Haine: Division of Biostatistics, University of Minnesota, Minneapolis, 55455 MN USA
|
18
|
Staimez LR, Rhee MK, Deng Y, Safo SE, Butler SM, Legvold BT, Jackson SL, Ford CN, Wilson PWF, Long Q, Phillips LS. Retinopathy develops at similar glucose levels but higher HbA1c levels in people with black African ancestry compared to white European ancestry: evidence for the need to individualize HbA1c interpretation. Diabet Med 2020; 37:1049-1057. [PMID: 32125000 DOI: 10.1111/dme.14289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/26/2020] [Indexed: 11/29/2022]
Abstract
AIMS To examine the association of HbA1c and glucose levels with incident diabetic retinopathy according to black African or white European ancestry. METHODS In this retrospective cohort study of 202 500 US Veterans with diabetes (2000-2014), measures included HbA1c, outpatient random serum/plasma glucose, and incident retinopathy [conversion from negative to ≥2 positive evaluations (ICD-9 codes), without a subsequent negative]. RESULTS At baseline, the study population had a mean age of 59.3 years, their mean BMI was 31.9 kg/m², HbA1c level was 57 mmol/mol (7.4%) and glucose level was 8.8 mmol/l, and 77% were of white European ancestry (white individuals) and 21% of black African ancestry (black individuals). HbA1c was 0.3% higher in black vs white individuals (P < 0.001), adjusting for baseline age, sex, BMI, estimated glomerular filtration rate (eGFR), haemoglobin, and average systolic blood pressure and glucose. Over 11 years, incident retinopathy occurred in 9% of black and 7% of white individuals, but black individuals had higher HbA1c, glucose, and systolic blood pressure (all P < 0.001); adjusted for these factors, incident retinopathy was reduced in black vs white individuals (P < 0.001). The population incidence of retinopathy (7%) was associated with higher mean baseline HbA1c in individuals with black vs white ancestry [63 mmol/mol (7.9%) vs 58 mmol/mol (7.5%); P < 0.001], but with similar baseline glucose levels (9.0 vs 9.0 mmol/l; P = 0.660, all adjusted for baseline age, sex and BMI). CONCLUSIONS Since retinopathy occurs at higher HbA1c levels in black people for a given level of average plasma glucose, strategies may be needed to individualize the interpretation of HbA1c measurements.
Affiliation(s)
- L R Staimez: Hubert Department of Global Health, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- M K Rhee: Atlanta Veterans Affairs Medical Centre, Decatur, GA, USA; Division of Endocrinology and Metabolism, Department of Medicine, School of Medicine, Emory University, Atlanta, GA, USA
- Y Deng: Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- S E Safo: Division of Biostatistics, University of Minnesota, Minneapolis, MN, USA
- S M Butler: Department of Medicine, School of Medicine, Emory University, Atlanta, GA, USA
- B T Legvold: Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- S L Jackson: Division for Heart Disease and Stroke Prevention, Centers for Disease Control and Prevention, Atlanta, GA, USA
- C N Ford: Hubert Department of Global Health, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- P W F Wilson: Hubert Department of Global Health, Rollins School of Public Health, Emory University, Atlanta, GA, USA; Atlanta Veterans Affairs Medical Centre, Decatur, GA, USA; Cardiology Division, Department of Medicine, Emory University School of Medicine, Atlanta, GA, USA
- Q Long: Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- L S Phillips: Hubert Department of Global Health, Rollins School of Public Health, Emory University, Atlanta, GA, USA; Atlanta Veterans Affairs Medical Centre, Decatur, GA, USA; Division of Endocrinology and Metabolism, Department of Medicine, School of Medicine, Emory University, Atlanta, GA, USA
|
19
|
Min EJ, Safo SE, Long Q. Penalized co-inertia analysis with applications to -omics data. Bioinformatics 2019; 35:1018-1025. [PMID: 30165424 DOI: 10.1093/bioinformatics/bty726] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/07/2017] [Revised: 04/01/2018] [Accepted: 08/23/2018] [Indexed: 12/14/2022] Open
Abstract
MOTIVATION Co-inertia analysis (CIA) is a multivariate statistical analysis method that can assess relationships and trends in two sets of data. Recently CIA has been used for an integrative analysis of multiple high-dimensional omics data. However, for classical CIA, all elements in the loading vectors are nonzero, presenting a challenge for the interpretation when analyzing omics data. For other multivariate statistical methods such as canonical correlation analysis (CCA), penalized least squares (PLS), various approaches have been proposed to produce sparse loading vectors via l1-penalization/constraint. We propose a novel CIA method that uses l1-penalization to induce sparsity in estimators of loading vectors. Our method simultaneously conducts model fitting and variable selection. Also, we propose another CIA method that incorporates structure/network information such as those from functional genomics, besides using sparsity penalty so that one can get biologically meaningful and interpretable results. RESULTS Extensive simulations demonstrate that our proposed penalized CIA methods achieve the best or close to the best performance compared to the existing CIA method in terms of feature selection and recovery of true loading vectors. Also, we apply our methods to the integrative analysis of gene expression data and protein abundance data from the NCI-60 cancer cell lines. Our analysis of the NCI-60 cancer cell line data reveals meaningful variables for cancer diseases and biologically meaningful results that are consistent with previous studies. AVAILABILITY AND IMPLEMENTATION Our algorithms are implemented as an R package which is freely available at: https://www.med.upenn.edu/long-lab/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
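A common building block for producing sparse loading vectors under an l1 penalty is the soft-thresholding operator. The sketch below is generic and is not the penalized CIA algorithm from the paper; the function name `soft_threshold`, the example loadings, and the threshold value are illustrative assumptions.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values with |x| <= lam become exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


# Hypothetical dense loading vector (not data from the paper).
loadings = [0.9, -0.05, 0.3, 0.02, -0.6]
lam = 0.1  # assumed penalty level

# Small entries are set exactly to zero, which performs variable selection.
sparse = [soft_threshold(v, lam) for v in loadings]
selected = [i for i, v in enumerate(sparse) if v != 0.0]
```

Here the second and fourth entries are zeroed out, so only the variables at positions 0, 2, and 4 remain in the loading vector.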
Affiliation(s)
- Eun Jeong Min: Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Sandra E Safo: Division of Biostatistics, University of Minnesota, Minneapolis, MN, USA
- Qi Long: Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
|
20
|
Safo SE, Ahn J, Jeon Y, Jung S. Sparse generalized eigenvalue problem with application to canonical correlation analysis for integrative analysis of methylation and gene expression data. Biometrics 2018; 74:1362-1371. [DOI: 10.1111/biom.12886] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 07/01/2016] [Revised: 03/01/2018] [Accepted: 03/01/2018] [Indexed: 11/29/2022]
Affiliation(s)
- Sandra E. Safo: Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, U.S.A
- Jeongyoun Ahn: Department of Statistics, University of Georgia, Athens, Georgia, U.S.A
- Yongho Jeon: Department of Applied Statistics, Yonsei University, Seoul, South Korea
- Sungkyu Jung: Department of Statistics, University of Pittsburgh, Pittsburgh, Pennsylvania, U.S.A
|
21
|
Affiliation(s)
- Sandra E. Safo: Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota
- Qi Long: Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
|
22
|
Rhee MK, Safo SE, Jackson SL, Xue W, Olson DE, Long Q, Barb D, Haw JS, Tomolo AM, Phillips LS. Inpatient Glucose Values: Determining the Nondiabetic Range and Use in Identifying Patients at High Risk for Diabetes. Am J Med 2018; 131:443.e11-443.e24. [PMID: 28993187 DOI: 10.1016/j.amjmed.2017.09.021] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/07/2017] [Revised: 07/31/2017] [Accepted: 09/12/2017] [Indexed: 01/02/2023]
Abstract
BACKGROUND Many individuals with diabetes remain undiagnosed, leading to delays in treatment and higher risk for subsequent diabetes complications. Despite recommendations for diabetes screening in high-risk groups, the optimal approach is not known. We evaluated the utility of inpatient glucose levels as an opportunistic screening tool for identifying patients at high risk for diabetes. METHODS We retrospectively examined 462,421 patients in the US Department of Veterans Affairs healthcare system, hospitalized on medical/surgical services in 2000-2010, for ≥3 days, with ≥2 inpatient random plasma glucose (RPG) measurements. All had continuity of care: ≥1 primary care visit and ≥1 glucose measurement within 2 years before hospitalization and yearly for ≥3 years after discharge. Glucose levels during hospitalization and incidence of diabetes within 3 years after discharge in patients without diabetes were evaluated. RESULTS Patients had a mean age of 65.0 years, body mass index of 29.9 kg/m2, and were 96% male, 71% white, and 18% black. Pre-existing diabetes was present in 39.4%, 1.3% were diagnosed during hospitalization, 8.1% were diagnosed 5 years after discharge, and 51.3% were never diagnosed (NonDM). The NonDM group had the lowest mean hospital RPG value (112 mg/dL [6.2 mmol/L]). Having at least 2 RPG values >140 mg/dL (>7.8 mmol/L), the 95th percentile of NonDM hospital glucose, provided 81% specificity for identifying incident diabetes within 3 years after discharge. CONCLUSIONS Screening for diabetes could be considered in patients with at least 2 hospital glucose values at/above the 95th percentile of the nondiabetic range (141 mg/dL [7.8 mmol/L]).
Affiliation(s)
- Mary K Rhee: Medical Subspecialty/Endocrinology, Atlanta VA Medical Center, Decatur, Ga; Division of Endocrinology, Metabolism and Lipids, Department of Medicine, Emory University School of Medicine, Atlanta, Ga.
- Sandra E Safo: Medical Subspecialty/Endocrinology, Atlanta VA Medical Center, Decatur, Ga; Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, Ga
- Sandra L Jackson: Medical Subspecialty/Endocrinology, Atlanta VA Medical Center, Decatur, Ga; Nutrition and Health Sciences, Graduate Division of Biological and Biomedical Sciences, Emory University, Atlanta, Ga; Division of Heart Disease and Stroke Prevention, National Center for Chronic Disease Prevention and Health Promotion, Centers for Disease Control and Prevention, Atlanta, Ga
- Wenqiong Xue: Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, Ga; Boehringer Ingelheim Pharmaceuticals, Inc., Ridgefield, Conn
- Darin E Olson: Medical Subspecialty/Endocrinology, Atlanta VA Medical Center, Decatur, Ga; Division of Endocrinology, Metabolism and Lipids, Department of Medicine, Emory University School of Medicine, Atlanta, Ga
- Qi Long: Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, Ga; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia
- Diana Barb: Division of Endocrinology, Metabolism and Lipids, Department of Medicine, Emory University School of Medicine, Atlanta, Ga; Division of Endocrinology, Diabetes and Metabolism, Department of Medicine, University of Florida College of Medicine, Gainesville
- J Sonya Haw: Division of Endocrinology, Metabolism and Lipids, Department of Medicine, Emory University School of Medicine, Atlanta, Ga
- Anne M Tomolo: Medical Subspecialty/Endocrinology, Atlanta VA Medical Center, Decatur, Ga; Division of General Internal Medicine and Geriatrics, Department of Medicine, Emory University School of Medicine, Atlanta, Ga
- Lawrence S Phillips: Medical Subspecialty/Endocrinology, Atlanta VA Medical Center, Decatur, Ga; Division of Endocrinology, Metabolism and Lipids, Department of Medicine, Emory University School of Medicine, Atlanta, Ga
|
23
|
Safo SE, Li S, Long Q. Integrative analysis of transcriptomic and metabolomic data via sparse canonical correlation analysis with incorporation of biological information. Biometrics 2018; 74:300-312. [PMID: 28482123 PMCID: PMC5677597 DOI: 10.1111/biom.12715] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 04/01/2016] [Revised: 03/01/2017] [Accepted: 04/01/2017] [Indexed: 01/09/2023]
Abstract
Integrative analysis of high dimensional omics data is becoming increasingly popular. At the same time, incorporating known functional relationships among variables in analysis of omics data has been shown to help elucidate underlying mechanisms for complex diseases. In this article, our goal is to assess association between transcriptomic and metabolomic data from a Predictive Health Institute (PHI) study that includes healthy adults at a high risk of developing cardiovascular diseases. Adopting a strategy that is both data-driven and knowledge-based, we develop statistical methods for sparse canonical correlation analysis (CCA) with incorporation of known biological information. Our proposed methods use prior network structural information among genes and among metabolites to guide selection of relevant genes and metabolites in sparse CCA, providing insight on the molecular underpinning of cardiovascular disease. Our simulations demonstrate that the structured sparse CCA methods outperform several existing sparse CCA methods in selecting relevant genes and metabolites when structural information is informative and are robust to mis-specified structural information. Our analysis of the PHI study reveals that a number of gene and metabolic pathways including some known to be associated with cardiovascular diseases are enriched in the set of genes and metabolites selected by our proposed approach.
Affiliation(s)
- Sandra E Safo: Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, U.S.A
- Shuzhao Li: Department of Medicine, Division of Pulmonary, Allergy and Critical Care Medicine, Emory University, Atlanta, Georgia, U.S.A
- Qi Long: Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, U.S.A
|
24
|
Li Z, Safo SE, Long Q. Incorporating biological information in sparse principal component analysis with application to genomic data. BMC Bioinformatics 2017; 18:332. [PMID: 28697740 PMCID: PMC5504598 DOI: 10.1186/s12859-017-1740-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 12/06/2016] [Accepted: 06/22/2017] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Sparse principal component analysis (PCA) is a popular tool for dimensionality reduction, pattern recognition, and visualization of high dimensional data. It has been recognized that complex biological mechanisms occur through concerted relationships of multiple genes working in networks that are often represented by graphs. Recent work has shown that incorporating such biological information improves feature selection and prediction performance in regression analysis, but there has been limited work on extending this approach to PCA. In this article, we propose two new sparse PCA methods called Fused and Grouped sparse PCA that enable incorporation of prior biological information in variable selection. RESULTS Our simulation studies suggest that, compared to existing sparse PCA methods, the proposed methods achieve higher sensitivity and specificity when the graph structure is correctly specified, and are fairly robust to misspecified graph structures. Application to a glioblastoma gene expression dataset identified pathways that are suggested in the literature to be related with glioblastoma. CONCLUSIONS The proposed sparse PCA methods Fused and Grouped sparse PCA can effectively incorporate prior biological information in variable selection, leading to improved feature selection and more interpretable principal component loadings and potentially providing insights on molecular underpinnings of complex diseases.
Affiliation(s)
- Ziyi Li
  - Department of Biostatistics and Bioinformatics, Emory University, 1518 Clifton Road, Atlanta, GA 30322, USA
- Sandra E. Safo
  - Department of Biostatistics and Bioinformatics, Emory University, 1518 Clifton Road, Atlanta, GA 30322, USA
- Qi Long
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, PA 19104, USA
|
25
|
Jackson SL, Safo SE, Staimez LR, Olson DE, Narayan KMV, Long Q, Lipscomb J, Rhee MK, Wilson PWF, Tomolo AM, Phillips LS. Glucose challenge test screening for prediabetes and early diabetes. Diabet Med 2017; 34:716-724. [PMID: 27727467 PMCID: PMC5388592 DOI: 10.1111/dme.13270] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 02/17/2016] [Revised: 08/15/2016] [Accepted: 10/06/2016] [Indexed: 12/29/2022]
Abstract
AIMS To test the hypothesis that a 50-g oral glucose challenge test with 1-h glucose measurement would have superior performance compared with other opportunistic screening methods. METHODS In this prospective study in a Veterans Health Administration primary care clinic, the following test performances, measured by area under receiver-operating characteristic curves, were compared: 50-g oral glucose challenge test; random glucose; and HbA1c level, using a 75-g oral glucose tolerance test as the 'gold standard'. RESULTS The study population comprised 1535 people (mean age 56 years, BMI 30.3 kg/m², 94% men, 74% black). By oral glucose tolerance test criteria, diabetes was present in 10% and high-risk prediabetes was present in 22% of participants. The plasma glucose challenge test provided area under receiver-operating characteristic curves of 0.85 (95% CI 0.78-0.91) to detect diabetes and 0.76 (95% CI 0.72-0.80) to detect high-risk dysglycaemia (diabetes or high-risk prediabetes), while area under receiver-operating characteristic curves for the capillary glucose challenge test were 0.82 (95% CI 0.75-0.89) and 0.73 (95% CI 0.69-0.77) for diabetes and high-risk dysglycaemia, respectively. Random glucose performed less well [plasma: 0.76 (95% CI 0.69-0.82) and 0.66 (95% CI 0.62-0.71), respectively; capillary: 0.72 (95% CI 0.65-0.80) and 0.64 (95% CI 0.59-0.68), respectively], and HbA1c performed even less well [0.67 (95% CI 0.57-0.76) and 0.63 (95% CI 0.58-0.68), respectively]. The cost of identifying one case of high-risk dysglycaemia with a plasma glucose challenge test would be $42 from a Veterans Health Administration perspective, and $55 from a US Medicare perspective. CONCLUSIONS Glucose challenge test screening, followed, if abnormal, by an oral glucose tolerance test, would be convenient and more accurate than other opportunistic tests. Use of glucose challenge test screening could improve management by permitting earlier therapy.
Affiliation(s)
- S L Jackson
  - Atlanta VA Medical Center, Decatur, GA, USA
  - Nutrition and Health Sciences, Graduate Division of Biological and Biomedical Sciences, Emory University, Atlanta, GA, USA
- S E Safo
  - Atlanta VA Medical Center, Decatur, GA, USA
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- L R Staimez
  - Department of Global Health, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- D E Olson
  - Atlanta VA Medical Center, Decatur, GA, USA
  - Division of Endocrinology and Metabolism, Department of Medicine, Emory University School of Medicine, Atlanta, GA, USA
- K M V Narayan
  - Department of Global Health, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- Q Long
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- J Lipscomb
  - Department of Health Policy and Management, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- M K Rhee
  - Atlanta VA Medical Center, Decatur, GA, USA
  - Division of Endocrinology and Metabolism, Department of Medicine, Emory University School of Medicine, Atlanta, GA, USA
- A M Tomolo
  - Atlanta VA Medical Center, Decatur, GA, USA
  - Division of General Medicine, Department of Medicine, Emory University School of Medicine, Atlanta, GA, USA
- L S Phillips
  - Atlanta VA Medical Center, Decatur, GA, USA
  - Division of Endocrinology and Metabolism, Department of Medicine, Emory University School of Medicine, Atlanta, GA, USA
|