1
Chan HC, Chattopadhyay A, Lu TP. Cross-population enhancement of PrediXcan predictions with a gnomAD-based east Asian reference framework. Brief Bioinform 2024; 25:bbae549. [PMID: 39441246 PMCID: PMC11497844 DOI: 10.1093/bib/bbae549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2024] [Revised: 09/02/2024] [Accepted: 10/11/2024] [Indexed: 10/25/2024] Open
Abstract
Over the past decade, genome-wide association studies have identified thousands of variants significantly associated with complex traits. For each locus, gene expression levels are needed to further explore its biological functions. To address this, the PrediXcan algorithm leverages large-scale reference data to impute the gene expression level from single nucleotide polymorphisms, and thus the gene-trait associations can be tested to identify the candidate causal genes. However, a challenge arises because most reference data are from subjects of European ancestry, and the accuracy and robustness of predicted gene expression in subjects of East Asian (EAS) ancestry remain unclear. Here, we first simulated a variety of scenarios to explore the impact of the level of population diversity on gene expression. Population-differentiated variants were estimated using the allele frequency information from the Genome Aggregation Database (gnomAD). We found that the weights of the variants were the main factor that affected the gene expression predictions, and that ~70% of variants were significantly population differentiated based on proportion tests. To provide insights into this population effect on gene expression levels, we utilized the allele frequency information to develop a gene expression reference panel, Predict Asian-Population (PredictAP), for EAS ancestry. PredictAP can be viewed as an auxiliary tool for PrediXcan when using genotype data from EAS subjects.
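The abstract above reports that population-differentiated variants were flagged with proportion tests on allele frequencies. The abstract does not say which proportion test was used, so the following is only a minimal sketch of one common choice, a pooled two-proportion z-test. The function name and the allele counts are hypothetical and are not taken from the paper or from gnomAD.

```python
import math


def two_proportion_z_test(alt_a, total_a, alt_b, total_b):
    """Two-sided pooled z-test for equality of two allele frequencies.

    alt_* are alternate-allele counts and total_* are total allele counts
    in populations A and B (hypothetical inputs for illustration).
    Returns (z statistic, two-sided p-value).
    """
    p_a = alt_a / total_a
    p_b = alt_b / total_b
    # Pooled frequency under the null hypothesis of no differentiation.
    pooled = (alt_a + alt_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        # Allele absent or fixed in both populations: nothing to test.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 300/1000 alternate alleles in one population,
# 100/1000 in the other.
z, p = two_proportion_z_test(300, 1000, 100, 1000)
print(f"z = {z:.2f}, p = {p:.3g}")
```

In a genome-wide setting the resulting p-values would normally be corrected for multiple testing before calling a variant differentiated; the abstract does not state which correction, if any, was applied.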
Affiliation(s)
- Han-Ching Chan
  - Institute of Epidemiology and Preventive Medicine, Department of Public Health, National Taiwan University, Room 518, No. 17, Xu-Zhou Road, Taipei 10055, Taiwan
- Amrita Chattopadhyay
  - Institute of Epidemiology and Preventive Medicine, Department of Public Health, National Taiwan University, Room 518, No. 17, Xu-Zhou Road, Taipei 10055, Taiwan
- Tzu-Pin Lu
  - Institute of Epidemiology and Preventive Medicine, Department of Public Health, National Taiwan University, Room 518, No. 17, Xu-Zhou Road, Taipei 10055, Taiwan
  - Institute of Health Data Analytics and Statistics, Department of Public Health, National Taiwan University, Room 518, No. 17, Xu-Zhou Road, Taipei 10055, Taiwan
2
Wu Q, Wang C, Chen Y. Heterogeneous latent transfer learning in Gaussian graphical models. Biometrics 2024; 80:ujae096. [PMID: 39302138 PMCID: PMC11413907 DOI: 10.1093/biomtc/ujae096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Revised: 07/04/2024] [Accepted: 08/23/2024] [Indexed: 09/26/2024]
Abstract
Gaussian graphical models (GGMs) are useful for understanding the complex relationships between biological entities. Transfer learning can improve the estimation of GGMs in a target dataset by incorporating relevant information from related source studies. However, biomedical research often involves intrinsic and latent heterogeneity within a study, such as heterogeneous subpopulations. This heterogeneity can make it difficult to identify informative source studies or lead to negative transfer if the source study is improperly used. To address this challenge, we developed a heterogeneous latent transfer learning (Latent-TL) approach that accounts for both within-sample and between-sample heterogeneity. The idea behind this approach is to "learn from the alike" by leveraging the similarities between source and target GGMs within each subpopulation. The Latent-TL algorithm simultaneously identifies common subpopulation structures among samples and facilitates the learning of target GGMs using source samples from the same subpopulation. Through extensive simulations and real data application, we have shown that the proposed method outperforms single-site learning and standard transfer learning that ignores the latent structures. We have also demonstrated the applicability of the proposed algorithm in characterizing gene co-expression networks in breast cancer patients, where the inferred genetic networks identified many biologically meaningful gene-gene interactions.
Affiliation(s)
- Qiong Wu
  - Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, 19104, United States
  - The Center for Health AI and Synthesis of Evidence (CHASE), The University of Pennsylvania, Philadelphia, PA, 19104, United States
  - Department of Biostatistics, The University of Pittsburgh, Pittsburgh, PA, 15261, United States
- Chi Wang
  - Division of Cancer Biostatistics, Department of Internal Medicine, College of Medicine and Department of Statistics, The University of Kentucky, Lexington, KY, 40536, United States
- Yong Chen
  - Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, 19104, United States
  - The Center for Health AI and Synthesis of Evidence (CHASE), The University of Pennsylvania, Philadelphia, PA, 19104, United States
3
Zhao B, Yang X, Zhu H. Estimating trans-ancestry genetic correlation with unbalanced data resources. J Am Stat Assoc 2024; 119:839-850. [PMID: 39219674 PMCID: PMC11364214 DOI: 10.1080/01621459.2024.2344703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2022] [Accepted: 04/07/2024] [Indexed: 09/04/2024]
Abstract
The aim of this paper is to propose a novel method for estimating trans-ancestry genetic correlations in genome-wide association studies (GWAS) using genetically-predicted observations. These correlations describe how genetic architecture of complex traits varies among populations. Our new estimator corrects for biases arising from prediction errors in high-dimensional weak GWAS signals, while addressing the ethnic diversity inherent in GWAS data, such as linkage disequilibrium (LD) differences. A distinguishing feature of our approach is its flexibility regarding sample sizes: it necessitates a large GWAS sample only from one population, while the secondary population may have a much smaller cohort, even in the hundreds. This design directly addresses the existing imbalance in GWAS data resources, where datasets for European populations typically outnumber those of non-European ancestries. Through extensive simulations and real data analysis from the UK Biobank study encompassing 26 complex traits, we validate the reliability of our method. Our results illuminate the broader implications of transferring genetic findings across diverse populations.
Affiliation(s)
- Bingxin Zhao
  - Department of Statistics and Data Science, University of Pennsylvania
- Hongtu Zhu
  - Department of Biostatistics, University of North Carolina at Chapel Hill
4
Gao D, Wang Y, Zeng D. Fusing Individualized Treatment Rules Using Secondary Outcomes. Proceedings of Machine Learning Research 2024; 238:712-720. [PMID: 39371406 PMCID: PMC11450767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/08/2024]
Abstract
An individualized treatment rule (ITR) is a decision rule that recommends treatments for patients based on their individual feature variables. In many practices, the ideal ITR for the primary outcome is also expected to cause minimal harm to other secondary outcomes. Therefore, our objective is to learn an ITR that not only maximizes the value function for the primary outcome, but also approximates the optimal rule for the secondary outcomes as closely as possible. To achieve this goal, we introduce a fusion penalty to encourage the ITRs based on different outcomes to yield similar recommendations. Two algorithms are proposed to estimate the ITR using surrogate loss functions. We prove that the agreement rate between the estimated ITR of the primary outcome and the optimal ITRs of the secondary outcomes converges to the true agreement rate faster than if the secondary outcomes are not taken into consideration. Furthermore, we derive the non-asymptotic properties of the value function and misclassification rate for the proposed method. Finally, simulation studies and a real data example are used to demonstrate the finite-sample performance of the proposed method.
5
Lac L, Leung CK, Hu P. Computational frameworks integrating deep learning and statistical models in mining multimodal omics data. J Biomed Inform 2024; 152:104629. [PMID: 38552994 DOI: 10.1016/j.jbi.2024.104629] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Revised: 02/26/2024] [Accepted: 03/25/2024] [Indexed: 04/04/2024]
Abstract
BACKGROUND In health research, multimodal omics data analysis is widely used to address important clinical and biological questions. Traditional statistical methods rely on the strong assumptions of distribution. Statistical methods such as testing and differential expression are commonly used in omics analysis. Deep learning, on the other hand, is an advanced computer science technique that is powerful in mining high-dimensional omics data for prediction tasks. Recently, integrative frameworks or methods have been developed for omics studies that combine statistical models and deep learning algorithms. METHODS AND RESULTS The aim of these integrative frameworks is to combine the strengths of both statistical methods and deep learning algorithms to improve prediction accuracy while also providing interpretability and explainability. This review report discusses the current state-of-the-art integrative frameworks, their limitations, and potential future directions in survival and time-to-event longitudinal analysis, dimension reduction and clustering, regression and classification, feature selection, and causal and transfer learning.
Affiliation(s)
- Leann Lac
  - Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada; Department of Statistics, University of Manitoba, Winnipeg, Manitoba, Canada
- Carson K Leung
  - Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada
- Pingzhao Hu
  - Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada; Department of Biochemistry, Western University, London, Ontario, Canada; Department of Computer Science, Western University, London, Ontario, Canada; Department of Oncology, Western University, London, Ontario, Canada; Department of Epidemiology and Biostatistics, Western University, London, Ontario, Canada; The Children's Health Research Institute, Lawson Health Research Institute, London, Ontario, Canada
6
Zhang S, Jiang Z, Zeng P. Incorporating genetic similarity of auxiliary samples into eGene identification under the transfer learning framework. J Transl Med 2024; 22:258. [PMID: 38461317 PMCID: PMC10924384 DOI: 10.1186/s12967-024-05053-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2023] [Accepted: 03/01/2024] [Indexed: 03/11/2024] Open
Abstract
BACKGROUND The term eGene has been applied to define a gene whose expression level is affected by at least one independent expression quantitative trait locus (eQTL). It is both theoretically and empirically important to identify eQTLs and eGenes in genomic studies. However, standard eGene detection methods generally focus on individual cis-variants and cannot efficiently leverage useful knowledge acquired from auxiliary samples into target studies. METHODS We propose a multilocus-based eGene identification method called TLegene by integrating shared genetic similarity information available from auxiliary studies under the statistical framework of transfer learning. We apply TLegene to eGene identification in ten TCGA cancers which have an explicit relevant tissue in the GTEx project, and learn the genetic effects of variants in TCGA from GTEx. We also apply TLegene to the Geuvadis project to evaluate its usefulness in non-cancer studies. RESULTS We observed substantial genetic effect correlation of cis-variants between TCGA and GTEx for a large number of genes. Furthermore, consistent with the results of our simulations, we found that TLegene was more powerful than existing methods and thus identified 169 distinct candidate eGenes, many more than were identified by the approach that did not consider knowledge transfer across target and auxiliary studies. Previous studies and functional enrichment analyses provided empirical evidence supporting the associations of discovered eGenes, and also showed evidence of allelic heterogeneity of gene expression. Furthermore, TLegene identified more eGenes in Geuvadis and revealed that these eGenes were mainly enriched in EBV-transformed lymphocytes. CONCLUSION Overall, TLegene represents a flexible and powerful statistical method for eGene identification through transfer learning of genetic similarity shared across auxiliary and target studies.
Affiliation(s)
- Shuo Zhang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Zhou Jiang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Ping Zeng
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Center for Medical Statistics and Data Analysis, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Key Laboratory of Human Genetics and Environmental Medicine, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Key Laboratory of Environment and Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Xuzhou Engineering Research Innovation Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Jiangsu Engineering Research Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
7
Carney M, Pelaia TM, Chew T, Teoh S, Phu A, Kim K, Wang Y, Iredell J, Zerbib Y, McLean A, Schughart K, Tang B, Shojaei M, Short KR. Host transcriptomics and machine learning for secondary bacterial infections in patients with COVID-19: a prospective, observational cohort study. The Lancet Microbe 2024; 5:e272-e281. [PMID: 38310908 DOI: 10.1016/s2666-5247(23)00363-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Revised: 10/27/2023] [Accepted: 10/27/2023] [Indexed: 02/06/2024]
Abstract
BACKGROUND Viral respiratory tract infections are frequently complicated by secondary bacterial infections. This study aimed to use machine learning to predict the risk of bacterial superinfection in SARS-CoV-2-positive individuals. METHODS In this prospective, multicentre, observational cohort study done in nine centres in six countries (Australia, Indonesia, Singapore, Italy, Czechia, and France) blood samples and RNA sequencing were used to develop a robust model of predicting secondary bacterial infections in the respiratory tract of patients with COVID-19. Eligible participants were older than 18 years, had known or suspected COVID-19, and symptoms of a recent respiratory infection. A control cohort of participants without COVID-19 who were older than 18 years and with no infection symptoms was also recruited from one Australian centre. In the pre-analysis phase, data were filtered to include only individuals with complete blood transcriptomics and patient data (ie, age, sex, location, and WHO severity score at the time of sample collection). The dataset was then divided randomly (4:1) into a training set (80%) and a test set (20%). Gene expression data in the training set and control cohort were used for differential expression analysis. Differentially expressed genes, along with WHO severity score, location, age, and sex, were used for feature selection with least absolute shrinkage and selection operator (LASSO) in the training set. For LASSO analysis, samples were excluded if gene expression data were not obtained at study admission, no longitudinal clinical information was available, a bacterial infection at the time of study admission was present, or a fungal infection in the absence of a bacterial infection was detected. LASSO regression was performed using three subsets of predictor variables: patient data alone, gene expression data alone, or a combination of patient data and gene expression data. 
The accuracy of the resultant models was tested on data from the test set. FINDINGS Between March, 2020, and October, 2021, we recruited 536 SARS-CoV-2-positive individuals and between June, 2013, and January, 2020, we recruited 74 participants into the control cohort. After prefiltering analysis and other exclusions, samples from 158 individuals were analysed in the training set and 47 in the test set. The expression of seven host genes (DAPP1, CST3, FGL2, GCH1, CIITA, UPP1, and RN7SL1) in the blood at the time of study admission was identified by LASSO as predictive of the risk of developing a secondary bacterial infection of the respiratory tract more than 24 h after study admission. Specifically, the expression of these genes in combination with a patient's WHO severity score at the time of study enrolment resulted in an area under the curve of 0·98 (95% CI 0·89-1·00), a true positive rate (sensitivity) of 1·00 (95% CI 1·00-1·00), and a true negative rate (specificity) of 0·94 (95% CI 0·89-1·00) in the test cohort. The combination of patient data and host transcriptomics at hospital admission identified all seven individuals in the training and test sets who developed a bacterial infection of the respiratory tract 5-9 days after hospital admission. INTERPRETATION These data raise the possibility that host transcriptomics at the time of clinical presentation, together with machine learning, can forward predict the risk of secondary bacterial infections and allow for the more targeted use of antibiotics in viral infection. FUNDING Snow Medical Research Foundation, the National Health and Medical Research Council, the Jack Ma Foundation, the Helmholtz-Association, the A2 Milk Company, National Institute of Allergy and Infectious Disease, and the Fondazione AIRC Associazione Italiana per la Ricerca contro il Cancro.
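The findings above are summarised with an area under the curve, a sensitivity, and a specificity on the held-out test set. As a hedged illustration of how those three quantities are defined, the sketch below computes them from predicted scores in plain Python. The labels, scores, and the 0.5 threshold are hypothetical and are not data or settings from the study.

```python
def sensitivity_specificity(labels, predictions):
    """Return (true positive rate, true negative rate) for 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Area under the ROC curve as the probability that a random positive
    case is scored above a random negative case (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels (1 = developed a secondary infection)
# and hypothetical model scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

print(auc(labels, scores))
print(sensitivity_specificity(labels, predictions))
```

The toy data show how a model can rank every positive case above every negative case, and so reach the maximum AUC, while still producing a false positive at a fixed threshold, which lowers specificity but not sensitivity.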
Affiliation(s)
- Meagan Carney
  - School of Mathematics and Physics, University of Queensland, Brisbane, QLD, Australia
- Tiana Maria Pelaia
  - Department of Intensive Care Medicine, Nepean Hospital, Sydney, NSW, Australia
- Tracy Chew
  - Sydney Informatics Hub, Core Research Facilities, University of Sydney, Sydney, NSW, Australia
- Sally Teoh
  - Department of Intensive Care Medicine, Nepean Hospital, Sydney, NSW, Australia
- Amy Phu
  - Faculty of Medicine and Health, Sydney Medical School Westmead, Westmead Hospital, University of Sydney, Sydney, NSW, Australia
- Karan Kim
  - Centre for Immunology and Allergy Research, Westmead Institute for Medical Research, Sydney, NSW, Australia
- Ya Wang
  - Department of Intensive Care Medicine, Nepean Hospital, Sydney, NSW, Australia; The University of Sydney Nepean Clinical School, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, Australia; Centre for Immunology and Allergy Research, Westmead Institute for Medical Research, Sydney, NSW, Australia
- Jonathan Iredell
  - Faculty of Medicine and Health, School of Medical Sciences, University of Sydney, Sydney, NSW, Australia; Sydney Institute for Infectious Disease, University of Sydney, Sydney, NSW, Australia; Centre for Infectious Diseases and Microbiology, Westmead Institute for Medical Research, Sydney, NSW, Australia; Westmead Hospital, Western Sydney Local Health District, Westmead, NSW, Australia
- Yoann Zerbib
  - Intensive Care Department, Amiens University Hospital, Amiens, France
- Anthony McLean
  - Department of Intensive Care Medicine, Nepean Hospital, Sydney, NSW, Australia; The University of Sydney Nepean Clinical School, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, Australia
- Klaus Schughart
  - Department of Microbiology, Immunology and Biochemistry, University of Tennessee Health Science Center, Memphis, TN, USA; Institute of Virology Münster, University of Münster, Münster, Germany
- Benjamin Tang
  - Department of Intensive Care Medicine, Nepean Hospital, Sydney, NSW, Australia; Centre for Immunology and Allergy Research, Westmead Institute for Medical Research, Sydney, NSW, Australia
- Maryam Shojaei
  - Department of Intensive Care Medicine, Nepean Hospital, Sydney, NSW, Australia; The University of Sydney Nepean Clinical School, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, Australia; Centre for Immunology and Allergy Research, Westmead Institute for Medical Research, Sydney, NSW, Australia
- Kirsty R Short
  - School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
8
Li S, Cai T, Duan R. Targeting underrepresented populations in precision medicine: a federated transfer learning approach. Ann Appl Stat 2023; 17:2970-2992. [PMID: 39314265 PMCID: PMC11417462 DOI: 10.1214/23-aoas1747] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/25/2024]
Abstract
The limited representation of minorities and disadvantaged populations in large-scale clinical and genomics research poses a significant barrier to translating precision medicine research into practice. Prediction models are likely to underperform in underrepresented populations due to heterogeneity across populations, thereby exacerbating known health disparities. To address this issue, we propose FETA, a two-way data integration method that leverages a federated transfer learning approach to integrate heterogeneous data from diverse populations and multiple healthcare institutions, with a focus on a target population of interest having limited sample sizes. We show that FETA achieves performance comparable to the pooled analysis, where individual-level data is shared across institutions, with only a small number of communications across participating sites. Our theoretical analysis and simulation study demonstrate how FETA's estimation accuracy is influenced by communication budgets, privacy restrictions, and heterogeneity across populations. We apply FETA to multisite data from the electronic Medical Records and Genomics (eMERGE) Network to construct genetic risk prediction models for extreme obesity. Compared to models trained using target data only, source data only, and all data without accounting for population-level differences, FETA shows superior predictive performance. FETA has the potential to improve estimation and prediction accuracy in underrepresented populations and reduce the gap in model performance across populations.
Affiliation(s)
- Sai Li
  - Institute of Statistics and Big Data, Renmin University of China
- Tianxi Cai
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health
- Rui Duan
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health
9
Lu H, Zhang S, Jiang Z, Zeng P. Leveraging trans-ethnic genetic risk scores to improve association power for complex traits in underrepresented populations. Brief Bioinform 2023:bbad232. [PMID: 37332016 DOI: 10.1093/bib/bbad232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2022] [Revised: 05/06/2023] [Accepted: 06/04/2023] [Indexed: 06/20/2023] Open
Abstract
Trans-ethnic genome-wide association studies have revealed that many loci identified in European populations can be reproducible in non-European populations, indicating widespread trans-ethnic genetic similarity. However, how to leverage such shared information more efficiently in association analysis is less investigated for traits in underrepresented populations. We here propose a statistical framework, trans-ethnic genetic risk score informed gene-based association mixed model (GAMM), by hierarchically modeling single-nucleotide polymorphism effects in the target population as a function of effects of the same trait in well-studied populations. GAMM powerfully integrates genetic similarity across distinct ancestral groups to enhance power in understudied populations, as confirmed by extensive simulations. We illustrate the usefulness of GAMM via the application to 13 blood cell traits (i.e. basophil count, eosinophil count, hematocrit, hemoglobin concentration, lymphocyte count, mean corpuscular hemoglobin, mean corpuscular hemoglobin concentration, mean corpuscular volume, monocyte count, neutrophil count, platelet count, red blood cell count and total white blood cell count) in Africans of the UK Biobank (n = 3204) while utilizing genetic overlap shared in Europeans (n = 746 667) and East Asians (n = 162 255). We discovered multiple new associated genes, which had otherwise been missed by existing methods, and revealed that the trans-ethnic information indirectly contributed much to the phenotypic variance. Overall, GAMM represents a flexible and powerful statistical framework of association analysis for complex traits in underrepresented populations by integrating trans-ethnic genetic similarity across well-studied populations, and helps attenuate health inequities in current genetics research for people of minority populations.
Affiliation(s)
- Haojie Lu
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
- Shuo Zhang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
- Zhou Jiang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
- Ping Zeng
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
  - Center for Medical Statistics and Data Analysis, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
  - Key Laboratory of Human Genetics and Environmental Medicine, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
  - Key Laboratory of Environment and Health, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
  - Engineering Research Innovation Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
10
Li S, Zhang L, Tony Cai T, Li H. Estimation and Inference for High-Dimensional Generalized Linear Models with Knowledge Transfer. J Am Stat Assoc 2023; 119:1274-1285. [PMID: 38948492 PMCID: PMC11213555 DOI: 10.1080/01621459.2023.2184373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2021] [Accepted: 02/15/2023] [Indexed: 03/06/2023]
Abstract
Transfer learning provides a powerful tool for incorporating data from related studies into a target study of interest. In epidemiology and medical studies, the classification of a target disease could borrow information across other related diseases and populations. In this work, we consider transfer learning for high-dimensional generalized linear models (GLMs). A novel algorithm, TransHDGLM, that integrates data from the target study and the source studies is proposed. Minimax rate of convergence for estimation is established and the proposed estimator is shown to be rate-optimal. Statistical inference for the target regression coefficients is also studied. Asymptotic normality for a debiased estimator is established, which can be used for constructing coordinate-wise confidence intervals of the regression coefficients. Numerical studies show significant improvement in estimation and inference accuracy over GLMs that only use the target data. The proposed methods are applied to a real data study concerning the classification of colorectal cancer using gut microbiomes, and are shown to enhance the classification accuracy in comparison to methods that only use the target data.
Affiliation(s)
- Sai Li
  - Institute of Statistics and Big Data, Renmin University of China, China
- Linjun Zhang
  - Department of Statistics, Rutgers University, New Brunswick, NJ 08854
- T Tony Cai
  - Department of Statistics, the Wharton School, University of Pennsylvania, Philadelphia, PA 19104
- Hongzhe Li
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104
11
Chen S, Zheng Q, Long Q, Su WJ. Minimax Estimation for Personalized Federated Learning: An Alternative between FedAvg and Local Training? Journal of Machine Learning Research 2023; 24:262. [PMID: 39105110 PMCID: PMC11299893] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/07/2024]
Abstract
A widely recognized difficulty in federated learning arises from the statistical heterogeneity among clients: local datasets often originate from distinct yet not entirely unrelated probability distributions, and personalization is, therefore, necessary to achieve optimal results from each individual's perspective. In this paper, we show how the excess risks of personalized federated learning using a smooth, strongly convex loss depend on data heterogeneity from a minimax point of view, with a focus on the FedAvg algorithm (McMahan et al., 2017) and pure local training (i.e., clients solve empirical risk minimization problems on their local datasets without any communication). Our main result reveals an approximate alternative between these two baseline algorithms for federated learning: the former algorithm is minimax rate optimal over a collection of instances when data heterogeneity is small, whereas the latter is minimax rate optimal when data heterogeneity is large, and the threshold is sharp up to a constant. As an implication, our results show that from a worst-case point of view, a dichotomous strategy that makes a choice between the two baseline algorithms is rate-optimal. Another implication is that the popular strategy of FedAvg followed by local fine-tuning is also minimax optimal under additional regularity conditions. Our analysis relies on a new notion of algorithmic stability that takes into account the nature of federated learning.
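The trade-off described above can be illustrated with a toy mean-estimation problem, which is far simpler than the smooth, strongly convex setting the paper analyses. The sketch assumes m clients, each holding n observations with variance sigma2 around its own mean, and uses the pooled average of all observations as a stand-in for FedAvg; the function names and numbers are hypothetical. Under these assumptions the average squared error is sigma2/n for local training and sigma2/(m*n) plus the variance of the client means for the pooled estimator.

```python
def local_risk(sigma2, n):
    """Expected squared error when each client averages only its own n points."""
    return sigma2 / n


def pooled_risk(sigma2, n, client_means):
    """Expected squared error, averaged over clients, when every client uses
    the average of all m * n points (a toy stand-in for FedAvg)."""
    m = len(client_means)
    center = sum(client_means) / m
    # Heterogeneity: spread of the true client means around their centre.
    heterogeneity = sum((t - center) ** 2 for t in client_means) / m
    return sigma2 / (m * n) + heterogeneity


sigma2, n = 1.0, 10
for means in ([0.0, 0.0, 0.0, 0.0], [-1.0, 1.0, -1.0, 1.0]):
    print(means, "local:", local_risk(sigma2, n),
          "pooled:", pooled_risk(sigma2, n, means))
```

With identical client means, pooling wins because it averages more data; with widely spread client means, the heterogeneity term dominates and local training wins. This mirrors the dichotomy in the abstract only qualitatively and says nothing about the paper's rates or constants.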
12
He Y, Li Q, Hu Q, Liu L. Transfer learning in high-dimensional semiparametric graphical models with application to brain connectivity analysis. Stat Med 2022; 41:4112-4129. [PMID: 35728799 PMCID: PMC9497459 DOI: 10.1002/sim.9499] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2022] [Revised: 04/25/2022] [Accepted: 05/30/2022] [Indexed: 11/11/2022]
Abstract
Transfer learning has drawn growing attention as a way to improve the statistical efficiency of one study (dataset) by drawing on information from similar and related auxiliary studies (datasets). In this article, we consider the transfer learning problem of estimating an undirected semiparametric graphical model. We propose an algorithm called Trans-Copula-CLIME for estimating an undirected graphical model while uncovering information from similar auxiliary studies, characterizing the similarity between the target graph and each auxiliary graph by the sparsity of a divergence matrix. The proposed method relaxes the restrictive Gaussian distribution assumption, which deviates from reality for the fMRI dataset related to attention deficit hyperactivity disorder (ADHD) considered here. Nonparametric rank-based correlation coefficient estimators are utilized in the Trans-Copula-CLIME procedure to achieve robustness against departures from normality. We establish the convergence rate of the Trans-Copula-CLIME estimator under some mild conditions, which demonstrates that if the similarity between the auxiliary studies and the target study is sufficiently high and the number of informative auxiliary samples is sufficiently large, the Trans-Copula-CLIME estimator shows a great advantage over existing non-transfer-learning estimators. Simulation studies also show that the Trans-Copula-CLIME estimator performs better, especially when the data are not from a Gaussian distribution. Finally, the proposed method is applied to infer functional brain connectivity patterns for ADHD patients at the target Beijing site by leveraging the fMRI datasets from some other sites.
Affiliation(s)
- Yong He
- Institute for Financial Studies, Shandong University, Jinan, Shandong, China
- Qiushi Li
- Institute for Financial Studies, Shandong University, Jinan, Shandong, China
- Qinqin Hu
- School of Mathematics and Statistics, Shandong University at Weihai, Weihai, Shandong, China
- Lei Liu
- Division of Biostatistics, Washington University in St. Louis, St. Louis, USA
|
13
|
Tian P, Chan TH, Wang YF, Yang W, Yin G, Zhang YD. Multiethnic polygenic risk prediction in diverse populations through transfer learning. Front Genet 2022; 13:906965. [PMID: 36061179 PMCID: PMC9438789 DOI: 10.3389/fgene.2022.906965] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/29/2022] [Accepted: 06/27/2022] [Indexed: 11/28/2022] Open
Abstract
Polygenic risk scores (PRS) leverage the genetic contribution of an individual’s genotype to a complex trait by estimating disease risk. Traditional PRS prediction methods are predominantly designed for the European population. The accuracy of PRS prediction in non-European populations is diminished due to the much smaller sample sizes of genome-wide association studies (GWAS). In this article, we introduce a novel method to construct PRS for non-European populations, abbreviated as TL-Multi, which uses a transfer learning framework to learn useful knowledge from the European population and correct the bias for non-European populations. We consider non-European GWAS data as the target data and European GWAS data as the informative auxiliary data. TL-Multi borrows useful information from the auxiliary data to improve the learning accuracy of the target data while preserving efficiency and accuracy. To demonstrate the practical applicability of the proposed method, we applied TL-Multi to predict the risk of systemic lupus erythematosus (SLE) in the Asian population and the risk of asthma in the Indian population by borrowing information from the European population. TL-Multi achieved better prediction accuracy than the competing methods, including Lassosum and meta-analysis, in both simulations and real applications.
Affiliation(s)
- Peixin Tian
- Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China
- Tsai Hor Chan
- Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China
- Yong-Fei Wang
- Department of Paediatrics and Adolescent Medicine, The University of Hong Kong, Hong Kong SAR, China
- Wanling Yang
- Department of Paediatrics and Adolescent Medicine, The University of Hong Kong, Hong Kong SAR, China
- Guosheng Yin
- Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China
- Yan Dora Zhang
- Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China
- Centre for PanorOmic Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- *Correspondence: Yan Dora Zhang,
|
14
|
Tian Y, Feng Y. Transfer Learning under High-dimensional Generalized Linear Models. J Am Stat Assoc 2022; 118:2684-2697. [PMID: 38562655 PMCID: PMC10982637 DOI: 10.1080/01621459.2022.2071278] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/29/2021] [Accepted: 04/20/2022] [Indexed: 10/18/2022]
Abstract
In this work, we study the transfer learning problem under high-dimensional generalized linear models (GLMs), which aims to improve the fit on target data by borrowing information from useful source data. Given which sources to transfer, we propose a transfer learning algorithm on GLM, and derive its ℓ1/ℓ2-estimation error bounds as well as a bound for a prediction error measure. The theoretical analysis shows that when the target and source are sufficiently close to each other, these bounds could be improved over those of the classical penalized estimator using only target data under mild conditions. When we don't know which sources to transfer, an algorithm-free transferable source detection approach is introduced to detect informative sources. The detection consistency is proved under the high-dimensional GLM transfer learning setting. We also propose an algorithm to construct confidence intervals of each coefficient component, and the corresponding theories are provided. Extensive simulations and a real-data experiment verify the effectiveness of our algorithms. We implement the proposed GLM transfer learning algorithms in a new R package glmtrans, which is available on CRAN.
Affiliation(s)
- Ye Tian
- Department of Statistics, Columbia University
- Yang Feng
- Department of Biostatistics, School of Global Public Health, New York University
|
15
|
Gao Y, Cui Y. Clinical time-to-event prediction enhanced by incorporating compatible related outcomes. PLOS Digit Health 2022; 1:e0000038. [PMID: 35757279 PMCID: PMC9222982 DOI: 10.1371/journal.pdig.0000038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2022] [Accepted: 04/05/2022] [Indexed: 06/15/2023]
Abstract
Accurate time-to-event (TTE) prediction of clinical outcomes from personal biomedical data is essential for precision medicine. It has become increasingly common that clinical datasets contain information for multiple related patient outcomes from comorbid diseases or multifaceted endpoints of a single disease. Various TTE models have been developed to handle competing risks that are related to mutually exclusive events. However, clinical outcomes are often non-competing and can occur at the same time or sequentially. Here we develop TTE prediction models with the capacity to incorporate compatible related clinical outcomes. We test our method on real and synthetic data and find that the incorporation of related auxiliary clinical outcomes can: 1) significantly improve the TTE prediction performance of the conventional Cox model while maintaining its interpretability; 2) further improve the performance of state-of-the-art deep-learning-based models. While the auxiliary outcomes are utilized for model training, model deployment is not limited by the availability of the auxiliary outcome data, because the auxiliary outcome information is not required for the prediction of the primary outcome once the model is trained.
Affiliation(s)
- Yan Gao
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America
- Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America
- Yan Cui
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America
- Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America
- Center for Cancer Research, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America
|
16
|
Li S, Cai TT, Li H. Transfer Learning in Large-scale Gaussian Graphical Models with False Discovery Rate Control. J Am Stat Assoc 2022; 118:2171-2183. [PMID: 38143788 PMCID: PMC10746133 DOI: 10.1080/01621459.2022.2044333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2020] [Accepted: 02/09/2022] [Indexed: 10/19/2022]
Abstract
Transfer learning for high-dimensional Gaussian graphical models (GGMs) is studied. The target GGM is estimated by incorporating the data from similar and related auxiliary studies, where the similarity between the target graph and each auxiliary graph is characterized by the sparsity of a divergence matrix. An estimation algorithm, Trans-CLIME, is proposed and shown to attain a faster convergence rate than the minimax rate in the single-task setting. Furthermore, we introduce a universal debiasing method that can be coupled with a range of initial graph estimators and can be analytically computed in one step. A debiased Trans-CLIME estimator is then constructed and is shown to be element-wise asymptotically normal. This fact is used to construct a multiple testing procedure for edge detection with false discovery rate control. The proposed estimation and multiple testing procedures demonstrate superior numerical performance in simulations and are applied to infer the gene networks in a target brain tissue by leveraging the gene expressions from multiple other brain tissues. A significant decrease in prediction errors and a significant increase in power for link detection are observed.
Affiliation(s)
- Sai Li
- Institute of Statistics and Big Data, Renmin University of China, China. Most of her work was done during her postdoc at Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania
- T Tony Cai
- Department of Statistics, the Wharton School, University of Pennsylvania, Philadelphia, PA 19104
- Hongzhe Li
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104
|