1
Li R, Romano JD, Chen Y, Moore JH. Centralized and Federated Models for the Analysis of Clinical Data. Annu Rev Biomed Data Sci 2024; 7:179-199. [PMID: 38723657 DOI: 10.1146/annurev-biodatasci-122220-115746] [Indexed: 08/25/2024]
Abstract
The progress of precision medicine research hinges on the gathering and analysis of extensive and diverse clinical datasets. With the continued expansion of modalities, scales, and sources of clinical datasets, it becomes imperative to devise methods for aggregating information from these varied sources to achieve a comprehensive understanding of diseases. In this review, we describe two important approaches for the analysis of diverse clinical datasets, namely the centralized model and federated model. We compare and contrast the strengths and weaknesses inherent in each model and present recent progress in methodologies and their associated challenges. Finally, we present an outlook on the opportunities that both models hold for the future analysis of clinical data.
Affiliation(s)
- Ruowang Li
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, California, USA
- Joseph D Romano
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Yong Chen
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Jason H Moore
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, California, USA
2
Camirand Lemyre F, Lévesque S, Domingue MP, Herrmann K, Ethier JF. Distributed Statistical Analyses: A Scoping Review and Examples of Operational Frameworks Adapted to Health Analytics. JMIR Med Inform 2024. [PMID: 39028684 DOI: 10.2196/53622] [Indexed: 07/21/2024] Open
Abstract
BACKGROUND Data from multiple organizations are crucial for advancing learning health systems. However, ethical, legal, and social concerns may restrict the use of standard statistical methods that rely on pooling data. Although distributed algorithms offer alternatives, they may not always be suitable for health frameworks. OBJECTIVE This paper aims to support researchers and data custodians in three ways: (1) providing a concise overview of the literature on statistical inference methods for horizontally partitioned data; (2) describing the methods applicable to generalized linear models (GLM) and assessing their underlying distributional assumptions; (3) adapting existing methods to make them fully usable in health settings. METHODS A scoping review methodology was employed for the literature mapping, from which methods presenting a methodological framework for GLM analyses with horizontally partitioned data were identified and assessed from the perspective of applicability in health settings. Statistical theory was used to adapt methods and to derive the properties of the resulting estimators. RESULTS From the review, 41 articles were selected, and six approaches were extracted for conducting standard GLM-based statistical analysis. However, these approaches assumed evenly and identically distributed data across nodes. Consequently, statistical procedures were derived to accommodate uneven node sample sizes and heterogeneous data distributions across nodes. Workflows and detailed algorithms were developed to highlight information-sharing requirements and operational complexity. CONCLUSIONS This paper contributes to the field of health analytics by providing an overview of the methods that can be used with horizontally partitioned data, by adapting these methods to the context of heterogeneous health data and by clarifying the workflows and quantities exchanged by the methods discussed. Further analysis of the confidentiality preserved by these methods is needed to fully understand the risk associated with the sharing of summary statistics.
Affiliation(s)
- Félix Camirand Lemyre
- GRIIS, Université de Sherbrooke, 2500, boul. de l'Université, Sherbrooke, CA
- Département de mathématiques, Faculté des sciences, Université de Sherbrooke, Sherbrooke, CA
- Simon Lévesque
- GRIIS, Université de Sherbrooke, 2500, boul. de l'Université, Sherbrooke, CA
- Département de mathématiques, Faculté des sciences, Université de Sherbrooke, Sherbrooke, CA
- Health Data Research Network Canada, Vancouver, CA
- Marie-Pier Domingue
- GRIIS, Université de Sherbrooke, 2500, boul. de l'Université, Sherbrooke, CA
- Chaire MEIE Québec - Le numérique au service des systèmes de santé apprenants, Université de Sherbrooke, Sherbrooke, CA
- Département de mathématiques, Faculté des sciences, Université de Sherbrooke, Sherbrooke, CA
- Klaus Herrmann
- Département de mathématiques, Faculté des sciences, Université de Sherbrooke, Sherbrooke, CA
- Jean-François Ethier
- GRIIS, Université de Sherbrooke, 2500, boul. de l'Université, Sherbrooke, CA
- Département de médecine, Faculté de médecine et des sciences de la santé, Université de Sherbrooke, Sherbrooke, CA
- Health Data Research Network Canada, Vancouver, CA
3
Tong J, Shen Y, Xu A, He X, Luo C, Edmondson M, Zhang D, Lu Y, Yan C, Li R, Siegel L, Sun L, Shenkman EA, Morton SC, Malin BA, Bian J, Asch DA, Chen Y. Evaluating site-of-care-related racial disparities in kidney graft failure using a novel federated learning framework. J Am Med Inform Assoc 2024; 31:1303-1312. [PMID: 38713006 PMCID: PMC11105132 DOI: 10.1093/jamia/ocae075] [Received: 05/05/2023] [Revised: 01/09/2024] [Accepted: 03/26/2024] [Indexed: 05/08/2024] Open
Abstract
OBJECTIVES Racial disparities in kidney transplant access and posttransplant outcomes exist between non-Hispanic Black (NHB) and non-Hispanic White (NHW) patients in the United States, with the site of care being a key contributor. When multi-site data are used to examine the effect of site of care on racial disparities, the key challenge is that regulations protecting patients' privacy restrict the sharing of patient-level data. MATERIALS AND METHODS We developed a federated learning framework, named dGEM-disparity (decentralized algorithm for Generalized linear mixed Effect Model for disparity quantification). Consisting of 2 modules, dGEM-disparity first provides accurately estimated common effects and calibrated hospital-specific effects by requiring only aggregated data from each center and then adopts a counterfactual modeling approach to assess whether the graft failure rates would differ if NHB patients had been admitted at transplant centers in the same distribution as NHW patients were admitted. RESULTS Utilizing United States Renal Data System data from 39 043 adult patients across 73 transplant centers over 10 years, we found that if NHB patients had followed the distribution of NHW patients in admissions, there would be, on average, 38 fewer deaths or graft failures per 10 000 NHB patients (95% CI, 35-40) within 1 year of receiving a kidney transplant. DISCUSSION The proposed framework facilitates efficient collaborations in clinical research networks. Additionally, the framework, by using counterfactual modeling to calculate the event rate, allows us to investigate contributions to racial disparities that may occur at the level of site of care. CONCLUSIONS Our framework is broadly applicable to other decentralized datasets and disparities research related to differential access to care. Ultimately, our proposed framework will advance equity in human health by identifying and addressing hospital-level racial disparities.
Affiliation(s)
- Jiayi Tong
- The Center for Health AI and Synthesis of Evidence (CHASE), Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Yishan Shen
- The Center for Health AI and Synthesis of Evidence (CHASE), Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Applied Mathematics and Computational Science, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Alice Xu
- The Center for Health AI and Synthesis of Evidence (CHASE), Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Washington University in St. Louis, St. Louis, MO 63130, United States
- Xing He
- Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32611, United States
- Chongliang Luo
- Division of Public Health Sciences, Department of Surgery, Washington University in St. Louis, St. Louis, MO 63110, United States
- Dazheng Zhang
- The Center for Health AI and Synthesis of Evidence (CHASE), Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Yiwen Lu
- The Center for Health AI and Synthesis of Evidence (CHASE), Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Chao Yan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Ruowang Li
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA 90048, United States
- Lianne Siegel
- Division of Biostatistics and Health Data Science, School of Public Health, University of Minnesota, Minneapolis, MN 55414, United States
- Lichao Sun
- Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, United States
- Elizabeth A Shenkman
- Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32611, United States
- Sally C Morton
- School of Mathematical and Statistical Sciences, Arizona State University, Tempe, AZ 85287, United States
- Bradley A Malin
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Computer Science, Vanderbilt University, Nashville, TN 37212, United States
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Jiang Bian
- Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32611, United States
- David A Asch
- Division of General Internal Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States
- Leonard Davis Institute of Health Economics, Philadelphia, PA 19104, United States
- Yong Chen
- The Center for Health AI and Synthesis of Evidence (CHASE), Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Applied Mathematics and Computational Science, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Leonard Davis Institute of Health Economics, Philadelphia, PA 19104, United States
4
Zhang D, Tong J, Jing N, Yang Y, Luo C, Lu Y, Christakis DA, Güthe D, Hornig M, Kelleher KJ, Morse KE, Rogerson CM, Divers J, Carroll RJ, Forrest CB, Chen Y. Learning competing risks across multiple hospitals: one-shot distributed algorithms. J Am Med Inform Assoc 2024; 31:1102-1112. [PMID: 38456459 PMCID: PMC11031234 DOI: 10.1093/jamia/ocae027] [Received: 10/03/2023] [Revised: 12/30/2023] [Accepted: 02/03/2024] [Indexed: 03/09/2024] Open
Abstract
OBJECTIVES To characterize the complex interplay between multiple clinical conditions in a time-to-event analysis framework using data from multiple hospitals, we developed two novel one-shot distributed algorithms for competing risk models (ODACoR). By applying our algorithms to the EHR data from eight national children's hospitals, we quantified the impacts of a wide range of risk factors on the risk of post-acute sequelae of SARS-CoV-2 (PASC) among children and adolescents. MATERIALS AND METHODS Our ODACoR algorithms are designed to be simple to execute and communication-efficient. We evaluated our algorithms via extensive simulation studies as well as an application quantifying the impacts of risk factors for PASC among children and adolescents, using data from eight children's hospitals, including the Children's Hospital of Philadelphia, Cincinnati Children's Hospital Medical Center, and Children's Hospital of Colorado, covering over 6.5 million pediatric patients. The accuracy of the estimation was assessed by comparing the results from our ODACoR algorithms with the estimators derived from the meta-analysis and the pooled data. RESULTS The meta-analysis estimator showed a high relative bias (∼40%) when the clinical condition was relatively rare (∼0.5%), whereas ODACoR algorithms exhibited a substantially lower relative bias (∼0.2%). The estimated effects from our ODACoR algorithms were on par with the estimates from the pooled data, suggesting the high reliability of our federated learning algorithms. In contrast, the meta-analysis estimate failed to identify risk factors such as age, gender, chronic conditions history, and obesity, compared to the pooled data. DISCUSSION Our proposed ODACoR algorithms are communication-efficient, highly accurate, and suitable to characterize the complex interplay between multiple clinical conditions. CONCLUSION Our study demonstrates that our ODACoR algorithms are communication-efficient and can be widely applicable for analyzing multiple clinical conditions in a time-to-event analysis framework.
Affiliation(s)
- Dazheng Zhang
- The Center for Health AI and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Jiayi Tong
- The Center for Health AI and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Naimin Jing
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Biostatistics and Research Decision Sciences, Merck & Co., Inc, Rahway, NJ 07065, United States
- Yuchen Yang
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Chongliang Luo
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Division of Public Health Sciences, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States
- Yiwen Lu
- The Center for Health AI and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- The Graduate Group in Applied Mathematics and Computational Science, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA 19104, United States
- Diana Güthe
- Survivor Corps, Washington, DC 20814, United States
- Mady Hornig
- Department of Epidemiology, Columbia University Mailman School of Public Health, New York, NY 10032, United States
- Kelly J Kelleher
- Research Institute at Nationwide Children’s Hospital, Columbus, OH 43205, United States
- Keith E Morse
- Division of Pediatric Hospital Medicine, Department of Pediatrics, Stanford University, Palo Alto, CA 94304, United States
- Colin M Rogerson
- Department of Pediatrics, Indiana University School of Medicine, Indianapolis, IN 46202, United States
- Jasmin Divers
- Department of Foundations of Medicine, New York University Long Island School of Medicine, Mineola, NY 11501, United States
- Raymond J Carroll
- Department of Statistics, Texas A&M University, College Station, TX 77843, United States
- Christopher B Forrest
- Applied Clinical Research Center, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, United States
- Yong Chen
- The Center for Health AI and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- The Graduate Group in Applied Mathematics and Computational Science, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA 19104, United States
- Penn Institute for Biomedical Informatics (IBI), Philadelphia, PA 19104, United States
- Leonard Davis Institute of Health Economics, Philadelphia, PA 19104, United States
- Penn Medicine Center for Evidence-based Practice (CEP), Philadelphia, PA 19104, United States
5
Zhang D, Tong J, Stein R, Lu Y, Jing N, Yang Y, Boland MR, Luo C, Baldassano RN, Carroll RJ, Forrest CB, Chen Y. One-shot distributed algorithms for addressing heterogeneity in competing risks data across clinical sites. J Biomed Inform 2024; 150:104595. [PMID: 38244958 PMCID: PMC11002871 DOI: 10.1016/j.jbi.2024.104595] [Received: 10/28/2023] [Revised: 12/15/2023] [Accepted: 01/15/2024] [Indexed: 01/22/2024]
Abstract
OBJECTIVE To characterize the interplay between multiple medical conditions across sites and account for the heterogeneity in patient population characteristics across sites within a distributed research network, we develop a one-shot algorithm that can efficiently utilize summary-level data from various institutions. By applying our proposed algorithm to a large pediatric cohort across four national children's hospitals, we replicated a recently published prospective cohort, the RISK study, and quantified the impact of the risk factors associated with the penetrating or stricturing behaviors of pediatric Crohn's disease (PCD). METHODS In this study, we introduce the ODACoRH algorithm, a one-shot distributed algorithm designed for the competing risks model with heterogeneity. Our approach considers the variability in baseline hazard functions of multiple endpoints of interest across different sites. To accomplish this, we build a surrogate likelihood function by combining patient-level data from the local site with aggregated data from other external sites. We validated our method through extensive simulation studies and replication of the RISK study to investigate the impact of risk factors on PCD for adolescents and children from four children's hospitals within PEDSnet, A National Pediatric Learning Health System. To evaluate our ODACoRH algorithm, we compared its results with those from meta-analysis as well as those derived from the pooled data. RESULTS The ODACoRH algorithm had the smallest relative bias to the gold standard method (-0.2%), outperforming the meta-analysis method (-11.4%). In the PCD association study, the estimated subdistribution hazard ratios obtained through the ODACoRH algorithm are on par with the results derived from pooled data, which demonstrates the high reliability of our federated learning algorithms. From a clinical standpoint, the identified risk factors for PCD align well with the RISK study published in the Lancet in 2017 and other published studies, supporting the validity of our findings. CONCLUSION With the ODACoRH algorithm, we demonstrate the capability of effectively integrating data from multiple sites in a decentralized data setting while accounting for between-site heterogeneity. Importantly, our study reveals several crucial clinical risk factors for PCD that merit further investigation.
Affiliation(s)
- Dazheng Zhang
- The Center for Health Analytics and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania; Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA. https://twitter.com/DazhengZ
- Jiayi Tong
- The Center for Health Analytics and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania; Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA. https://twitter.com/JiayiJessieTong
- Ronen Stein
- Department of Pediatrics, Children's Hospital of Philadelphia, Philadelphia, PA, USA; Department of Pediatrics, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
- Yiwen Lu
- The Center for Health Analytics and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania; The Graduate Group in Applied Mathematics and Computational Science, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA
- Naimin Jing
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Biostatistics and Research Decision Sciences, Merck & Co., Inc, NJ, USA
- Yuchen Yang
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Mary R Boland
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Department of Mathematics, Saint Vincent College, Latrobe, PA, USA
- Chongliang Luo
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Division of Public Health Sciences, Washington University School of Medicine in St Louis, St Louis, MO, USA
- Robert N Baldassano
- Department of Pediatrics, Children's Hospital of Philadelphia, Philadelphia, PA, USA; Department of Pediatrics, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
- Christopher B Forrest
- Applied Clinical Research Center, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Yong Chen
- The Center for Health Analytics and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania; Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; The Graduate Group in Applied Mathematics and Computational Science, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA; Leonard Davis Institute of Health Economics, Philadelphia, PA, USA; Penn Medicine Center for Evidence-based Practice (CEP), Philadelphia, PA, USA; Penn Institute for Biomedical Informatics (IBI), Philadelphia, PA, USA.
6
Li S, Liu P, Nascimento GG, Wang X, Leite FRM, Chakraborty B, Hong C, Ning Y, Xie F, Teo ZL, Ting DSW, Haddadi H, Ong MEH, Peres MA, Liu N. Federated and distributed learning applications for electronic health records and structured medical data: a scoping review. J Am Med Inform Assoc 2023; 30:2041-2049. [PMID: 37639629 PMCID: PMC10654866 DOI: 10.1093/jamia/ocad170] [Received: 05/19/2023] [Revised: 07/19/2023] [Indexed: 08/31/2023] Open
Abstract
OBJECTIVES Federated learning (FL) has gained popularity in clinical research in recent years to facilitate privacy-preserving collaboration. Structured data, one of the most prevalent forms of clinical data, has experienced significant growth in volume concurrently, notably with the widespread adoption of electronic health records in clinical practice. This review examines FL applications on structured medical data, identifies contemporary limitations, and discusses potential innovations. MATERIALS AND METHODS We searched 5 databases, SCOPUS, MEDLINE, Web of Science, Embase, and CINAHL, to identify articles that applied FL to structured medical data and reported results following the PRISMA guidelines. Each selected publication was evaluated from 3 primary perspectives, including data quality, modeling strategies, and FL frameworks. RESULTS Out of the 1193 papers screened, 34 met the inclusion criteria, with each article consisting of one or more studies that used FL to handle structured clinical/medical data. Of these, 24 utilized data acquired from electronic health records, with clinical predictions and association studies being the most common clinical research tasks that FL was applied to. Only one article exclusively explored the vertical FL setting, while the remaining 33 explored the horizontal FL setting, with only 14 discussing comparisons between single-site (local) and FL (global) analysis. CONCLUSIONS The existing FL applications on structured medical data lack sufficient evaluations of clinically meaningful benefits, particularly when compared to single-site analyses. Therefore, it is crucial for future FL applications to prioritize clinical motivations and develop designs and methodologies that can effectively support and aid clinical practice and research.
Affiliation(s)
- Siqi Li
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Pinyan Liu
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Gustavo G Nascimento
- National Dental Research Institute Singapore, National Dental Centre Singapore, Singapore 168938, Singapore
- Oral Health Academic Clinical Programme, Duke-NUS Medical School, Singapore 169857, Singapore
- Xinru Wang
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Fabio Renato Manzolli Leite
- National Dental Research Institute Singapore, National Dental Centre Singapore, Singapore 168938, Singapore
- Oral Health Academic Clinical Programme, Duke-NUS Medical School, Singapore 169857, Singapore
- Bibhas Chakraborty
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
- Department of Statistics and Data Science, National University of Singapore, Singapore 117546, Singapore
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27708, United States
- Chuan Hong
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27708, United States
- Yilin Ning
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Feng Xie
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
- Zhen Ling Teo
- Singapore National Eye Centre, Singapore, Singapore Eye Research Institute, Singapore 168751, Singapore
- Daniel Shu Wei Ting
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Singapore National Eye Centre, Singapore, Singapore Eye Research Institute, Singapore 168751, Singapore
- Hamed Haddadi
- Department of Computing, Imperial College London, London SW7 2AZ, England, United Kingdom
- Marcus Eng Hock Ong
- Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
- Department of Emergency Medicine, Singapore General Hospital, Singapore 169608, Singapore
- Marco Aurélio Peres
- National Dental Research Institute Singapore, National Dental Centre Singapore, Singapore 168938, Singapore
- Oral Health Academic Clinical Programme, Duke-NUS Medical School, Singapore 169857, Singapore
- Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
- Nan Liu
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
- Institute of Data Science, National University of Singapore, Singapore 117602, Singapore
7
Li S, Ning Y, Ong MEH, Chakraborty B, Hong C, Xie F, Yuan H, Liu M, Buckland DM, Chen Y, Liu N. FedScore: A privacy-preserving framework for federated scoring system development. J Biomed Inform 2023; 146:104485. [PMID: 37660960 DOI: 10.1016/j.jbi.2023.104485] [Received: 03/29/2023] [Revised: 08/08/2023] [Accepted: 08/31/2023] [Indexed: 09/05/2023]
Abstract
OBJECTIVE We propose FedScore, a privacy-preserving federated learning framework for scoring system generation across multiple sites to facilitate cross-institutional collaborations. MATERIALS AND METHODS The FedScore framework includes five modules: federated variable ranking, federated variable transformation, federated score derivation, federated model selection and federated model evaluation. To illustrate usage and assess FedScore's performance, we built a hypothetical global scoring system for mortality prediction within 30 days after a visit to an emergency department using 10 simulated sites divided from a tertiary hospital in Singapore. We employed a pre-existing score generator to construct 10 local scoring systems independently at each site and we also developed a scoring system using centralized data for comparison. RESULTS We compared the acquired FedScore model's performance with that of other scoring models using the receiver operating characteristic (ROC) analysis. The FedScore model achieved an average area under the curve (AUC) value of 0.763 across all sites, with a standard deviation (SD) of 0.020. We also calculated the average AUC values and SDs for each local model. The FedScore model showed promising accuracy and stability: its average AUC value was the closest to that of the pooled model, and its SD was lower than that of most local models. CONCLUSION This study demonstrates that FedScore is a privacy-preserving scoring system generator with potentially good generalizability.
Affiliation(s)
- Siqi Li
  - Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore
- Yilin Ning
  - Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore
- Marcus Eng Hock Ong
  - Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore; Health Services Research Centre, Singapore Health Services, Singapore, Singapore; Department of Emergency Medicine, Singapore General Hospital, Singapore, Singapore
- Bibhas Chakraborty
  - Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore; Department of Statistics and Data Science, National University of Singapore, Singapore, Singapore; Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
- Chuan Hong
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
- Feng Xie
  - Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore
- Han Yuan
  - Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore
- Mingxuan Liu
  - Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore
- Daniel M Buckland
  - Department of Emergency Medicine, Duke University School of Medicine, Durham, NC, USA
- Yong Chen
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Nan Liu
  - Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore; Institute of Data Science, National University of Singapore, Singapore, Singapore
|
8
|
Liu X, Duan R, Luo C, Ogdie A, Moore JH, Kranzler HR, Bian J, Chen Y. Multisite learning of high-dimensional heterogeneous data with applications to opioid use disorder study of 15,000 patients across 5 clinical sites. Sci Rep 2022; 12:11073. [PMID: 35773438 PMCID: PMC9245877 DOI: 10.1038/s41598-022-14029-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2022] [Accepted: 05/31/2022] [Indexed: 11/17/2022] Open
Abstract
Integrating data across institutions can improve learning efficiency. To integrate data efficiently while protecting privacy, we propose A one-shot, summary-statistics-based, Distributed Algorithm for fitting Penalized (ADAP) regression models across multiple datasets. ADAP utilizes patient-level data from a lead site and incorporates the first-order (ADAP1) and second-order (ADAP2) gradients of the objective function from collaborating sites to construct a surrogate objective function at the lead site, where model fitting is then completed with proper regularization applied. We evaluate the performance of the proposed method using both simulation and a real-world application that studies risk factors for opioid use disorder (OUD) using data from 15,000 patients in the OneFlorida Clinical Research Consortium. Our results show that ADAP performs nearly the same as the pooled estimator and achieves higher estimation accuracy and better variable selection than the local and average estimators. Moreover, ADAP2 successfully handles heterogeneity in covariate distributions.
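The idea of a first-order surrogate objective can be sketched in a few lines. The code below is a generic illustration under stated assumptions and not the authors' ADAP estimator: it assumes a logistic model, an L1 penalty, a lead-site initial estimate, and a plain proximal-gradient solver, and every function name and the simulated data are hypothetical. Each collaborating site shares only one gradient vector evaluated at the initial estimate; the lead site then minimizes its own loss plus a linear correction term and the penalty.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(beta, X, y):
    """Gradient of the average logistic negative log-likelihood."""
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_l1_logistic(X, y, lam, shift=None, n_iter=2000):
    """Proximal gradient descent for
    average logistic loss + shift . beta + lam * ||beta||_1."""
    n, p = X.shape
    shift = np.zeros(p) if shift is None else shift
    # Step size 1/L, where L = sigma_max(X)^2 / (4 n) bounds the Hessian.
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(p)
    for _ in range(n_iter):
        g = logistic_grad(beta, X, y) + shift
        beta = soft_threshold(beta - step * g, step * lam)
    return beta

def first_order_surrogate_fit(sites, lam, lead=0):
    """One communication round: sites send gradients, lead site refits."""
    X1, y1 = sites[lead]
    beta_bar = fit_l1_logistic(X1, y1, lam)      # initial estimate, lead site only
    n_total = sum(len(y) for _, y in sites)
    # Sample-size-weighted average of the site gradients at beta_bar;
    # each site contributes a p-vector, never patient-level rows.
    global_grad = sum(len(y) * logistic_grad(beta_bar, X, y)
                      for X, y in sites) / n_total
    # Linear correction: replace the lead-site gradient by the global one.
    shift = global_grad - logistic_grad(beta_bar, X1, y1)
    return fit_l1_logistic(X1, y1, lam, shift=shift)

# Hypothetical simulated data for three sites sharing one true coefficient vector.
rng = np.random.default_rng(0)
true_beta = np.array([1.0, -1.0, 0.0, 0.0, 0.5])
sites = []
for n in (300, 400, 500):
    X = rng.normal(size=(n, 5))
    y = (rng.random(n) < sigmoid(X @ true_beta)).astype(float)
    sites.append((X, y))

beta_hat = first_order_surrogate_fit(sites, lam=0.01)
print(np.round(beta_hat, 3))
```

A second-order variant in the spirit of ADAP2 would, under the same assumptions, also collect Hessian matrices from the sites and add a quadratic correction term; that extension is omitted here.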
Affiliation(s)
- Xiaokang Liu
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, 423 Guardian Drive, Philadelphia, PA, 19104, USA
- Rui Duan
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
- Chongliang Luo
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, 423 Guardian Drive, Philadelphia, PA, 19104, USA
  - Division of Public Health Sciences, Washington University School of Medicine in St. Louis, St. Louis, MO, USA
- Alexis Ogdie
  - Department of Medicine, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Jason H Moore
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, 90096, USA
- Henry R Kranzler
  - Department of Psychiatry, University of Pennsylvania Perelman School of Medicine and the VISN 4 MIRECC, Crescenz VAMC, Philadelphia, PA, USA
- Jiang Bian
  - Department of Health Outcomes and Biomedical Informatics, University of Florida Health Cancer Center, Gainesville, FL, USA
- Yong Chen
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, 423 Guardian Drive, Philadelphia, PA, 19104, USA
|
9
|
Distributed learning for heterogeneous clinical data with application to integrating COVID-19 data across 230 sites. NPJ Digit Med 2022; 5:76. [PMID: 35701668 PMCID: PMC9198031 DOI: 10.1038/s41746-022-00615-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/09/2021] [Accepted: 05/19/2022] [Indexed: 11/09/2022] Open
Abstract
Integrating real-world data (RWD) from several clinical sites offers great opportunities to improve estimation with a more general population compared to analyses based on a single clinical site. However, sharing patient-level data across sites is practically challenging due to concerns about maintaining patient privacy. We develop a distributed algorithm to integrate heterogeneous RWD from multiple clinical sites without sharing patient-level data. The proposed distributed conditional logistic regression (dCLR) algorithm can effectively account for between-site heterogeneity and requires only one round of communication. Our simulation study and an application to data from 14,215 COVID-19 patients at 230 clinical sites in the UnitedHealth Group Clinical Research Database demonstrate that the proposed distributed algorithm provides an estimator that is robust to heterogeneity in event rates when efficiently integrating data from multiple clinical sites. Our algorithm is therefore a practical alternative to both meta-analysis and existing distributed algorithms for modeling heterogeneous multi-site binary outcomes.
|