1
Tang AS, Rankin KP, Cerono G, Miramontes S, Mills H, Roger J, Zeng B, Nelson C, Soman K, Woldemariam S, Li Y, Lee A, Bove R, Glymour M, Aghaeepour N, Oskotsky TT, Miller Z, Allen IE, Sanders SJ, Baranzini S, Sirota M. Leveraging electronic health records and knowledge networks for Alzheimer's disease prediction and sex-specific biological insights. Nat Aging 2024; 4:379-395. [PMID: 38383858 PMCID: PMC10950787 DOI: 10.1038/s43587-024-00573-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 01/19/2024] [Indexed: 02/23/2024]
Abstract
Identification of Alzheimer's disease (AD) onset risk can facilitate interventions before irreversible disease progression. We demonstrate that electronic health records from the University of California, San Francisco, followed by knowledge networks (for example, SPOKE), allow for (1) prediction of AD onset, (2) prioritization of biological hypotheses and (3) contextualization of sex dimorphism. We trained random forest models and predicted AD onset on a cohort of 749 individuals with AD and 250,545 controls with a mean area under the receiver operating characteristic curve of 0.72 (7 years prior) to 0.81 (1 day prior). We further harnessed matched cohort models to identify conditions with predictive power before AD onset. Knowledge networks highlight shared genes between multiple top predictors and AD (for example, APOE, ACTB, IL6 and INS). Genetic colocalization analysis supports AD association with hyperlipidemia at the APOE locus, as well as a stronger female AD association with osteoporosis at a locus near MS4A6A. We therefore show how clinical data can be utilized for early AD prediction and identification of personalized biological hypotheses.
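The area under the receiver operating characteristic curve quoted in this abstract can be read as the probability that a randomly chosen case receives a higher model score than a randomly chosen control. The following is a minimal sketch of that pairwise definition, not the authors' pipeline; the function name `auroc` and the labels and scores are made up for illustration.

```python
def auroc(labels, scores):
    """Fraction of (case, control) pairs in which the case has the higher score.

    Ties count as half a correct pair. labels are 1 (case) or 0 (control).
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    correct = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                correct += 1.0
            elif case_score == control_score:
                correct += 0.5
    return correct / (len(cases) * len(controls))


# Hypothetical toy data: 3 cases and 4 controls with model scores.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.3, 0.2, 0.1]

# 11 of the 12 case/control pairs are ordered correctly.
print(auroc(labels, scores))
```

The double loop is quadratic in cohort size, so it is only suitable for illustration; rank-based formulas give the same value far faster on cohorts of the size described here.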
Affiliation(s)
- Alice S Tang
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Graduate Program in Bioengineering, University of California, San Francisco and University of California, Berkeley, San Francisco and Berkeley, CA, USA
- Katherine P Rankin
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Memory and Aging Center, Department of Neurology, University of California, San Francisco, San Francisco, CA, USA
- Gabriel Cerono
  - Weill Institute for Neuroscience. Department of Neurology, University of California, San Francisco, San Francisco, CA, USA
- Silvia Miramontes
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Hunter Mills
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Jacquelyn Roger
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Billy Zeng
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Charlotte Nelson
  - Weill Institute for Neuroscience. Department of Neurology, University of California, San Francisco, San Francisco, CA, USA
- Karthik Soman
  - Weill Institute for Neuroscience. Department of Neurology, University of California, San Francisco, San Francisco, CA, USA
- Sarah Woldemariam
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Yaqiao Li
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Albert Lee
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Riley Bove
  - Weill Institute for Neuroscience. Department of Neurology, University of California, San Francisco, San Francisco, CA, USA
- Maria Glymour
  - Department of Anesthesiology, Pain, and Perioperative Medicine, Stanford University, Palo Alto, CA, USA
- Nima Aghaeepour
  - Department of Anesthesiology, Pain, and Perioperative Medicine, Stanford University, Palo Alto, CA, USA
  - Department of Pediatrics, Stanford University, Palo Alto, CA, USA
  - Department of Biomedical Data Science, Stanford University, Palo Alto, CA, USA
- Tomiko T Oskotsky
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Zachary Miller
  - Memory and Aging Center, Department of Neurology, University of California, San Francisco, San Francisco, CA, USA
- Isabel E Allen
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA
- Stephan J Sanders
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Institute of Developmental and Regenerative Medicine, Department of Paediatrics, University of Oxford, Oxford, UK
  - Department of Psychiatry and Behavioral Sciences, Weill Institute for Neurosciences, University of California, San Francisco, CA, USA
- Sergio Baranzini
  - Weill Institute for Neuroscience. Department of Neurology, University of California, San Francisco, San Francisco, CA, USA
- Marina Sirota
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Department of Pediatrics, University of California, San Francisco, CA, USA
2
Cerono G, Melaiu O, Chicco D. Clinical Feature Ranking Based on Ensemble Machine Learning Reveals Top Survival Factors for Glioblastoma Multiforme. J Healthc Inform Res 2024; 8:1-18. [PMID: 38273986 PMCID: PMC10805687 DOI: 10.1007/s41666-023-00138-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2022] [Revised: 07/06/2023] [Accepted: 07/07/2023] [Indexed: 01/27/2024]
Abstract
Glioblastoma multiforme (GM) is a malignant tumor of the central nervous system considered to be highly aggressive and often carrying a terrible survival prognosis. An accurate prognosis is therefore pivotal for deciding a good treatment plan for patients. In this context, computational intelligence applied to data of electronic health records (EHRs) of patients diagnosed with this disease can be useful to predict the patients' survival time. In this study, we evaluated different machine learning models to predict survival time in patients suffering from glioblastoma and further investigated which features were the most predictive for survival time. We applied our computational methods to three different independent open datasets of EHRs of patients with glioblastoma: the Shieh dataset of 84 patients, the Berendsen dataset of 647 patients, and the Lammer dataset of 60 patients. Our survival time prediction techniques obtained concordance index (C-index) = 0.583 in the Shieh dataset, C-index = 0.776 in the Berendsen dataset, and C-index = 0.64 in the Lammer dataset, as best results in each dataset. Since the original studies regarding the three datasets analyzed here did not provide insights about the most predictive clinical features for survival time, we investigated the feature importance among these datasets. To this end, we then utilized Random Survival Forests, which is a decision tree-based algorithm able to model non-linear interaction between different features and might be able to better capture the highly complex clinical and genetic status of these patients. Our discoveries can impact clinical practice, aiding clinicians and patients alike to decide which therapy plan is best suited for their unique clinical status.
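The concordance index (C-index) reported for the three datasets measures how often a model ranks two patients in the right order of survival. Below is a minimal sketch of Harrell's C with right-censoring, written in plain Python; it is an illustration of the metric only, not the study's code, and the function name `concordance_index` and the four toy patients are hypothetical.

```python
def concordance_index(times, events, risk):
    """Harrell's C-index for right-censored survival data.

    times:  observed follow-up time per patient
    events: 1 if death was observed, 0 if the patient was censored
    risk:   predicted risk score (higher means shorter expected survival)

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time. The pair is concordant
    when the earlier patient also has the higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot be the earlier one in a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: the third patient is censored at month 12.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]

# 4 of the 5 comparable pairs are ordered correctly.
print(concordance_index(times, events, risk))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported values of 0.583, 0.776 and 0.64 should be read.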
Affiliation(s)
- Gabriel Cerono
  - Department of Neurology, University of California San Francisco, San Francisco, CA, USA
- Davide Chicco
  - Dipartimento di Informatica Sistemistica e Comunicazione, Università di Milano-Bicocca, Milan, Italy
  - Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
3
Cerono G, Chicco D. Ensemble machine learning reveals key features for diabetes duration from electronic health records. PeerJ Comput Sci 2024; 10:e1896. [PMID: 38435625 PMCID: PMC10909161 DOI: 10.7717/peerj-cs.1896] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2023] [Accepted: 01/30/2024] [Indexed: 03/05/2024]
Abstract
Diabetes is a metabolic disorder that affects more than 420 million people worldwide, and it is caused by a high level of sugar in the blood over a long period. Diabetes can have serious long-term health consequences, such as cardiovascular diseases, strokes, chronic kidney diseases, foot ulcers, retinopathy, and others. Although common, this disease is hard to spot because it often comes with no symptoms. Especially for type 2 diabetes, which occurs mainly in adults, knowing how long a patient has had diabetes can have a strong impact on the treatment they receive. This information, although pivotal, might be absent: for some patients the year of the diabetes diagnosis is well known, but the year of disease onset is not. In this context, machine learning applied to electronic health records (EHRs) can be an effective tool to predict the past duration of diabetes for a patient. In this study, we applied a regression analysis based on several computational intelligence methods to a dataset of electronic health records of 73 patients with type 1 diabetes with 20 variables and another dataset of records of 400 patients with type 2 diabetes with 49 variables. Among the algorithms applied, Random Forests outperformed the others and efficiently predicted diabetes duration for both cohorts, with regression performance measured through the coefficient of determination R². Afterwards, we applied the same method for feature ranking and detected the factors in the clinical records most correlated with past diabetes duration: age, insulin intake, and body-mass index. Our findings can have a profound impact on clinical practice: when information about the duration of a patient's diabetes is missing, medical doctors can use our tool and focus on age, insulin intake, and body-mass index to infer this important aspect.
Regarding limitations, we were unable to find an additional dataset of EHRs of patients with diabetes having the same variables as the two analyzed here, so we could not verify our findings on a validation cohort.
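The coefficient of determination used here as the regression metric compares a model's squared error with that of always predicting the mean. A minimal sketch follows, assuming the usual definition R² = 1 − SS_res / SS_tot; the function name `r_squared` and the duration values (in years) are invented for illustration and are not taken from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical diabetes durations in years: true versus predicted.
y_true = [2.0, 5.0, 9.0, 12.0]
y_pred = [3.0, 5.0, 8.0, 12.0]

# SS_res = 2 and SS_tot = 58, so R^2 = 1 - 2/58.
print(r_squared(y_true, y_pred))
```

Predicting the mean for every patient yields exactly 0, and a model can score below 0 if it does worse than that baseline.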
Affiliation(s)
- Gabriel Cerono
  - Department of Neurology, University of California San Francisco, San Francisco, CA, USA
- Davide Chicco
  - Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Canada
  - Dipartimento di Informatica Sistemistica e Comunicazione, Università di Milano-Bicocca, Milan, Italy
4
Soman K, Nelson CA, Cerono G, Goldman SM, Baranzini SE, Brown EG. Early detection of Parkinson's disease through enriching the electronic health record using a biomedical knowledge graph. Front Med (Lausanne) 2023; 10:1081087. [PMID: 37250641 PMCID: PMC10217780 DOI: 10.3389/fmed.2023.1081087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2022] [Accepted: 04/18/2023] [Indexed: 05/31/2023]
Abstract
Introduction: Early diagnosis of Parkinson's disease (PD) is important to identify treatments to slow neurodegeneration. People who develop PD often have symptoms before the disease manifests, and these may be coded as diagnoses in the electronic health record (EHR). Methods: To predict PD diagnosis, we embedded EHR data of patients onto a biomedical knowledge graph called Scalable Precision medicine Open Knowledge Engine (SPOKE) and created patient embedding vectors. We trained and validated a classifier using these vectors from 3,004 PD patients, restricting records to 1, 3, and 5 years before diagnosis, and from a non-PD group of 457,197. Results: The classifier predicted PD diagnosis with moderate accuracy (AUC = 0.77 ± 0.06, 0.74 ± 0.05, 0.72 ± 0.05 at 1, 3, and 5 years) and performed better than other benchmark methods. Nodes in the SPOKE graph, among cases, revealed novel associations, while SPOKE patient vectors revealed the basis for individual risk classification. Discussion: The proposed method was able to explain the clinical predictions using the knowledge graph, thereby making the predictions clinically interpretable. Through enriching EHR data with biomedical associations, SPOKE may be a cost-efficient and personalized way to predict PD diagnosis years before its occurrence.
Affiliation(s)
- Karthik Soman
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA, United States
- Charlotte A. Nelson
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA, United States
- Gabriel Cerono
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA, United States
- Samuel M. Goldman
  - Division of Occupational and Environmental Medicine, University of California, San Francisco, San Francisco, CA, United States
- Sergio E. Baranzini
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA, United States
- Ethan G. Brown
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA, United States
5
Morris JH, Soman K, Akbas RE, Zhou X, Smith B, Meng EC, Huang CC, Cerono G, Schenk G, Rizk-Jackson A, Harroud A, Sanders L, Costes SV, Bharat K, Chakraborty A, Pico AR, Mardirossian T, Keiser M, Tang A, Hardi J, Shi Y, Musen M, Israni S, Huang S, Rose PW, Nelson CA, Baranzini SE. The scalable precision medicine open knowledge engine (SPOKE): a massive knowledge graph of biomedical information. Bioinformatics 2023; 39:btad080. [PMID: 36759942 PMCID: PMC9940622 DOI: 10.1093/bioinformatics/btad080] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 07/26/2022] [Revised: 01/17/2023] [Accepted: 02/08/2023] [Indexed: 02/11/2023]
Abstract
MOTIVATION: Knowledge graphs (KGs) are being adopted in industry, commerce and academia. Biomedical KGs present a challenge due to the complexity, size and heterogeneity of the underlying information. RESULTS: In this work, we present the Scalable Precision Medicine Open Knowledge Engine (SPOKE), a biomedical KG connecting millions of concepts via semantically meaningful relationships. SPOKE contains 27 million nodes of 21 different types and 53 million edges of 55 types downloaded from 41 databases. The graph is built on the framework of 11 ontologies that maintain its structure, enable mappings and facilitate navigation. SPOKE is built weekly by Python scripts that download each resource, check for integrity and completeness, and then create a 'parent table' of nodes and edges. Graph queries are translated by a REST API and users can submit searches directly via an API or a graphical user interface. CONCLUSIONS/SIGNIFICANCE: SPOKE enables the integration of seemingly disparate information to support precision medicine efforts. AVAILABILITY AND IMPLEMENTATION: The SPOKE neighborhood explorer is available at https://spoke.rbvi.ucsf.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
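To make the idea of typed nodes and typed edges concrete, here is a toy in-memory graph with a neighborhood query. It is a sketch only and assumes nothing about SPOKE's actual schema, storage or REST API; the class `TinyKG`, the helper `shared_neighbors`, and every node and edge label are placeholders invented for illustration.

```python
from collections import defaultdict


class TinyKG:
    """Toy knowledge graph: each node has a type, each edge has a type."""

    def __init__(self):
        self.node_type = {}              # node name -> node type
        self.edges = defaultdict(list)   # node name -> [(edge type, neighbor)]

    def add_node(self, name, ntype):
        self.node_type[name] = ntype

    def add_edge(self, src, etype, dst):
        # Stored in both directions so neighborhoods are undirected.
        self.edges[src].append((etype, dst))
        self.edges[dst].append((etype, src))

    def neighbors(self, name, ntype=None):
        """Sorted neighbors of a node, optionally filtered by node type."""
        return sorted(
            n for _, n in self.edges[name]
            if ntype is None or self.node_type[n] == ntype
        )


def shared_neighbors(kg, a, b, ntype):
    """Nodes of the given type connected to both a and b."""
    return sorted(set(kg.neighbors(a, ntype)) & set(kg.neighbors(b, ntype)))


# Placeholder content, not real biomedical associations.
kg = TinyKG()
for name, ntype in [("DiseaseA", "Disease"), ("DiseaseB", "Disease"),
                    ("GeneX", "Gene"), ("GeneY", "Gene"),
                    ("CompoundZ", "Compound")]:
    kg.add_node(name, ntype)
kg.add_edge("DiseaseA", "ASSOCIATES", "GeneX")
kg.add_edge("DiseaseA", "ASSOCIATES", "GeneY")
kg.add_edge("DiseaseB", "ASSOCIATES", "GeneX")
kg.add_edge("CompoundZ", "TREATS", "DiseaseB")

print(shared_neighbors(kg, "DiseaseA", "DiseaseB", "Gene"))
```

A graph of the size described in the abstract would normally live in a graph database rather than in Python dictionaries; the point of the sketch is only the shape of the data and of a type-filtered neighborhood query.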
Affiliation(s)
- John H Morris
  - Department of Pharmaceutical Chemistry, School of Pharmacy, University of California, San Francisco, San Francisco, CA 94158, USA
- Karthik Soman
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, USA
- Rabia E Akbas
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, USA
- Xiaoyuan Zhou
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, USA
- Brett Smith
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Elaine C Meng
  - Department of Pharmaceutical Chemistry, School of Pharmacy, University of California, San Francisco, San Francisco, CA 94158, USA
- Conrad C Huang
  - Department of Pharmaceutical Chemistry, School of Pharmacy, University of California, San Francisco, San Francisco, CA 94158, USA
- Gabriel Cerono
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, USA
- Gundolf Schenk
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, USA
- Angela Rizk-Jackson
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, USA
- Adil Harroud
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, USA
- Lauren Sanders
  - Space Biosciences Division, NASA Ames Research Center, Moffett Field, CA 94035, USA
- Sylvain V Costes
  - Space Biosciences Division, NASA Ames Research Center, Moffett Field, CA 94035, USA
- Krish Bharat
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, USA
- Arjun Chakraborty
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, USA
- Alexander R Pico
  - Data Science and Biotechnology, Gladstone Institutes, University of California, San Francisco, San Francisco, CA 94158, USA
- Taline Mardirossian
  - Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA 94143-2550, USA
- Michael Keiser
  - Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA 94143-2550, USA
- Alice Tang
  - UCSF-UC Berkeley Bioengineering Graduate Program, University of California, San Francisco, San Francisco, CA 94158, USA
- Josef Hardi
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305-5479, USA
- Yongmei Shi
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, USA
- Mark Musen
  - Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305-5479, USA
- Sharat Israni
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, USA
- Sui Huang
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Peter W Rose
  - San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA
- Charlotte A Nelson
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, USA
- Sergio E Baranzini
  - Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, USA
6
Soman K, Nelson CA, Cerono G, Baranzini SE. Time-aware Embeddings of Clinical Data using a Knowledge Graph. Pac Symp Biocomput 2023; 28:97-108. [PMID: 36540968 PMCID: PMC9782808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/25/2022]
Abstract
Meaningful representation of clinical data using embedding vectors is a pivotal step before invoking any machine learning (ML) algorithm for data inference. In this article, we propose a time-aware approach for embedding electronic health records onto a biomedical knowledge graph to create machine-readable patient representations. This approach not only captures the temporal dynamics of patient clinical trajectories, but also enriches them with additional biological information from the knowledge graph. To gauge the predictivity of this approach, we propose an ML pipeline called TANDEM (Temporal and Non-temporal Dynamics Embedded Model) and apply it to the early detection of Parkinson's disease. TANDEM achieves a classification AUC score of 0.85 on an unseen test dataset. These predictions are further explained by providing biological insight using the knowledge graph. Taken together, we show that temporal embeddings of clinical data could be a meaningful predictive representation for downstream ML pipelines in clinical decision-making.