1
Yin Y, Kim H, Xiao X, Wei CH, Kang J, Lu Z, Xu H, Fang M, Chen Q. Augmenting biomedical named entity recognition with general-domain resources. J Biomed Inform 2024; 159:104731. [PMID: 39368529 DOI: 10.1016/j.jbi.2024.104731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2024] [Revised: 09/05/2024] [Accepted: 09/27/2024] [Indexed: 10/07/2024]
Abstract
OBJECTIVE Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets. METHODS We proposed GERBERA, a simple-yet-effective method that utilized general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset. RESULTS We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. Specifically, our models consistently outperformed the baseline models in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight entities. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset. CONCLUSION This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets. We make data, codes, and models publicly available via https://github.com/qingyu-qc/bioner_gerbera.
Affiliation(s)
- Yu Yin
  - Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Hyunjae Kim
  - Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Xiao Xiao
  - Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Chih Hsuan Wei
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
- Jaewoo Kang
  - Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
- Hua Xu
  - Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America
- Meng Fang
  - Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Qingyu Chen
  - Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America
2
Steiert D, Wittig C, Banerjee P, Preissner R, Szulcek R. An exploration into CTEPH medications: Combining natural language processing, embedding learning, in vitro models, and real-world evidence for drug repurposing. PLoS Comput Biol 2024; 20:e1012417. [PMID: 39264975 PMCID: PMC11478854 DOI: 10.1371/journal.pcbi.1012417] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Revised: 10/15/2024] [Accepted: 08/14/2024] [Indexed: 09/14/2024] Open
Abstract
BACKGROUND In the modern era, the growth of scientific literature presents a daunting challenge for researchers to keep informed of advancements across multiple disciplines. OBJECTIVE We apply natural language processing (NLP) and embedding learning concepts to design PubDigest, a tool that combs PubMed literature, aiming to pinpoint potential drugs that could be repurposed. METHODS Using NLP, especially term associations through word embeddings, we explored unrecognized relationships between drugs and diseases. To illustrate the utility of PubDigest, we focused on chronic thromboembolic pulmonary hypertension (CTEPH), a rare disease with an overall limited number of scientific publications. RESULTS Our literature analysis identified key clinical features linked to CTEPH by applying term frequency-inverse document frequency (TF-IDF) scoring, a technique measuring a term's significance in a text corpus. This allowed us to map related diseases. One standout was venous thrombosis (VT), which showed strong semantic links with CTEPH. Looking deeper, we discovered potential repurposing candidates for CTEPH through large-scale neural network-based contextualization of literature and predictive modeling on both the CTEPH and the VT literature corpora to find novel, yet unrecognized associations between the two diseases. Alongside the anti-thrombotic agent caplacizumab, benzofuran derivatives were an intriguing find. In particular, the benzofuran derivative amiodarone displayed potential anti-thrombotic properties in the literature. Our in vitro tests confirmed amiodarone's ability to reduce platelet aggregation significantly by 68% (p = 0.02). However, real-world clinical data indicated that CTEPH patients receiving amiodarone treatment faced a significant 15.9% higher mortality risk (p<0.001). 
CONCLUSIONS While NLP offers an innovative approach to interpreting scientific literature, especially for drug repurposing, it is crucial to combine it with complementary methods like in vitro testing and real-world evidence. Our exploration with benzofuran derivatives and CTEPH underscores this point. Thus, blending NLP with hands-on experiments and real-world clinical data can pave the way for faster and safer drug repurposing approaches, especially for rare diseases like CTEPH.
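The TF-IDF scoring mentioned in the abstract above can be illustrated with a minimal sketch. This is not the PubDigest implementation: the function name `tf_idf`, the toy corpus, and the particular weighting (relative term frequency multiplied by the natural log of the inverse document frequency) are assumptions chosen for illustration, and real pipelines often use smoothed variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Hypothetical helper for illustration; uses unsmoothed
    tf = count / doc_length and idf = ln(N / document_frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus (invented for this example, not taken from the study).
corpus = [
    "cteph thrombosis pulmonary hypertension".split(),
    "venous thrombosis anticoagulant".split(),
    "pulmonary hypertension therapy".split(),
]
scores = tf_idf(corpus)
```

In this toy corpus, "cteph" appears in only one document and therefore scores higher in the first document than "thrombosis", which appears in two; a term present in every document would score zero under this weighting.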
Affiliation(s)
- Daniel Steiert
  - Laboratory of in vitro modeling systems of pulmonary and thrombotic diseases, Institute of Physiology, Charité–Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Corey Wittig
  - Laboratory of in vitro modeling systems of pulmonary and thrombotic diseases, Institute of Physiology, Charité–Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Priyanka Banerjee
  - Structural Bioinformatics Group, Institute of Physiology, Charité–Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Robert Preissner
  - Structural Bioinformatics Group, Institute of Physiology, Charité–Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Robert Szulcek
  - Laboratory of in vitro modeling systems of pulmonary and thrombotic diseases, Institute of Physiology, Charité–Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - Deutsches Herzzentrum der Charité, Department of Cardiac Anesthesiology and Intensive Care Medicine, Berlin, Germany
3
Martinez-Rico JR, Araujo L, Martinez-Romo J. Building a framework for fake news detection in the health domain. PLoS One 2024; 19:e0305362. [PMID: 38976665 PMCID: PMC11230534 DOI: 10.1371/journal.pone.0305362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2023] [Accepted: 05/28/2024] [Indexed: 07/10/2024] Open
Abstract
Disinformation in the medical field is a growing problem that carries a significant risk. Therefore, it is crucial to detect and combat it effectively. In this article, we provide three elements to aid in this fight: 1) a new framework that collects health-related articles from verification entities and facilitates their check-worthiness and fact-checking annotation at the sentence level; 2) a corpus generated using this framework, composed of 10335 sentences annotated in these two concepts and grouped into 327 articles, which we call KEANE (faKe nEws At seNtence lEvel); and 3) a new model for verifying fake news that combines specific identifiers of the medical domain with triplets subject-predicate-object, using Transformers and feedforward neural networks at the sentence level. This model predicts the fact-checking of sentences and evaluates the veracity of the entire article. After training this model on our corpus, we achieved remarkable results in the binary classification of sentences (check-worthiness F1: 0.749, fact-checking F1: 0.698) and in the final classification of complete articles (F1: 0.703). We also tested its performance against another public dataset and found that it performed better than most systems evaluated on that dataset. Moreover, the corpus we provide differs from other existing corpora in its duality of sentence-article annotation, which can provide an additional level of justification of the prediction of truth or untruth made by the model.
Affiliation(s)
- Juan R Martinez-Rico
  - NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
- Lourdes Araujo
  - NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
  - Instituto Mixto de Investigación - Escuela Nacional de Sanidad (IMIENS), Madrid, Spain
- Juan Martinez-Romo
  - NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
  - Instituto Mixto de Investigación - Escuela Nacional de Sanidad (IMIENS), Madrid, Spain
4
Patidar K, Deng JH, Mitchell CS, Ford Versypt AN. Cross-Domain Text Mining of Pathophysiological Processes Associated with Diabetic Kidney Disease. Int J Mol Sci 2024; 25:4503. [PMID: 38674089 PMCID: PMC11050166 DOI: 10.3390/ijms25084503] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 04/16/2024] [Accepted: 04/17/2024] [Indexed: 04/28/2024] Open
Abstract
Diabetic kidney disease (DKD) is the leading cause of end-stage renal disease worldwide. This study's goal was to identify the signaling drivers and pathways that modulate glomerular endothelial dysfunction in DKD via artificial intelligence-enabled literature-based discovery. Cross-domain text mining of 33+ million PubMed articles was performed with SemNet 2.0 to identify and rank multi-scalar and multi-factorial pathophysiological concepts related to DKD. A set of identified relevant genes and proteins that regulate different pathological events associated with DKD were analyzed and ranked using normalized mean HeteSim scores. High-ranking genes and proteins intersected three domains: DKD, the immune response, and glomerular endothelial cells. The top 10% of ranked concepts were mapped to the following biological functions: angiogenesis, apoptotic processes, cell adhesion, chemotaxis, growth factor signaling, vascular permeability, the nitric oxide response, oxidative stress, the cytokine response, macrophage signaling, NFκB factor activity, the TLR pathway, glucose metabolism, the inflammatory response, the ERK/MAPK signaling response, the JAK/STAT pathway, the T-cell-mediated response, the WNT/β-catenin pathway, the renin-angiotensin system, and NADPH oxidase activity. High-ranking genes and proteins were used to generate a protein-protein interaction network. The study results prioritized interactions or molecules involved in dysregulated signaling in DKD, which can be further assessed through biochemical network models or experiments.
Affiliation(s)
- Krutika Patidar
  - Department of Chemical and Biological Engineering, University at Buffalo, Buffalo, NY 14260, USA
- Jennifer H. Deng
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA
- Cassie S. Mitchell
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA
  - Center for Machine Learning at Georgia Tech, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Ashlee N. Ford Versypt
  - Department of Chemical and Biological Engineering, University at Buffalo, Buffalo, NY 14260, USA
  - Department of Biomedical Engineering, University at Buffalo, Buffalo, NY 14260, USA
  - Institute for Artificial Intelligence and Data Science, University at Buffalo, Buffalo, NY 14260, USA
5
Oss Boll H, Amirahmadi A, Ghazani MM, Morais WOD, Freitas EPD, Soliman A, Etminani F, Byttner S, Recamonde-Mendoza M. Graph neural networks for clinical risk prediction based on electronic health records: A survey. J Biomed Inform 2024; 151:104616. [PMID: 38423267 DOI: 10.1016/j.jbi.2024.104616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2023] [Revised: 02/21/2024] [Accepted: 02/23/2024] [Indexed: 03/02/2024]
Abstract
OBJECTIVE This study aims to comprehensively review the use of graph neural networks (GNNs) for clinical risk prediction based on electronic health records (EHRs). The primary goal is to provide an overview of the state-of-the-art of this subject, highlighting ongoing research efforts and identifying existing challenges in developing effective GNNs for improved prediction of clinical risks. METHODS A search was conducted in the Scopus, PubMed, ACM Digital Library, and Embase databases to identify relevant English-language papers that used GNNs for clinical risk prediction based on EHR data. The study includes original research papers published between January 2009 and May 2023. RESULTS Following the initial screening process, 50 articles were included in the data collection. A significant increase in publications from 2020 was observed, with most selected papers focusing on diagnosis prediction (n = 36). The study revealed that the graph attention network (GAT) (n = 19) was the most prevalent architecture, and MIMIC-III (n = 23) was the most common data resource. CONCLUSION GNNs are relevant tools for predicting clinical risk by accounting for the relational aspects among medical events and entities and managing large volumes of EHR data. Future studies in this area may address challenges such as EHR data heterogeneity, multimodality, and model interpretability, aiming to develop more holistic GNN models that can produce more accurate predictions, be effectively implemented in clinical settings, and ultimately improve patient care.
Affiliation(s)
- Heloísa Oss Boll
  - Institute of Informatics, Universidade Federal do Rio Grande do Sul, Avenida Bento Gonçalves, 9500, Porto Alegre, 91501-970, RS, Brazil
  - School of Information Technology, Halmstad University, Kristian IV:s väg 3, Halmstad, 301 18, Sweden
- Ali Amirahmadi
  - School of Information Technology, Halmstad University, Kristian IV:s väg 3, Halmstad, 301 18, Sweden
- Mirfarid Musavian Ghazani
  - School of Information Technology, Halmstad University, Kristian IV:s väg 3, Halmstad, 301 18, Sweden
- Wagner Ourique de Morais
  - School of Information Technology, Halmstad University, Kristian IV:s väg 3, Halmstad, 301 18, Sweden
- Edison Pignaton de Freitas
  - Institute of Informatics, Universidade Federal do Rio Grande do Sul, Avenida Bento Gonçalves, 9500, Porto Alegre, 91501-970, RS, Brazil
- Amira Soliman
  - School of Information Technology, Halmstad University, Kristian IV:s väg 3, Halmstad, 301 18, Sweden
- Farzaneh Etminani
  - School of Information Technology, Halmstad University, Kristian IV:s väg 3, Halmstad, 301 18, Sweden
- Stefan Byttner
  - School of Information Technology, Halmstad University, Kristian IV:s väg 3, Halmstad, 301 18, Sweden
- Mariana Recamonde-Mendoza
  - Institute of Informatics, Universidade Federal do Rio Grande do Sul, Avenida Bento Gonçalves, 9500, Porto Alegre, 91501-970, RS, Brazil
  - Bioinformatics Core, Hospital de Clínicas de Porto Alegre (HCPA), Av. Protásio Alves, 211, Bloco C, Porto Alegre, 90035-903, RS, Brazil
6
Feng X, Zhu S, Shen Y, Zhu H, Yan M, Cai G, Ning G. Multi-organ spatiotemporal information aware model for sepsis mortality prediction. Artif Intell Med 2024; 147:102746. [PMID: 38184353 DOI: 10.1016/j.artmed.2023.102746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2023] [Revised: 12/05/2023] [Accepted: 12/05/2023] [Indexed: 01/08/2024]
Abstract
BACKGROUND Sepsis is a syndrome involving multi-organ dysfunction, and the mortality in sepsis patients correlates with the number of lesioned organs. Precise prognosis models play a pivotal role in enabling healthcare practitioners to administer timely and accurate interventions for sepsis, thereby augmenting patient outcomes. Nevertheless, the majority of available models consider the overall physiological attributes of patients, overlooking the asynchronous spatiotemporal interactions among multiple organ systems. These constraints hinder a full application of such models, particularly when dealing with limited clinical data. To surmount these challenges, a comprehensive model, denoted as recurrent Graph Attention Network-multi Gated Recurrent Unit (rGAT-mGRU), was proposed. Taking into account the intricate spatiotemporal interactions among multiple organ systems, the model predicted in-hospital mortality of sepsis using data collected within the 48-hour period post-diagnosis. MATERIAL AND METHODS Multiple parallel GRU sub-models were formulated to investigate the temporal physiological variations of single organ systems. Meanwhile, a GAT structure featuring a memory unit was constructed to capture spatiotemporal connections among multi-organ systems. Additionally, an attention-injection mechanism was employed to govern the data flowing within the network pertaining to multi-organ systems. The proposed model underwent training and testing using a dataset of 10,181 sepsis cases extracted from the Medical Information Mart for Intensive Care III (MIMIC-III) database. To evaluate the model's superiority, it was compared with the existing common baseline models. Furthermore, ablation experiments were designed to elucidate the rationale and robustness of the proposed model. 
RESULTS Compared with the baseline models for predicting mortality of sepsis, the rGAT-mGRU model demonstrated the largest area under the receiver operating characteristic curve (AUROC) of 0.8777 ± 0.0039 and the maximum area under the precision-recall curve (AUPRC) of 0.5818 ± 0.0071, with sensitivity of 0.8358 ± 0.0302 and specificity of 0.7727 ± 0.0229, respectively. The proposed model was capable of delineating the varying contribution of the involved organ systems at distinct moments, as specifically illustrated by the attention weights. Furthermore, it exhibited consistent performance even in the face of limited clinical data. CONCLUSION The rGAT-mGRU model has the potential to indicate sepsis prognosis by extracting the dynamic spatiotemporal interplay information inherent in multi-organ systems during critical diseases, thereby providing clinicians with auxiliary decision-making support.
Affiliation(s)
- Xue Feng
  - Department of Biomedical Engineering, Zhejiang University, Hangzhou 310027, China
- Siyi Zhu
  - Department of Biomedical Engineering, Zhejiang University, Hangzhou 310027, China
- Yanfei Shen
  - Intensive Care Unit, Zhejiang Hospital, Hangzhou 310013, China
- Huaiping Zhu
  - Department of Mathematics and Statistics, York University, Toronto M3J1P3, Canada
- Molei Yan
  - Intensive Care Unit, Zhejiang Hospital, Hangzhou 310013, China
- Guolong Cai
  - Intensive Care Unit, Zhejiang Hospital, Hangzhou 310013, China
- Gangmin Ning
  - Department of Biomedical Engineering, Zhejiang University, Hangzhou 310027, China
  - Research Center for Healthcare Data Science, Zhejiang Lab, Hangzhou 311121, China
7
Zhang Y, Sui X, Pan F, Yu K, Li K, Tian S, Erdengasileng A, Han Q, Wang W, Wang J, Wang J, Sun D, Chung H, Zhou J, Zhou E, Lee B, Zhang P, Qiu X, Zhao T, Zhang J. BioKG: a comprehensive, large-scale biomedical knowledge graph for AI-powered, data-driven biomedical research. bioRxiv 2023:2023.10.13.562216. [PMID: 38168218 PMCID: PMC10760044 DOI: 10.1101/2023.10.13.562216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2024]
Abstract
To cope with the rapid growth of scientific publications and data in biomedical research, knowledge graphs (KGs) have emerged as a powerful data structure for integrating large volumes of heterogeneous data to facilitate accurate and efficient information retrieval and automated knowledge discovery (AKD). However, transforming unstructured content from scientific literature into KGs has remained a significant challenge, with previous methods unable to achieve human-level accuracy. In this study, we utilized an information extraction pipeline that won first place in the LitCoin NLP Challenge to construct a large-scale KG using all PubMed abstracts. The quality of the large-scale information extraction rivals that of human expert annotations, signaling a new era of automatic, high-quality database construction from literature. Our extracted information markedly surpasses the amount of content in manually curated public databases. To enhance the KG's comprehensiveness, we integrated relation data from 40 public databases and relation information inferred from high-throughput genomics data. The comprehensive KG enabled rigorous performance evaluation of AKD, which was infeasible in previous studies. We designed an interpretable, probabilistic-based inference method to identify indirect causal relations and achieved unprecedented results for drug target identification and drug repurposing. Taking lung cancer as an example, we found that 40% of drug targets reported in literature could have been predicted by our algorithm about 15 years ago in a retrospective study, demonstrating that substantial acceleration in scientific discovery could be achieved through automated hypothesis generation and timely dissemination. A cloud-based platform (https://www.biokde.com) was developed for academic users to freely access this rich structured data and associated tools.
Affiliation(s)
- Yuan Zhang
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Xin Sui
  - Insilicom LLC, Tallahassee, FL 32303
- Feng Pan
  - Insilicom LLC, Tallahassee, FL 32303
- Keqiao Li
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Shubo Tian
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Qing Han
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Wanjing Wang
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Jian Wang
  - 977 Wisteria Ter., Sunnyvale, CA 94086
- Jun Zhou
  - Insilicom LLC, Tallahassee, FL 32303
- Eric Zhou
  - Insilicom LLC, Tallahassee, FL 32303
- Ben Lee
  - Insilicom LLC, Tallahassee, FL 32303
- Peili Zhang
  - Forward Informatics, Winchester, Massachusetts, 01890
- Xing Qiu
  - Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14642
- Tingting Zhao
  - Department of Geography, Florida State University, Tallahassee, FL 32306
- Jinfeng Zhang
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
  - Insilicom LLC, Tallahassee, FL 32303
8
Gonzalez-Cavazos AC, Tanska A, Mayers M, Carvalho-Silva D, Sridharan B, Rewers PA, Sankarlal U, Jagannathan L, Su AI. DrugMechDB: A Curated Database of Drug Mechanisms. Sci Data 2023; 10:632. [PMID: 37717042 PMCID: PMC10505144 DOI: 10.1038/s41597-023-02534-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Accepted: 09/01/2023] [Indexed: 09/18/2023] Open
Abstract
Computational drug repositioning methods have emerged as an attractive and effective solution to find new candidates for existing therapies, reducing the time and cost of drug development. Repositioning methods based on biomedical knowledge graphs typically offer useful supporting biological evidence. This evidence is based on reasoning chains or subgraphs that connect a drug to a disease prediction. However, there are no databases of drug mechanisms that can be used to train and evaluate such methods. Here, we introduce the Drug Mechanism Database (DrugMechDB), a manually curated database that describes drug mechanisms as paths through a knowledge graph. DrugMechDB integrates a diverse range of authoritative free-text resources to describe 4,583 drug indications with 32,249 relationships, representing 14 major biological scales. DrugMechDB can be employed as a benchmark dataset for assessing computational drug repositioning models or as a valuable resource for training such models.
Affiliation(s)
- Adriana Carolina Gonzalez-Cavazos
  - The Scripps Research Institute, Department of Integrative Structural and Computational Biology, 10550 N Torrey Pines Rd, La Jolla, CA, 92037, USA
- Anna Tanska
  - The Scripps Research Institute, Department of Integrative Structural and Computational Biology, 10550 N Torrey Pines Rd, La Jolla, CA, 92037, USA
- Michael Mayers
  - The Scripps Research Institute, Department of Integrative Structural and Computational Biology, 10550 N Torrey Pines Rd, La Jolla, CA, 92037, USA
- Denise Carvalho-Silva
  - The Scripps Research Institute, Department of Integrative Structural and Computational Biology, 10550 N Torrey Pines Rd, La Jolla, CA, 92037, USA
- Brindha Sridharan
  - The Scripps Research Institute, Department of Integrative Structural and Computational Biology, 10550 N Torrey Pines Rd, La Jolla, CA, 92037, USA
- Patrick A Rewers
  - The Scripps Research Institute, Department of Integrative Structural and Computational Biology, 10550 N Torrey Pines Rd, La Jolla, CA, 92037, USA
- Umasri Sankarlal
  - The Scripps Research Institute, Department of Integrative Structural and Computational Biology, 10550 N Torrey Pines Rd, La Jolla, CA, 92037, USA
- Lakshmanan Jagannathan
  - The Scripps Research Institute, Department of Integrative Structural and Computational Biology, 10550 N Torrey Pines Rd, La Jolla, CA, 92037, USA
- Andrew I Su
  - The Scripps Research Institute, Department of Integrative Structural and Computational Biology, 10550 N Torrey Pines Rd, La Jolla, CA, 92037, USA
9
Hou Y, Yeung J, Xu H, Su C, Wang F, Zhang R. From Answers to Insights: Unveiling the Strengths and Limitations of ChatGPT and Biomedical Knowledge Graphs. Research Square 2023:rs.3.rs-3185632. [PMID: 37577545 PMCID: PMC10418534 DOI: 10.21203/rs.3.rs-3185632/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/15/2023]
Abstract
Purpose Large Language Models (LLMs) have shown exceptional performance in various natural language processing tasks, benefiting from their language generation capabilities and ability to acquire knowledge from unstructured text. However, in the biomedical domain, LLMs face limitations that lead to inaccurate and inconsistent answers. Knowledge Graphs (KGs) have emerged as valuable resources for organizing structured information. Biomedical Knowledge Graphs (BKGs) have gained significant attention for managing diverse and large-scale biomedical knowledge. The objective of this study is to assess and compare the capabilities of ChatGPT and existing BKGs in question-answering, biomedical knowledge discovery, and reasoning tasks within the biomedical domain. Methods We conducted a series of experiments to assess the performance of ChatGPT and the BKGs in various aspects of querying existing biomedical knowledge, knowledge discovery, and knowledge reasoning. Firstly, we tasked ChatGPT with answering questions sourced from the "Alternative Medicine" sub-category of Yahoo! Answers and recorded the responses. Additionally, we queried BKG to retrieve the relevant knowledge records corresponding to the questions and assessed them manually. In another experiment, we formulated a prediction scenario to assess ChatGPT's ability to suggest potential drug/dietary supplement repurposing candidates. Simultaneously, we utilized BKG to perform link prediction for the same task. The outcomes of ChatGPT and BKG were compared and analyzed. Furthermore, we evaluated ChatGPT and BKG's capabilities in establishing associations between pairs of proposed entities. This evaluation aimed to assess their reasoning abilities and the extent to which they can infer connections within the knowledge domain. Results The results indicate that ChatGPT with GPT-4.0 outperforms both GPT-3.5 and BKGs in providing existing information. However, BKGs demonstrate higher reliability in terms of information accuracy. ChatGPT exhibits limitations in performing novel discoveries and reasoning, particularly in establishing structured links between entities compared to BKGs. Conclusions To address the limitations observed, future research should focus on integrating LLMs and BKGs to leverage the strengths of both approaches. Such integration would optimize task performance and mitigate potential risks, leading to advancements in knowledge within the biomedical field and contributing to the overall well-being of individuals.
10
Hou Y, Yeung J, Xu H, Su C, Wang F, Zhang R. From Answers to Insights: Unveiling the Strengths and Limitations of ChatGPT and Biomedical Knowledge Graphs. medRxiv 2023:2023.06.09.23291208. [PMID: 37398259 PMCID: PMC10312889 DOI: 10.1101/2023.06.09.23291208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
Large Language Models (LLMs) have demonstrated exceptional performance in various natural language processing tasks, utilizing their language generation capabilities and knowledge acquisition potential from unstructured text. However, when applied to the biomedical domain, LLMs encounter limitations, resulting in erroneous and inconsistent answers. Knowledge Graphs (KGs) have emerged as valuable resources for structured information representation and organization. Specifically, Biomedical Knowledge Graphs (BKGs) have attracted significant interest in managing large-scale and heterogeneous biomedical knowledge. This study evaluates the capabilities of ChatGPT and existing BKGs in question answering, knowledge discovery, and reasoning. Results indicate that while ChatGPT with GPT-4.0 surpasses both GPT-3.5 and BKGs in providing existing information, BKGs demonstrate superior information reliability. Additionally, ChatGPT exhibits limitations in performing novel discoveries and reasoning, particularly in establishing structured links between entities compared to BKGs. To overcome these limitations, future research should focus on integrating LLMs and BKGs to leverage their respective strengths. Such an integrated approach would optimize task performance and mitigate potential risks, thereby advancing knowledge in the biomedical field and contributing to overall well-being.
Affiliation(s)
- Yu Hou
  - Department of Surgery, University of Minnesota, Minneapolis, MN, USA
- Jeremy Yeung
  - Department of Surgery, University of Minnesota, Minneapolis, MN, USA
- Hua Xu
  - Section of Biomedical Informatics and Data Science, Yale University, New Haven, Connecticut, USA
- Chang Su
  - Department of Health Service Administration and Policy, Temple University, Philadelphia, PA, USA
- Fei Wang
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Rui Zhang
  - Department of Surgery, University of Minnesota, Minneapolis, MN, USA
|
11
|
Murali L, Gopakumar G, Viswanathan DM, Nedungadi P. Towards electronic health record-based medical knowledge graph construction, completion, and applications: A literature study. J Biomed Inform 2023:104403. [PMID: 37230406 DOI: 10.1016/j.jbi.2023.104403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2023] [Revised: 05/16/2023] [Accepted: 05/19/2023] [Indexed: 05/27/2023]
Abstract
With the growth of data and intelligent technologies, the healthcare sector has opened up numerous technology-enabled services for patients, clinicians, and researchers. One major hurdle in achieving state-of-the-art results in health informatics is domain-specific terminologies and their semantic complexities. A knowledge graph crafted from medical concepts, events, and relationships acts as a medical semantic network to extract new links and hidden patterns from health data sources. Current medical knowledge graph construction studies are limited to generic techniques and opportunities and focus less on exploiting real-world data sources in knowledge graph construction. A knowledge graph constructed from Electronic Health Records (EHR) data obtains real-world data from healthcare records. It ensures better results in subsequent tasks like knowledge extraction and inference, knowledge graph completion, and medical knowledge graph applications such as diagnosis predictions, clinical recommendations, and clinical decision support. This review critically analyses existing works on medical knowledge graphs that used EHR data as the data source at the (i) representation level, (ii) extraction level, and (iii) completion level. In this investigation, we found that EHR-based knowledge graph construction involves challenges such as high complexity and dimensionality of data, lack of knowledge fusion, and dynamic update of the knowledge graph. In addition, the study presents possible ways to tackle the challenges identified. We conclude that future research should focus on knowledge graph integration and knowledge graph completion challenges.
Affiliation(s)
- Lino Murali
  - Center for Research in Analytics and Technologies for Education (CREATE), Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, 690525, Kerala, India; Division of Information technology, School of Engineering, Cochin University of Science and Technology, Kochi, 682022, Kerala, India
- G Gopakumar
  - Department of Computer Science and Engineering, School of Computing, Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, 690525, Kerala, India
- Daleesha M Viswanathan
  - Division of Information technology, School of Engineering, Cochin University of Science and Technology, Kochi, 682022, Kerala, India
- Prema Nedungadi
  - Center for Research in Analytics and Technologies for Education (CREATE), Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, 690525, Kerala, India; Department of Computer Science and Engineering, School of Computing, Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, 690525, Kerala, India
|
12
|
Su C, Hou Y, Zhou M, Rajendran S, Maasch JRA, Abedi Z, Zhang H, Bai Z, Cuturrufo A, Guo W, Chaudhry FF, Ghahramani G, Tang J, Cheng F, Li Y, Zhang R, DeKosky ST, Bian J, Wang F. Biomedical discovery through the integrative biomedical knowledge hub (iBKH). iScience 2023; 26:106460. [PMID: 37020958 PMCID: PMC10068563 DOI: 10.1016/j.isci.2023.106460] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 06/16/2022] [Revised: 09/20/2022] [Accepted: 03/16/2023] [Indexed: 04/01/2023] Open
Abstract
The abundance of biomedical knowledge gained from biological experiments and clinical practices is an invaluable resource for biomedicine. The emerging biomedical knowledge graphs (BKGs) provide an efficient and effective way to manage the abundant knowledge in biomedicine and the life sciences. In this study, we created a comprehensive BKG called the integrative Biomedical Knowledge Hub (iBKH) by harmonizing and integrating information from diverse biomedical resources. To make iBKH easily accessible for biomedical research, we developed a web-based, user-friendly graphical portal that allows fast and interactive knowledge retrieval. Additionally, we implemented an efficient and scalable graph learning pipeline for discovering novel biomedical knowledge in iBKH. As a proof of concept, we applied our iBKH-based method to computational in-silico drug repurposing for Alzheimer's disease. The iBKH is publicly available.
Affiliation(s)
- Chang Su
  - Department of Health Service Administration and Policy, College of Public Health, Temple University, Philadelphia, PA 19122, USA
- Yu Hou
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
  - Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Manqi Zhou
  - Department of Computational Biology, Cornell University, Ithaca, NY 14850, USA
- Suraj Rajendran
  - Tri-Institutional Computational Biology & Medicine Program, Cornell University, New York, NY 10065, USA
- Zehra Abedi
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
- Haotan Zhang
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Zilong Bai
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
- Winston Guo
  - Department of Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Fayzan F. Chaudhry
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Gregory Ghahramani
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Jian Tang
  - Mila-Quebec AI Institute and HEC Montreal, Montreal, QC H2S 3H1, Canada
- Feixiong Cheng
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
  - Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA
  - Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA
- Yue Li
  - School of Computer Science, McGill University, Montreal, QC H3A 0C6, Canada
- Rui Zhang
  - Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Steven T. DeKosky
  - Department of Neurology, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Jiang Bian
  - Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Fei Wang
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
|
13
|
Yang Y, Lu Y, Yan W. A comprehensive review on knowledge graphs for complex diseases. Brief Bioinform 2023; 24:6931722. [PMID: 36528805 DOI: 10.1093/bib/bbac543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2022] [Revised: 11/02/2022] [Accepted: 11/10/2022] [Indexed: 12/23/2022] Open
Abstract
In recent years, knowledge graphs (KGs) have gained a great deal of popularity as a tool for storing relationships between entities and for performing higher level reasoning. KGs in biomedicine and clinical practice aim to provide an elegant solution for diagnosing and treating complex diseases more efficiently and flexibly. Here, we provide a systematic review to characterize the state-of-the-art of KGs in the area of complex disease research. We cover the following topics: (1) knowledge sources, (2) entity extraction methods, (3) relation extraction methods and (4) the application of KGs in complex diseases. As a result, we offer a complete picture of the domain. Finally, we discuss the challenges in the field by identifying gaps and opportunities for further research and propose potential research directions of KGs for complex disease diagnosis and treatment.
Affiliation(s)
- Yang Yang
  - School of Computer Science & Technology, Soochow University, Suzhou 215000, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
- Yuwei Lu
  - School of Computer Science & Technology, Soochow University, Suzhou 215000, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
- Wenying Yan
  - Department of Bioinformatics, School of Biology and Basic Medical Sciences, Medical College of Soochow University, and Center for Systems Biology, Soochow University, Suzhou 215123, China
|
14
|
Weber L, Sänger M, Garda S, Barth F, Alt C, Leser U. Chemical-protein relation extraction with ensembles of carefully tuned pretrained language models. Database (Oxford) 2022; 2022:6833204. [PMID: 36399413 PMCID: PMC9674024 DOI: 10.1093/database/baac098] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/10/2022] [Revised: 10/18/2022] [Accepted: 10/21/2022] [Indexed: 11/19/2022]
Abstract
The identification of chemical-protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manual extraction of such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benchmark for methods for the automated extraction of chemical-protein relations from scientific text. Here we describe our contribution to the shared task and report on the achieved results. We define the task as a relation classification problem, which we approach with pretrained transformer language models. Upon this basic architecture, we experiment with utilizing textual and embedded side information from knowledge bases as well as additional training data to improve extraction performance. We perform a comprehensive evaluation of the proposed model and the individual extensions including an extensive hyperparameter search leading to 2647 different runs. We find that ensembling and choosing the right pretrained language model are crucial for optimal performance, whereas adding additional data and embedded side information did not improve results. Our best model is based on an ensemble of 10 pretrained transformers and additional textual descriptions of chemicals taken from the Comparative Toxicogenomics Database. The model reaches an F1 score of 79.73% on the hidden DrugProt test set and achieves the first rank out of 107 submitted runs in the official evaluation. Database URL: https://github.com/leonweber/drugprot.
Affiliation(s)
- Leon Weber
  - *Corresponding authors: Tel: +49 30 209341293
- Mario Sänger
  - Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
- Samuele Garda
  - Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
- Fabio Barth
  - Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
- Christoph Alt
  - Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany; Research Cluster of Excellence, Science of Intelligence, Marchstr. 23, Berlin 10587, Germany
- Ulf Leser
  - *Corresponding authors: Tel: +49 30 209341293
|
15
|
Darwish O, Tashtoush Y, Bashayreh A, Alomar A, Alkhaza’leh S, Darweesh D. A survey of uncover misleading and cyberbullying on social media for public health. CLUSTER COMPUTING 2022; 26:1709-1735. [PMID: 36034676 PMCID: PMC9396598 DOI: 10.1007/s10586-022-03706-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/18/2022] [Revised: 07/18/2022] [Accepted: 08/08/2022] [Indexed: 05/25/2023]
Abstract
Misleading health information is a critical phenomenon in modern life due to advances in technology. Social media has facilitated the dissemination of information, and as a result, misinformation spreads rapidly, cheaply, and successfully. Fake health information can have a significant effect on human behavior and attitudes. This survey presents current work on misleading information detection (MLID) in health fields based on machine learning and deep learning techniques and gives a detailed discussion of the main phases of the generic approach adopted for MLID. In addition, we highlight the benchmarking datasets and discuss the most commonly used metrics for evaluating the performance of MLID algorithms. Finally, we provide an in-depth investigation of the limitations and drawbacks of current technologies in various research directions to help researchers choose the most appropriate methods for this emerging task.
Affiliation(s)
- Omar Darwish
  - Information Security and Applied Computing, Eastern Michigan University, 900 Oakwood St, Ypsilanti, MI 48197 USA
- Yahya Tashtoush
  - Department of Computer Science, Jordan University of Science and Technology, Irbid, 22110 Jordan
- Amjad Bashayreh
  - Department of Computer Science, Jordan University of Science and Technology, Irbid, 22110 Jordan
- Alaa Alomar
  - Department of Computer Science, Jordan University of Science and Technology, Irbid, 22110 Jordan
- Shahed Alkhaza’leh
  - Department of Computer Science, Jordan University of Science and Technology, Irbid, 22110 Jordan
- Dirar Darweesh
  - Department of Computer Science, Jordan University of Science and Technology, Irbid, 22110 Jordan
|
16
|
Youn J, Rai N, Tagkopoulos I. Knowledge integration and decision support for accelerated discovery of antibiotic resistance genes. Nat Commun 2022; 13:2360. [PMID: 35487919 PMCID: PMC9055065 DOI: 10.1038/s41467-022-29993-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2021] [Accepted: 03/04/2022] [Indexed: 11/09/2022] Open
Abstract
We present a machine learning framework to automate knowledge discovery through knowledge graph construction, inconsistency resolution, and iterative link prediction. By incorporating knowledge from 10 publicly available sources, we construct an Escherichia coli antibiotic resistance knowledge graph with 651,758 triples from 23 triple types after resolving 236 sets of inconsistencies. Iteratively applying link prediction to this graph and wet-lab validation of the generated hypotheses reveal 15 antibiotic resistant E. coli genes, with 6 of them never associated with antibiotic resistance for any microbe. Iterative link prediction leads to a performance improvement and more findings. The probability of positive findings highly correlates with experimentally validated findings (R² = 0.94). We also identify 5 homologs in Salmonella enterica that are all validated to confer resistance to antibiotics. This work demonstrates how evidence-driven decisions are a step toward automating knowledge discovery with high confidence and accelerated pace, thereby substituting traditional time-consuming and expensive methods.
Affiliation(s)
- Jason Youn
  - Department of Computer Science, University of California, Davis, CA, 95616, USA
  - Genome Center, University of California, Davis, CA, 95616, USA
  - USDA/NSF AI Institute for Next Generation Food Systems (AIFS), University of California, Davis, CA, 95616, USA
- Navneet Rai
  - Department of Computer Science, University of California, Davis, CA, 95616, USA
  - Genome Center, University of California, Davis, CA, 95616, USA
  - USDA/NSF AI Institute for Next Generation Food Systems (AIFS), University of California, Davis, CA, 95616, USA
- Ilias Tagkopoulos
  - Department of Computer Science, University of California, Davis, CA, 95616, USA
  - Genome Center, University of California, Davis, CA, 95616, USA
  - USDA/NSF AI Institute for Next Generation Food Systems (AIFS), University of California, Davis, CA, 95616, USA
|
17
|
Morid MA, Sheng ORL, Dunbar J. Time Series Prediction Using Deep Learning Methods in Healthcare. ACM TRANSACTIONS ON MANAGEMENT INFORMATION SYSTEMS 2022. [DOI: 10.1145/3531326] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Abstract
Traditional Machine Learning (ML) methods face unique challenges when applied to healthcare predictive analytics. The high-dimensional nature of healthcare data necessitates labor-intensive and time-consuming processes when selecting an appropriate set of features for each new task. Furthermore, ML methods depend heavily on feature engineering to capture the sequential nature of patient data, oftentimes failing to adequately leverage the temporal patterns of medical events and their dependencies. In contrast, recent Deep Learning (DL) methods have shown promising performance for various healthcare prediction tasks by specifically addressing the high-dimensional and temporal challenges of medical data. DL techniques excel at learning useful representations of medical concepts and patient clinical data as well as their nonlinear interactions from high-dimensional raw or minimally-processed healthcare data.
In this paper we systematically reviewed research works that focused on advancing deep neural networks to leverage patient structured time series data for healthcare prediction tasks. To identify relevant studies, we searched MEDLINE, IEEE, Scopus, and ACM digital library for relevant publications through November 4th, 2021. Overall, we found that researchers have contributed to deep time series prediction literature in ten identifiable research streams: DL models, missing value handling, addressing temporal irregularity, patient representation, static data inclusion, attention mechanisms, interpretation, incorporation of medical ontologies, learning strategies, and scalability. This study summarizes research insights from these literature streams, identifies several critical research gaps, and suggests future research opportunities for DL applications using patient time series data.
Affiliation(s)
- Mohammad Amin Morid
  - Department of Information Systems and Analytics, Leavey School of Business, Santa Clara University, Santa Clara, CA, USA
- Olivia R. Liu Sheng
  - Department of Operations and Information Systems, David Eccles School of Business, University of Utah, Salt Lake City, UT, USA
- Joseph Dunbar
  - Department of Operations and Information Systems, David Eccles School of Business, University of Utah, Salt Lake City, UT, USA
|
18
|
Prostate cancer management with lifestyle intervention: From knowledge graph to Chatbot. CLINICAL AND TRANSLATIONAL DISCOVERY 2022. [DOI: 10.1002/ctd2.29] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 02/05/2023]
|
19
|
Saldívar-González FI, Aldas-Bulos VD, Medina-Franco JL, Plisson F. Natural product drug discovery in the artificial intelligence era. Chem Sci 2022; 13:1526-1546. [PMID: 35282622 PMCID: PMC8827052 DOI: 10.1039/d1sc04471k] [Citation(s) in RCA: 50] [Impact Index Per Article: 25.0] [Received: 08/13/2021] [Accepted: 12/10/2021] [Indexed: 12/19/2022] Open
Abstract
Natural products (NPs) are primarily recognized as privileged structures to interact with protein drug targets. Their unique characteristics and structural diversity continue to amaze scientists developing NP-inspired medicines, even though the pharmaceutical industry has largely given up. High-performance computer hardware, extensive storage, accessible software and affordable online education have democratized the use of artificial intelligence (AI) in many sectors and research areas. The last decades have introduced natural language processing and machine learning algorithms, two subfields of AI, to tackle NP drug discovery challenges and open up opportunities. In this article, we review and discuss the rational applications of AI approaches developed to assist in discovering bioactive NPs and capturing the molecular "patterns" of these privileged structures for combinatorial design or target selectivity.
Affiliation(s)
- F I Saldívar-González
  - DIFACQUIM Research Group, School of Chemistry, Department of Pharmacy, Universidad Nacional Autónoma de México, Avenida Universidad 3000, 04510 Mexico, Mexico
- V D Aldas-Bulos
  - Unidad de Genómica Avanzada, Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), Centro de Investigación y de Estudios Avanzados del IPN, Irapuato, Guanajuato, Mexico
- J L Medina-Franco
  - DIFACQUIM Research Group, School of Chemistry, Department of Pharmacy, Universidad Nacional Autónoma de México, Avenida Universidad 3000, 04510 Mexico, Mexico
- F Plisson
  - CONACYT - Unidad de Genómica Avanzada, Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), Centro de Investigación y de Estudios Avanzados del IPN, Irapuato, Guanajuato, Mexico
|
20
|
Kumar S, Nanelia A, Mariappan R, Rajagopal A, Rajan V. Patient Representation Learning From Heterogeneous Data Sources and Knowledge Graphs Using Deep Collective Matrix Factorization: Evaluation Study. JMIR Med Inform 2022; 10:e28842. [PMID: 35049514 PMCID: PMC8814927 DOI: 10.2196/28842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2021] [Revised: 11/07/2021] [Accepted: 11/14/2021] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Patient representation learning aims to learn features, also called representations, from input sources automatically, often in an unsupervised manner, for use in predictive models. This obviates the need for cumbersome, time- and resource-intensive manual feature engineering, especially from unstructured data such as text, images, or graphs. Most previous techniques have used neural network-based autoencoders to learn patient representations, primarily from clinical notes in electronic medical records (EMRs). Knowledge graphs (KGs), with clinical entities as nodes and their relations as edges, can be extracted automatically from biomedical literature and provide complementary information to EMR data that have been found to provide valuable predictive signals. OBJECTIVE This study aims to evaluate the efficacy of collective matrix factorization (CMF), both the classical variant and a recent neural architecture called deep CMF (DCMF), in integrating heterogeneous data sources from EMR and KG to obtain patient representations for clinical decision support tasks. METHODS Using a recent formulation for obtaining graph representations through matrix factorization within the context of CMF, we infused auxiliary information during patient representation learning. We also extended the DCMF architecture to create a task-specific end-to-end model that learns to simultaneously find effective patient representations and predictions. We compared the efficacy of such a model to that of first learning unsupervised representations and then independently learning a predictive model. We evaluated patient representation learning using CMF-based methods and autoencoders for 2 clinical decision support tasks on a large EMR data set. RESULTS Our experiments show that DCMF provides a seamless way for integrating multiple sources of data to obtain patient representations, both in unsupervised and supervised settings. 
Its performance in single-source settings is comparable with that of previous autoencoder-based representation learning methods. When DCMF is used to obtain representations from a combination of EMR and KG, where most previous autoencoder-based methods cannot be used directly, its performance is superior to that of previous nonneural methods for CMF. Infusing information from KGs into patient representations using DCMF was found to improve downstream predictive performance. CONCLUSIONS Our experiments indicate that DCMF is a versatile model that can be used to obtain representations from single and multiple data sources and combine information from EMR data and KGs. Furthermore, DCMF can be used to learn representations in both supervised and unsupervised settings. Thus, DCMF offers an effective way of integrating heterogeneous data sources and infusing auxiliary knowledge into patient representations.
Affiliation(s)
- Alicia Nanelia
  - Department of Information Systems and Analytics, National University of Singapore, Singapore, Singapore
- Ragunathan Mariappan
  - Department of Information Systems and Analytics, National University of Singapore, Singapore, Singapore
- Vaibhav Rajan
  - Department of Information Systems and Analytics, National University of Singapore, Singapore, Singapore
|
21
|
Leysen H, Walter D, Christiaenssen B, Vandoren R, Harputluoğlu İ, Van Loon N, Maudsley S. GPCRs Are Optimal Regulators of Complex Biological Systems and Orchestrate the Interface between Health and Disease. Int J Mol Sci 2021; 22:ijms222413387. [PMID: 34948182 PMCID: PMC8708147 DOI: 10.3390/ijms222413387] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/24/2021] [Revised: 12/08/2021] [Accepted: 12/09/2021] [Indexed: 02/06/2023] Open
Abstract
GPCRs arguably represent the most effective current therapeutic targets for a plethora of diseases. GPCRs also possess a pivotal role in the regulation of the physiological balance between healthy and pathological conditions; thus, their importance in systems biology cannot be underestimated. The molecular diversity of GPCR signaling systems is likely to be closely associated with disease-associated changes in organismal tissue complexity and compartmentalization, thus enabling a nuanced GPCR-based capacity to interdict multiple disease pathomechanisms at a systemic level. GPCRs have been long considered as controllers of communication between tissues and cells. This communication involves the ligand-mediated control of cell surface receptors that then direct their stimuli to impact cell physiology. Given the tremendous success of GPCRs as therapeutic targets, considerable focus has been placed on the ability of these therapeutics to modulate diseases by acting at cell surface receptors. In the past decade, however, attention has focused upon how stable multiprotein GPCR superstructures, termed receptorsomes, both at the cell surface membrane and in the intracellular domain dictate and condition long-term GPCR activities associated with the regulation of protein expression patterns, cellular stress responses and DNA integrity management. The ability of these receptorsomes (often in the absence of typical cell surface ligands) to control complex cellular activities implicates them as key controllers of the functional balance between health and disease. A greater understanding of this function of GPCRs is likely to significantly augment our ability to further employ these proteins in a multitude of diseases.
Affiliation(s)
- Hanne Leysen
  - Receptor Biology Lab, University of Antwerp, 2610 Wilrijk, Belgium
- Deborah Walter
  - Receptor Biology Lab, University of Antwerp, 2610 Wilrijk, Belgium
- Bregje Christiaenssen
  - Receptor Biology Lab, University of Antwerp, 2610 Wilrijk, Belgium
- Romi Vandoren
  - Receptor Biology Lab, University of Antwerp, 2610 Wilrijk, Belgium
- İrem Harputluoğlu
  - Receptor Biology Lab, University of Antwerp, 2610 Wilrijk, Belgium
  - Department of Chemistry, Middle East Technical University, Çankaya, Ankara 06800, Turkey
- Nore Van Loon
  - Receptor Biology Lab, University of Antwerp, 2610 Wilrijk, Belgium
- Stuart Maudsley
  - Receptor Biology Lab, University of Antwerp, 2610 Wilrijk, Belgium
|
22
|
Dasgupta S, Jayagopal A, Jun Hong AL, Mariappan R, Rajan V. Adverse Drug Event Prediction Using Noisy Literature-Derived Knowledge Graphs: Algorithm Development and Validation. JMIR Med Inform 2021; 9:e32730. [PMID: 34694230 PMCID: PMC8576589 DOI: 10.2196/32730] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/08/2021] [Revised: 09/07/2021] [Accepted: 09/18/2021] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND Adverse drug events (ADEs) are unintended side effects of drugs that cause substantial clinical and economic burdens globally. Not all ADEs are discovered during clinical trials; therefore, postmarketing surveillance, called pharmacovigilance, is routinely conducted to find unknown ADEs. A wealth of information, which facilitates ADE discovery, lies in the growing body of biomedical literature. Knowledge graphs (KGs) encode information from the literature, where the vertices and the edges represent clinical concepts and their relations, respectively. The scale and unstructured form of the literature necessitate the use of natural language processing (NLP) to automatically create such KGs. Previous studies have demonstrated the utility of such literature-derived KGs in ADE prediction. Through unsupervised learning of the representations (features) of clinical concepts from the KG, which are used in machine learning models, state-of-the-art results for ADE prediction were obtained on benchmark data sets. OBJECTIVE Due to the use of NLP to infer literature-derived KGs, there is noise in the form of false positive (erroneous) and false negative (absent) nodes and edges. Previous representation learning methods do not account for such inaccuracies in the graph. NLP algorithms can quantify the confidence in their inference of extracted concepts and relations from the literature. Our hypothesis, which motivates this work, is that by using such confidence scores during representation learning, the learned embeddings would yield better features for ADE prediction models. METHODS We developed methods to use these confidence scores on two well-known representation learning methods, DeepWalk and Translating Embeddings for Modeling Multi-relational Data (TransE), to develop their weighted versions: Weighted DeepWalk and Weighted TransE.
These methods were used to learn representations from a large literature-derived KG, the Semantic MEDLINE Database, which contains more than 93 million clinical relations. They were compared with Embedding of Semantic Predications, which, to our knowledge, is the best reported representation learning method using the Semantic MEDLINE Database with state-of-the-art results for ADE prediction. Representations learned from different methods were used (separately) as features of drugs and diseases to build classification models for ADE prediction using benchmark data sets. The methods were compared rigorously over multiple cross-validation settings. RESULTS The weighted versions we designed were able to learn representations that yielded more accurate predictive models than the corresponding unweighted versions of both DeepWalk and TransE, as well as Embedding of Semantic Predications, in our experiments. There were performance improvements of up to 5.75% in the F1-score and 8.4% in the area under the receiver operating characteristic curve value, thus advancing the state of the art in ADE prediction from literature-derived KGs. CONCLUSIONS Our classification models can be used to aid pharmacovigilance teams in detecting potentially new ADEs. Our experiments demonstrate the importance of modeling inaccuracies in the inferred KGs for representation learning.
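Note: the abstract above does not state how the confidence scores enter Weighted DeepWalk. One plausible reading, shown here only as a minimal sketch, is that each random-walk step picks the next node with probability proportional to the confidence of the connecting edge. The function name `weighted_random_walk`, the adjacency format, and the toy graph are all hypothetical and are not taken from the paper.

```python
import random


def weighted_random_walk(graph, start, length, rng=None):
    """Sample one walk of at most `length` nodes.

    `graph` maps a node to a list of (neighbor, confidence) pairs.
    ASSUMPTION: the next node is drawn with probability proportional to
    the edge confidence; the paper's exact scheme may differ.
    """
    rng = rng or random.Random()
    walk = [start]
    current = start
    for _ in range(length - 1):
        neighbors = graph.get(current, [])
        if not neighbors:
            break  # dead end: stop the walk early
        nodes = [node for node, _ in neighbors]
        weights = [confidence for _, confidence in neighbors]
        current = rng.choices(nodes, weights=weights, k=1)[0]
        walk.append(current)
    return walk


# Hypothetical toy graph: the zero-confidence edge is never followed.
toy_kg = {
    "aspirin": [("bleeding", 0.9), ("headache", 0.0)],
    "bleeding": [("aspirin", 0.9)],
}
walk = weighted_random_walk(toy_kg, "aspirin", 5, random.Random(0))
```

Walks sampled this way could then be fed to a skip-gram model in the usual DeepWalk fashion, so that low-confidence (possibly erroneous) edges contribute less to the learned embeddings.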
Affiliation(s)
- Abel Lim Jun Hong
- School of Computing, National University of Singapore, Singapore, Singapore
- Vaibhav Rajan
- Department of Information Systems and Analytics, National University of Singapore, Singapore, Singapore
23
Doğan T, Atas H, Joshi V, Atakan A, Rifaioglu A, Nalbat E, Nightingale A, Saidi R, Volynkin V, Zellner H, Cetin-Atalay R, Martin M, Atalay V. CROssBAR: comprehensive resource of biomedical relations with knowledge graph representations. Nucleic Acids Res 2021; 49:e96. [PMID: 34181736 PMCID: PMC8450100 DOI: 10.1093/nar/gkab543] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 10/24/2020] [Revised: 04/11/2021] [Accepted: 06/10/2021] [Indexed: 12/11/2022] Open
Abstract
Systemic analysis of available large-scale biological/biomedical data is critical for studying biological mechanisms, and developing novel and effective treatment approaches against diseases. However, different layers of the available data are produced using different technologies and scattered across individual computational resources without any explicit connections to each other, which hinders extensive and integrative multi-omics-based analysis. We aimed to address this issue by developing a new data integration/representation methodology and its application by constructing a biological data resource. CROssBAR is a comprehensive system that integrates large-scale biological/biomedical data from various resources and stores them in a NoSQL database. CROssBAR is enriched with the deep-learning-based prediction of relationships between numerous data entries, which is followed by the rigorous analysis of the enriched data to obtain biologically meaningful modules. These complex sets of entities and relationships are displayed to users via easy-to-interpret, interactive knowledge graphs within an open-access service. CROssBAR knowledge graphs incorporate relevant genes-proteins, molecular interactions, pathways, phenotypes, diseases, as well as known/predicted drugs and bioactive compounds, and they are constructed on-the-fly based on simple non-programmatic user queries. These intensely processed heterogeneous networks are expected to aid systems-level research, especially to infer biological mechanisms in relation to genes, proteins, their ligands, and diseases.
Affiliation(s)
- Tunca Doğan
- Department of Computer Engineering, Hacettepe University, Ankara 06800, Turkey
- Institute of Informatics, Hacettepe University, Ankara 06800, Turkey
- Cancer Systems Biology Laboratory, Graduate School of Informatics, METU, Ankara 06800, Turkey
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
- Heval Atas
- Cancer Systems Biology Laboratory, Graduate School of Informatics, METU, Ankara 06800, Turkey
- Vishal Joshi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
- Ahmet Atakan
- Department of Computer Engineering, METU, Ankara 06800, Turkey
- Department of Computer Engineering, EBYU, Erzincan 24002, Turkey
- Ahmet Sureyya Rifaioglu
- Department of Computer Engineering, METU, Ankara 06800, Turkey
- Department of Computer Engineering, İskenderun Technical University, Hatay 31200, Turkey
- Esra Nalbat
- Cancer Systems Biology Laboratory, Graduate School of Informatics, METU, Ankara 06800, Turkey
- Andrew Nightingale
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
- Rabie Saidi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
- Vladimir Volynkin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
- Hermann Zellner
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
- Rengul Cetin-Atalay
- Cancer Systems Biology Laboratory, Graduate School of Informatics, METU, Ankara 06800, Turkey
- Section of Pulmonary and Critical Care Medicine, University of Chicago, Chicago, IL 60637, USA
- Maria Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI), Hinxton, Cambridgeshire CB10 1SD, UK
- Volkan Atalay
- Department of Computer Engineering, METU, Ankara 06800, Turkey
24
Stojanov R, Popovski G, Cenikj G, Koroušić Seljak B, Eftimov T. A Fine-Tuned Bidirectional Encoder Representations From Transformers Model for Food Named-Entity Recognition: Algorithm Development and Validation. J Med Internet Res 2021; 23:e28229. [PMID: 34383671 PMCID: PMC8415558 DOI: 10.2196/28229] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/25/2021] [Revised: 03/13/2021] [Accepted: 05/06/2021] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Recently, food science has been garnering a lot of attention. There are many open research questions on food interactions, as one of the main environmental factors, with other health-related entities such as diseases, treatments, and drugs. In the last 2 decades, a large amount of work has been done in natural language processing and machine learning to enable biomedical information extraction. However, machine learning in food science domains remains inadequately resourced, which brings to attention the problem of developing methods for food information extraction. There are only a few food semantic resources and a few rule-based methods for food information extraction, which often depend on some external resources. However, an annotated corpus with food entities along with their normalization was published in 2019 by using several food semantic resources. OBJECTIVE In this study, we investigated how the recently published bidirectional encoder representations from transformers (BERT) model, which provides state-of-the-art results in information extraction, can be fine-tuned for food information extraction. METHODS We introduce FoodNER, which is a collection of corpus-based food named-entity recognition methods. It consists of 15 different models obtained by fine-tuning 3 pretrained BERT models on 5 groups of semantic resources: food versus nonfood entity, 2 subsets of Hansard food semantic tags, FoodOn semantic tags, and Systematized Nomenclature of Medicine Clinical Terms food semantic tags. RESULTS All BERT models provided very promising results with 93.30% to 94.31% macro F1 scores in the task of distinguishing food versus nonfood entity, which represents the new state-of-the-art technology in food information extraction. Considering the tasks where semantic tags are predicted, all BERT models obtained very promising results once again, with their macro F1 scores ranging from 73.39% to 78.96%. 
CONCLUSIONS FoodNER can be used to extract and annotate food entities in 5 different tasks: food versus nonfood entities and distinguishing food entities on the level of food groups by using the closest Hansard semantic tags, the parent Hansard semantic tags, the FoodOn semantic tags, or the Systematized Nomenclature of Medicine Clinical Terms semantic tags.
Affiliation(s)
- Riste Stojanov
- Faculty of Computer Science and Engineering, Ss Cyril and Methodius University, Skopje, the Former Yugoslav Republic of Macedonia
- Gorjan Popovski
- Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
- Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
- Gjorgjina Cenikj
- Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
- Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
- Tome Eftimov
- Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
25
Yang X, Wu C, Nenadic G, Wang W, Lu K. Mining a stroke knowledge graph from literature. BMC Bioinformatics 2021; 22:387. [PMID: 34325669 PMCID: PMC8319697 DOI: 10.1186/s12859-021-04292-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/13/2021] [Accepted: 07/06/2021] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Stroke has an acute onset and a high mortality rate, making it one of the most fatal diseases worldwide. Its underlying biology and treatments have been widely studied both in "Western" biomedicine and in Traditional Chinese Medicine (TCM). However, these two approaches are often studied and reported in isolation, both in the literature and in associated databases. RESULTS To aid research in finding effective prevention methods and treatments, we integrated knowledge from the literature and a number of databases (e.g. CID, TCMID, ETCM). We employed a suite of biomedical text mining (i.e. named-entity) approaches to identify mentions of genes, diseases, drugs, chemicals, symptoms, Chinese herbs and patent medicines, etc. in a large set of stroke papers from both biomedical and TCM domains. Then, using a combination of a rule-based approach with a pre-trained BioBERT model, we extracted and classified links and relationships among stroke-related entities as expressed in the literature. We constructed StrokeKG, a knowledge graph that includes almost 46 k nodes of nine types and 157 k links of 30 types, connecting diseases, genes, symptoms, drugs, pathways, herbs, chemicals, ingredients and patent medicines. CONCLUSIONS StrokeKG can provide practical and reliable stroke-related knowledge to support stroke research, such as exploring new research directions and ideas for drug repurposing and discovery. We make StrokeKG freely available at http://114.115.208.144:7474/browser/ (Please click "Connect" directly) and the source structured data for stroke at https://github.com/yangxi1016/Stroke.
Affiliation(s)
- Xi Yang
- College of Computer, National University of Defence Technology, Changsha, 410073 China
- State Key Laboratory of High-Performance Computing, National University of Defence Technology, Changsha, 410073 China
- Department of Computer Science, University of Manchester, Manchester, M13 9PL UK
- Chengkun Wu
- State Key Laboratory of High-Performance Computing, National University of Defence Technology, Changsha, 410073 China
- Goran Nenadic
- Department of Computer Science, University of Manchester, Manchester, M13 9PL UK
- Wei Wang
- College of Computer, National University of Defence Technology, Changsha, 410073 China
- Kai Lu
- College of Computer, National University of Defence Technology, Changsha, 410073 China
26
Fang A, Lou P, Hu J, Zhao W, Feng M, Ren H, Chen X. Head and Tail Entity Fusion Model in Medical Knowledge Graph Construction: Case Study for Pituitary Adenoma. JMIR Med Inform 2021; 9:e28218. [PMID: 34057414 PMCID: PMC8367125 DOI: 10.2196/28218] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2021] [Revised: 04/11/2021] [Accepted: 05/30/2021] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Pituitary adenoma is one of the most common central nervous system tumors. The diagnosis and treatment of pituitary adenoma remain very difficult. Misdiagnosis and recurrence often occur, and experienced neurosurgeons are in serious shortage. A knowledge graph can help interns quickly understand the medical knowledge related to pituitary tumor. OBJECTIVE The aim of this study was to develop a data fusion method suitable for medical data using data of pituitary adenomas integrated from different sources. The overall goal was to construct a knowledge graph for pituitary adenoma (KGPA) to be used for knowledge discovery. METHODS A complete framework suitable for the construction of a medical knowledge graph was developed, which was used to build the KGPA. The schema of the KGPA was manually constructed. Information of pituitary adenoma was automatically extracted from Chinese electronic medical records (CEMRs) and medical websites through a conditional random field model and newly designed web wrappers. An entity fusion method is proposed based on the head-and-tail entity fusion model to fuse the data from heterogeneous sources. RESULTS Data were extracted from 300 CEMRs of pituitary adenoma and 4 health portals. Entity fusion was carried out using the proposed data fusion model. The F1 scores of the head and tail entity fusions were 97.32% and 98.57%, respectively. Triples from the constructed KGPA were selected for evaluation, demonstrating 95.4% accuracy. CONCLUSIONS This paper introduces an approach to fuse triples extracted from heterogeneous data sources, which can be used to build a knowledge graph. The evaluation results showed that the data in the KGPA are of high quality. The constructed KGPA can help physicians in clinical practice.
Affiliation(s)
- An Fang
- Life Science College, Central South University, Changsha, China
- Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
- Pei Lou
- Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
- Jiahui Hu
- Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
- Wanqing Zhao
- Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
- Ming Feng
- Peking Union Medical College Hospital, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China
- Huiling Ren
- Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China
- Xianlai Chen
- Big Data Institute, Central South University, Changsha, China
- National Engineering Lab for Medical Big Data Application Technology, Central South University, Changsha, China
27
Wang M, Wang H, Liu X, Ma X, Wang B. Drug-Drug Interaction Predictions via Knowledge Graph and Text Embedding: Instrument Validation Study. JMIR Med Inform 2021; 9:e28277. [PMID: 34185011 PMCID: PMC8277366 DOI: 10.2196/28277] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/28/2021] [Revised: 04/29/2021] [Accepted: 05/05/2021] [Indexed: 11/23/2022] Open
Abstract
Background Minimizing adverse reactions caused by drug-drug interactions (DDIs) has always been a prominent research topic in clinical pharmacology. Detecting all possible interactions through clinical studies before a drug is released to the market is a demanding task. The power of big data is opening up new approaches to discovering various DDIs. However, these data contain a huge amount of noise and provide knowledge bases that are far from being complete or used with reliability. Most existing studies focus on predicting binary DDIs between drug pairs and ignore other interactions. Objective Leveraging both drug knowledge graphs and biomedical text is a promising pathway for rich and comprehensive DDI prediction, but it is not without issues. Our proposed model seeks to address the following challenges: data noise and incompleteness, data sparsity, and computational complexity. Methods We propose a novel framework, Predicting Rich DDI, to predict DDIs. The framework uses graph embedding to overcome data incompleteness and sparsity issues to make multiple DDI label predictions. First, a large-scale drug knowledge graph is generated from different sources. The knowledge graph is then embedded with comprehensive biomedical text into a common low-dimensional space. Finally, the learned embeddings are used to efficiently compute rich DDI information through a link prediction process. Results To validate the effectiveness of the proposed framework, extensive experiments were conducted on real-world data sets. The results demonstrate that our model outperforms several state-of-the-art baseline methods in terms of capability and accuracy. Conclusions We propose a novel framework, Predicting Rich DDI, to predict DDIs. Using rich DDI information, it can competently predict multiple labels for a pair of drugs across numerous domains, ranging from pharmacological mechanisms to side effects. 
To the best of our knowledge, this framework is the first to provide a joint translation-based embedding model that learns DDIs by integrating drug knowledge graphs and biomedical text simultaneously in a common low-dimensional space. The model also predicts DDIs using multiple labels rather than single or binary labels. Extensive experiments were conducted on real-world data sets to demonstrate the effectiveness and efficiency of the model. The results show our proposed framework outperforms several state-of-the-art baselines.
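Note: the abstract above describes a translation-based embedding model used for link prediction but does not give its scoring function. As a rough illustration only, the classic TransE idea treats a relation as a translation vector and scores a triple (head, relation, tail) by the distance between head + relation and tail, with smaller distances meaning more plausible links. The function names, vectors, and candidate labels below are hypothetical, and the joint text-and-graph embedding of the paper is not reproduced here.

```python
import math


def transe_score(head, relation, tail):
    """L2 distance ||head + relation - tail||; lower is more plausible."""
    return math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


def rank_tails(head, relation, candidates):
    """Return candidate names ordered from most to least plausible tail.

    `candidates` maps a name to its embedding vector (hypothetical data).
    """
    return sorted(
        candidates,
        key=lambda name: transe_score(head, relation, candidates[name]),
    )


# Toy 2-dimensional embeddings, chosen by hand for illustration.
drug_a = [1.0, 0.0]
interacts_with = [0.0, 1.0]
candidate_drugs = {"drug_b": [1.0, 1.0], "drug_c": [3.0, 3.0]}
ranking = rank_tails(drug_a, interacts_with, candidate_drugs)
```

In a multi-label setting such as the one the abstract describes, the same ranking could be run once per relation type, keeping every relation whose score falls below a chosen threshold.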
Affiliation(s)
- Meng Wang
- School of Computer Science and Engineering, Southeast University, Nanjing, China; Key Laboratory of Computer Network and Information Integration, Southeast University, Nanjing, China
- Haofen Wang
- College of Design and Innovation, Tongji University, Shanghai, China
- Xing Liu
- Third Xiangya Hospital, Central South University, Changsha, China
- Xinyu Ma
- School of Computer Science and Engineering, Southeast University, Nanjing, China
- Beilun Wang
- School of Computer Science and Engineering, Southeast University, Nanjing, China
28
Warikoo N, Chang YC, Hsu WL. LBERT: Lexically aware Transformer-based Bidirectional Encoder Representation model for learning universal bio-entity relations. Bioinformatics 2021; 37:404-412. [PMID: 32810217 DOI: 10.1093/bioinformatics/btaa721] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/18/2019] [Revised: 06/30/2020] [Accepted: 08/13/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Natural Language Processing techniques are constantly being advanced to accommodate the influx of data as well as to provide exhaustive and structured knowledge dissemination. Within the biomedical domain, relation detection between bio-entities known as the Bio-Entity Relation Extraction (BRE) task has a critical function in knowledge structuring. Although recent advances in deep learning-based biomedical domain embedding have improved BRE predictive analytics, these works are often task selective or use external knowledge-based pre-/post-processing. In addition, deep learning-based models do not account for local syntactic contexts, which have improved data representation in many kernel classifier-based models. In this study, we propose a universal BRE model, i.e. LBERT, which is a Lexically aware Transformer-based Bidirectional Encoder Representation model, and which explores both local and global contexts representations for sentence-level classification tasks. RESULTS This article presents one of the most exhaustive BRE studies ever conducted over five different bio-entity relation types. Our model outperforms state-of-the-art deep learning models in protein-protein interaction (PPI), drug-drug interaction and protein-bio-entity relation classification tasks by 0.02%, 11.2% and 41.4%, respectively. LBERT representations show a statistically significant improvement over BioBERT in detecting true bio-entity relation for large corpora like PPI. Our ablation studies clearly indicate the contribution of the lexical features and distance-adjusted attention in improving prediction performance by learning additional local semantic context along with bi-directionally learned global context. AVAILABILITY AND IMPLEMENTATION Github. https://github.com/warikoone/LBERT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Neha Warikoo
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei 112, Taiwan; Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei 115, Taiwan; Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
- Yung-Chun Chang
- Graduate Institute of Data Science, College of Management, Taipei Medical University, Taipei 106, Taiwan; Clinical Big Data Research Center, Taipei Medical University, Taipei 110, Taiwan; Pervasive AI Research Labs, Ministry of Science and Technology, Hsinchu City 300, Taiwan
- Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan; Pervasive AI Research Labs, Ministry of Science and Technology, Hsinchu City 300, Taiwan
29
Sun P, Gu L. Fuzzy knowledge graph system for artificial intelligence-based smart education. Journal of Intelligent & Fuzzy Systems 2021. [DOI: 10.3233/jifs-189332] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/15/2022]
Abstract
Fuzzy knowledge graph system is a semantic network that reveals the relationships between entities, and a tool or methodology that can formally describe things in the real world and their relationships. Smart education is an educational concept or model that uses advanced information technology to build a smart environment, integrates theory and practice to build an educational framework for information age, and provides paths to practice it. Artificial intelligence (AI) is a comprehensive discipline developed by the interpenetration of computer science, cybernetics, information theory, linguistics, neurophysiology and other disciplines, which is a direction for the development of information technology in the future. On the basis of summarizing and analyzing of previous research works, this paper expounded the research status and significance of AI technology, elaborated the development background, current status and future challenges of the construction and application of fuzzy knowledge graph system for smart education, introduced the methods and principles of data acquisition methods and digitalized apprenticeship, realized the process design, information extraction, entity recognition and relationship mining of smart education, constructed a systematic framework for fuzzy knowledge graph, and analyzed the high-quality resources sharing and personalized service of AI-assisted smart education, discussed automatic knowledge acquisition and fusion of fuzzy knowledge graph, performed co-occurrence relationship analysis, and finally conducted application case analysis. 
The results show that the smart education knowledge graph for AI-assisted smart education can integrate teaching experience and domain knowledge of discipline experts, enhance explainable and robust machine intelligence for AI-assisted smart education, and provide data-driven and knowledge-driven information processing methods; it can also discover the analysis hotspots and main content of research objects through clustering of high-frequency topic words, reveal the corresponding research structure in depth, and then systematically explore its research dimensions, subject background and theoretical basis.
Affiliation(s)
- Pingping Sun
- Business & Tourism Institute, Hangzhou Vocational & Technical College, Hangzhou, Zhejiang, China
- Lingang Gu
- Special Equipment Institute, Hangzhou Vocational & Technical College, Hangzhou, Zhejiang, China
30
Guluzade A, Kacupaj E, Maleshkova M. Demographic Aware Probabilistic Medical Knowledge Graph Embeddings of Electronic Medical Records. Artif Intell Med 2021. [DOI: 10.1007/978-3-030-77211-6_48] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
31
Kang H, Li J, Wu M, Shen L, Hou L. Building a Pharmacogenomics Knowledge Model Toward Precision Medicine: Case Study in Melanoma. JMIR Med Inform 2020; 8:e20291. [PMID: 33084582 PMCID: PMC7641779 DOI: 10.2196/20291] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/15/2020] [Revised: 08/11/2020] [Accepted: 09/13/2020] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND Many drugs do not work the same way for everyone owing to differences in their genes. Pharmacogenomics (PGx) aims to understand how genetic variants influence drug efficacy and toxicity. It is often considered one of the most actionable areas of the personalized medicine paradigm. However, little prior work has included in-depth explorations and descriptions of drug usage, dosage adjustment, and so on. OBJECTIVE We present a pharmacogenomics knowledge model to discover the hidden relationships between PGx entities such as drugs, genes, and diseases, especially details in precise medication. METHODS PGx open data such as DrugBank and RxNorm were integrated in this study, as well as drug labels published by the US Food and Drug Administration. We annotated 190 drug labels manually for entities and relationships. Based on the annotation results, we trained 3 different natural language processing models to complete entity recognition. Finally, the pharmacogenomics knowledge model was described in detail. RESULTS In entity recognition tasks, the Bidirectional Encoder Representations from Transformers-conditional random field model achieved better performance with a micro-F1 score of 85.12%. The pharmacogenomics knowledge model in our study included 5 semantic types: drug, gene, disease, precise medication (population, daily dose, dose form, frequency, etc), and adverse reaction. Meanwhile, 26 semantic relationships were defined in detail. Taking melanoma caused by a BRAF gene mutation into consideration, the pharmacogenomics knowledge model covered 7 related drugs and 4846 triples were established in this case. All the corpora, relationship definitions, and triples were made publicly available. CONCLUSIONS We highlighted the pharmacogenomics knowledge model as a scalable framework for clinicians and clinical pharmacists to adjust drug dosage according to patient-specific genetic variation, and for pharmaceutical researchers to develop new drugs. 
In the future, a series of other antitumor drugs and automatic relation extractions will be taken into consideration to further enhance our framework with more PGx linked data.
Affiliation(s)
- Hongyu Kang
- Institute of Medical Information & Library, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, China
- Department of Biomedical Engineering, School of Life Science, Beijing Institute of Technology, Beijing, China
- Jiao Li
- Institute of Medical Information & Library, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, China
- Meng Wu
- Institute of Medical Information & Library, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, China
- Liu Shen
- Institute of Medical Information & Library, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, China
- Li Hou
- Institute of Medical Information & Library, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, China
32
Li R, Yin C, Yang S, Qian B, Zhang P. Marrying Medical Domain Knowledge With Deep Learning on Electronic Health Records: A Deep Visual Analytics Approach. J Med Internet Res 2020; 22:e20645. [PMID: 32985996 PMCID: PMC7551124 DOI: 10.2196/20645] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 05/29/2020] [Revised: 07/07/2020] [Accepted: 07/26/2020] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND Deep learning models have attracted significant interest from health care researchers during the last few decades. There have been many studies that apply deep learning to medical applications and achieve promising results. However, there are three limitations to the existing models: (1) most clinicians are unable to interpret the results from the existing models, (2) existing models cannot incorporate complicated medical domain knowledge (eg, a disease causes another disease), and (3) most existing models lack visual exploration and interaction. Both the electronic health record (EHR) data set and the deep model results are complex and abstract, which impedes clinicians from exploring and communicating with the model directly. OBJECTIVE The objective of this study is to develop an interpretable and accurate risk prediction model as well as an interactive clinical prediction system to support EHR data exploration, knowledge graph demonstration, and model interpretation. METHODS A domain-knowledge-guided recurrent neural network (DG-RNN) model is proposed to predict clinical risks. The model takes medical event sequences as input and incorporates medical domain knowledge by attending to a subgraph of the whole medical knowledge graph. A global pooling operation and a fully connected layer are used to output the clinical outcomes. The middle results and the parameters of the fully connected layer are helpful in identifying which medical events cause clinical risks. DG-Viz is also designed to support EHR data exploration, knowledge graph demonstration, and model interpretation. RESULTS We conducted both risk prediction experiments and a case study on a real-world data set. A total of 554 patients with heart failure and 1662 control patients without heart failure were selected from the data set. The experimental results show that the proposed DG-RNN outperforms the state-of-the-art approaches by approximately 1.5%. 
The case study demonstrates how our medical physician collaborator can effectively explore the data and interpret the prediction results using DG-Viz. CONCLUSIONS In this study, we present DG-Viz, an interactive clinical prediction system, which brings together the power of deep learning (ie, a DG-RNN-based model) and visual analytics to predict clinical risks and visually interpret the EHR prediction results. Experimental results and a case study on heart failure risk prediction tasks demonstrate the effectiveness and usefulness of the DG-Viz system. This study will pave the way for interactive, interpretable, and accurate clinical risk predictions.
Affiliation(s)
- Rui Li
  - The Ohio State University, Columbus, OH, United States
- Samuel Yang
  - The Ohio State University, Columbus, OH, United States
  - Nationwide Children's Hospital, Columbus, OH, United States
- Ping Zhang
  - The Ohio State University, Columbus, OH, United States
|
33
|
Li N, Yang Z, Luo L, Wang L, Zhang Y, Lin H, Wang J. KGHC: a knowledge graph for hepatocellular carcinoma. BMC Med Inform Decis Mak 2020; 20:135. [PMID: 32646496 PMCID: PMC7346328 DOI: 10.1186/s12911-020-1112-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 11/10/2022]
Abstract
BACKGROUND Hepatocellular carcinoma is one of the most common malignant neoplasms in adults and has high mortality. Mining relevant medical knowledge from rapidly growing text data and integrating it with other existing biomedical resources will support research on hepatocellular carcinoma. To this end, we constructed a knowledge graph for Hepatocellular Carcinoma (KGHC). METHODS We propose an approach to build a knowledge graph for hepatocellular carcinoma. Specifically, we first extracted knowledge from structured and unstructured data. Since the extracted entities may contain some noise, we applied a biomedical information extraction system, named BioIE, to filter the data in KGHC. Then we introduced a fusion method to fuse the extracted data. Finally, we stored the data in Neo4j, which can help researchers analyze the network of hepatocellular carcinoma. RESULTS KGHC contains 13,296 triples and provides knowledge of hepatocellular carcinoma for healthcare professionals, sparing them from digging through large amounts of biomedical literature. This could hopefully improve the efficiency of research on hepatocellular carcinoma. KGHC is freely accessible for academic research purposes at http://202.118.75.18:18895/browser/ . CONCLUSIONS In this paper, we present a knowledge graph associated with hepatocellular carcinoma, which is constructed from vast amounts of structured and unstructured data. The evaluation results show that the data in KGHC are of high quality.
Affiliation(s)
- Nan Li
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, 116024 China
- Zhihao Yang
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, 116024 China
- Ling Luo
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, 116024 China
- Lei Wang
  - Beijing Institute of Health Administration and Medical Information, Beijing, 100850 China
- Yin Zhang
  - Beijing Institute of Health Administration and Medical Information, Beijing, 100850 China
- Hongfei Lin
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, 116024 China
- Jian Wang
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, 116024 China
|
34
|
Nicholson DN, Greene CS. Constructing knowledge graphs and their biomedical applications. Comput Struct Biotechnol J 2020; 18:1414-1428. [PMID: 32637040 PMCID: PMC7327409 DOI: 10.1016/j.csbj.2020.05.017] [Citation(s) in RCA: 76] [Impact Index Per Article: 19.0] [Received: 03/12/2020] [Revised: 05/22/2020] [Accepted: 05/23/2020] [Indexed: 12/31/2022]
Abstract
Knowledge graphs can support many biomedical applications. These graphs represent biomedical concepts and relationships in the form of nodes and edges. In this review, we discuss how these graphs are constructed and applied with a particular focus on how machine learning approaches are changing these processes. Biomedical knowledge graphs have often been constructed by integrating databases that were populated by experts via manual curation, but we are now seeing a more robust use of automated systems. A number of techniques are used to represent knowledge graphs, but often machine learning methods are used to construct a low-dimensional representation that can support many different applications. This representation is designed to preserve a knowledge graph's local and/or global structure. Additional machine learning methods can be applied to this representation to make predictions within genomic, pharmaceutical, and clinical domains. We frame our discussion first around knowledge graph construction and then around unifying representational learning techniques and unifying applications. Advances in machine learning for biomedicine are creating new opportunities across many domains, and we note potential avenues for future work with knowledge graphs that appear particularly promising.
Affiliation(s)
- David N. Nicholson
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, United States
- Casey S. Greene
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Childhood Cancer Data Lab, Alex’s Lemonade Stand Foundation, United States
|
36
|
Öztürk H, Özgür A, Schwaller P, Laino T, Ozkirimli E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov Today 2020; 25:689-705. [DOI: 10.1016/j.drudis.2020.01.020] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 09/17/2019] [Revised: 12/20/2019] [Accepted: 01/28/2020] [Indexed: 01/06/2023]
|
37
|
Frey LJ, Talbert DA. Artificial Intelligence Pipeline to Bridge the Gap between Bench Researchers and Clinical Researchers in Precision Medicine. MED ONE 2020; 5:10.20900/mo20200001. [PMID: 33511289 PMCID: PMC7839064 DOI: 10.20900/mo20200001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
Abstract
Precision medicine informatics is a field of research that incorporates learning systems that generate new knowledge to improve individualized treatments using integrated data sets and models. Given the ever-increasing volumes of data that are relevant to patient care, artificial intelligence (AI) pipelines need to be a central component of such research to speed discovery. Applying AI methodology to complex multidisciplinary information retrieval can support efforts to discover bridging concepts within collaborating communities. This dovetails with precision medicine research, given the information rich multi-omic data that are used in precision medicine analysis pipelines. In this perspective article we define a prototype AI pipeline to facilitate discovering research connections between bioinformatics and clinical researchers. We propose building knowledge representations that are iteratively improved through AI and human-informed learning feedback loops supported through crowdsourcing. To illustrate this, we will explore the specific use case of nonalcoholic fatty liver disease, a growing health care problem. We will examine AI pipeline construction and utilization in relation to bench-to-bedside bridging concepts with interconnecting knowledge representations applicable to bioinformatics researchers and clinicians.
Affiliation(s)
- Lewis J. Frey
  - Department of Public Health Science, Biomedical Informatics Center, Hollings Cancer Center, Medical University of South Carolina (MUSC), 135 Cannon St, Charleston, SC 29425, USA
  - Health Equity and Rural Outreach Innovation Center (HEROIC), Ralph H. Johnson Veteran Affairs Medical Center, Charleston, SC 29401, USA
- Douglas A. Talbert
  - Department of Computer Science, Tennessee Tech University (TTU), 1 William L Jones Dr, Cookeville, TN 38505, USA
|
38
|
Cuzzola J, Bagheri E, Jovanovic J. UMLS to DBPedia link discovery through circular resolution. J Am Med Inform Assoc 2019; 25:819-826. [PMID: 29648604 DOI: 10.1093/jamia/ocy021] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 05/28/2017] [Accepted: 02/26/2018] [Indexed: 11/14/2022]
Abstract
Objective The goal of this work is to map Unified Medical Language System (UMLS) concepts to DBpedia resources using widely accepted ontology relations from the Simple Knowledge Organization System (skos:exactMatch, skos:closeMatch) and from the Resource Description Framework Schema (rdfs:seeAlso), as a result of which a complete mapping from UMLS (UMLS 2016AA) to DBpedia (DBpedia 2015-10) is made publicly available that includes 221 690 skos:exactMatch, 26 276 skos:closeMatch, and 6 784 322 rdfs:seeAlso mappings. Methods We propose a method called circular resolution that utilizes a combination of semantic annotators to map UMLS concepts to DBpedia resources. A set of annotators annotate definitions of UMLS concepts returning DBpedia resources while another set performs annotation on DBpedia resource abstracts returning UMLS concepts. Our pipeline aligns these 2 sets of annotations to determine appropriate mappings from UMLS to DBpedia. Results We evaluate our proposed method using structured data from the Wikidata knowledge base as the ground truth, which consists of 4899 already existing UMLS to DBpedia mappings. Our results show an 83% recall with 77% precision-at-one (P@1) in mapping UMLS concepts to DBpedia resources on this testing set. Conclusions The proposed circular resolution method is a simple yet effective technique for linking UMLS concepts to DBpedia resources. Experiments using Wikidata-based ground truth reveal a high mapping accuracy. In addition to the complete UMLS mapping downloadable in n-triple format, we provide an online browser and a RESTful service to explore the mappings.
Affiliation(s)
- John Cuzzola
  - Laboratory for Systems, Software and Semantics (LS3), Ryerson University, Ontario, Canada
- Ebrahim Bagheri
  - Laboratory for Systems, Software and Semantics (LS3), Ryerson University, Ontario, Canada
- Jelena Jovanovic
  - Faculty of Organizational Sciences (FOS), University of Belgrade, Belgrade, Serbia
|
39
|
Abstract
Dangerous goods account for a significant share of international shipping, and governments and enterprises pay close attention to transport safety. Dangerous goods come in a wide variety, and the knowledge involved is extensive and complex. Organizing and managing this knowledge plays an important role in the safe transportation of dangerous goods. Knowledge graphs are a new class of knowledge management technologies that provide powerful technical support for integrating domain knowledge and solving the problem of the “knowledge island.” This paper first introduces the knowledge of maritime dangerous goods (MDG); constructs a three-layer knowledge structure of MDG, dividing this knowledge into two categories; uses ontology to express the concepts, entities, and relations of MDG; and proposes and details representation methods for the conceptual layer and the entity layer. Finally, the knowledge graph of maritime dangerous goods (KGMDG) is constructed. Furthermore, we demonstrate knowledge visualization, retrieval, and automatic judgment of segregation requirements based on KGMDG. The results show that KGMDG not only helps simplify the retrieval of professional knowledge and promote intelligent transportation but is also conducive to the sharing, dissemination, and utilization of MDG knowledge.
|
40
|
Abstract
BACKGROUND Diabetes has become one of the hot topics in life science research. To support analytical procedures, researchers and analysts expend considerable labor to collect experimental data, a process that is also error-prone. To reduce the cost and ensure data quality, there is a growing trend of extracting clinical events in the form of knowledge from electronic medical records (EMRs). To do so, we first need a high-coverage knowledge base (KB) of a specific disease to support these extraction tasks, called KB-based Extraction. METHODS We propose an approach to build a diabetes-centric knowledge base (a.k.a. DKB) via mining the Web. In particular, we first extract knowledge from semi-structured contents of vertical portals, fuse individual knowledge from each site, and further map them to a unified KB. The target DKB is then extracted from the overall KB based on a distance-based Expectation-Maximization (EM) algorithm. RESULTS During the experiments, we selected eight popular vertical portals in China as data sources to construct DKB. There are 7703 instances and 96,041 edges in the final diabetes KB covering diseases, symptoms, western medicines, traditional Chinese medicines, examinations, departments, and body structures. The accuracy of DKB is 95.91%. Besides the quality assessment of extracted knowledge from vertical portals, we also carried out detailed experiments for evaluating the knowledge fusion performance as well as the convergence of the distance-based EM algorithm, with positive results. CONCLUSIONS In this paper, we introduced an approach to constructing DKB. A knowledge extraction and fusion pipeline was first used to extract semi-structured data from vertical portals, and individual KBs were further fused into a unified knowledge base. After that, we developed a distance-based Expectation-Maximization algorithm to extract a subset from the overall knowledge base, forming the target DKB. Experiments showed that the data in DKB are rich and of high quality.
Affiliation(s)
- Fan Gong
  - Shanghai Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, Pu’an Road, Shanghai, China
- Yilei Chen
  - Shanghai Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, Pu’an Road, Shanghai, China
- Haofen Wang
  - Shanghai Leyan Technologies Co. Ltd, No. 1028 Panyu Road, Shanghai, China
- Hao Lu
  - Shanghai Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, Pu’an Road, Shanghai, China
|
41
|
Yuan J, Jin Z, Guo H, Jin H, Zhang X, Smith T, Luo J. Constructing biomedical domain-specific knowledge graph with minimum supervision. Knowl Inf Syst 2019. [DOI: 10.1007/s10115-019-01351-4] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Indexed: 11/28/2022]
|
42
|
Turki H, Hadj Taieb MA, Ben Aouicha M. MeSH qualifiers, publication types and relation occurrence frequency are also useful for a better sentence-level extraction of biomedical relations. J Biomed Inform 2018; 83:217-218. [DOI: 10.1016/j.jbi.2018.05.011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 04/15/2018] [Revised: 05/01/2018] [Accepted: 05/17/2018] [Indexed: 11/16/2022]
|
43
|
Ruan T, Wang M, Sun J, Wang T, Zeng L, Yin Y, Gao J. An automatic approach for constructing a knowledge base of symptoms in Chinese. J Biomed Semantics 2017; 8:33. [PMID: 29297414 PMCID: PMC5763289 DOI: 10.1186/s13326-017-0145-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 12/02/2022]
Abstract
Background While a large number of well-known knowledge bases (KBs) in life science have been published as Linked Open Data, there are few KBs in Chinese. However, KBs in Chinese are necessary when we want to automatically process and analyze electronic medical records (EMRs) in Chinese. Among these, a symptom KB in Chinese is the most urgently needed, since symptoms are the starting point of clinical diagnosis. Results We publish a public KB of symptoms in Chinese, including symptoms, departments, diseases, medicines, and examinations as well as relations between symptoms and the above related entities. To the best of our knowledge, there is no such KB focusing on symptoms in Chinese, and the KB is an important supplement to existing medical resources. Our KB is constructed by fusing data automatically extracted from eight mainstream healthcare websites and three Chinese encyclopedia sites, supplemented by symptoms extracted from a large number of EMRs. Methods First, we manually design the data schema with reference to the Unified Medical Language System (UMLS). Second, we extract entities from eight mainstream healthcare websites; these are fed as seeds to train a multi-class classifier that classifies entities from encyclopedia sites, and we train a Conditional Random Field (CRF) model to extract symptoms from EMRs. Third, we fuse data to resolve the large-scale duplication between different data sources through entity type alignment, entity mapping, and attribute mapping. Finally, we link our KB to UMLS to investigate similarities and differences between symptoms in Chinese and English. Conclusions As a result, the KB has more than 26,000 distinct symptoms in Chinese, including 3968 symptoms in traditional Chinese medicine and 1029 synonym pairs for symptoms. The KB also includes concepts such as diseases and medicines as well as relations between symptoms and the above related entities. We also link our KB to the Unified Medical Language System and analyze the differences between symptoms in the two KBs. We released the KB as Linked Open Data and a demo at https://datahub.io/dataset/symptoms-in-chinese.
Affiliation(s)
- Tong Ruan
  - East China University of Science and Technology, Shanghai, China
- Mengjie Wang
  - East China University of Science and Technology, Shanghai, China
- Jian Sun
  - East China University of Science and Technology, Shanghai, China
- Ting Wang
  - East China University of Science and Technology, Shanghai, China
- Lu Zeng
  - East China University of Science and Technology, Shanghai, China
- Yichao Yin
  - Shanghai Shuguang Hospital, Shanghai, 200025, China
- Ju Gao
  - Shanghai Shuguang Hospital, Shanghai, 200025, China
|
44
|
Automated extraction of potential migraine biomarkers using a semantic graph. J Biomed Inform 2017; 71:178-189. [PMID: 28579531 DOI: 10.1016/j.jbi.2017.05.018] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 12/29/2016] [Revised: 04/03/2017] [Accepted: 05/23/2017] [Indexed: 01/20/2023]
Abstract
PROBLEM Biomedical literature and databases contain important clues for the identification of potential disease biomarkers. However, searching these enormous knowledge reservoirs and integrating findings across heterogeneous sources is costly and difficult. Here we demonstrate how semantically integrated knowledge, extracted from biomedical literature and structured databases, can be used to automatically identify potential migraine biomarkers. METHOD We used a knowledge graph containing more than 3.5 million biomedical concepts and 68.4 million relationships. Biochemical compound concepts were filtered and ranked by their potential as biomarkers based on their connections to a subgraph of migraine-related concepts. The ranked results were evaluated against the results of a systematic literature review that was performed manually by migraine researchers. Weight points were assigned to these reference compounds to indicate their relative importance. RESULTS Ranked results automatically generated by the knowledge graph were highly consistent with results from the manual literature review. Out of 222 reference compounds, 163 (73%) ranked in the top 2000, with 547 out of the 644 (85%) weight points assigned to the reference compounds. For reference compounds that were not in the top of the list, an extensive error analysis has been performed. When evaluating the overall performance, we obtained a ROC-AUC of 0.974. DISCUSSION Semantic knowledge graphs composed of information integrated from multiple and varying sources can assist researchers in identifying potential disease biomarkers.
|
45
|
Semantic Health Knowledge Graph: Semantic Integration of Heterogeneous Medical Knowledge and Services. BIOMED RESEARCH INTERNATIONAL 2017; 2017:2858423. [PMID: 28299322 PMCID: PMC5337312 DOI: 10.1155/2017/2858423] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Received: 08/16/2016] [Revised: 11/28/2016] [Accepted: 12/22/2016] [Indexed: 11/29/2022]
Abstract
With the explosion of healthcare information, there has been a tremendous amount of heterogeneous textual medical knowledge (TMK), which plays an essential role in healthcare information systems. Existing works for integrating and utilizing the TMK mainly focus on establishing straightforward connections and pay less attention to enabling computers to interpret and retrieve knowledge correctly and quickly. In this paper, we explore a novel model to organize and integrate the TMK into conceptual graphs. We then employ a framework to automatically retrieve knowledge in knowledge graphs with high precision. In order to perform reasonable inference on knowledge graphs, we propose a contextual inference pruning algorithm to achieve efficient chain inference. Our algorithm achieves a better inference result with precision and recall of 92% and 96%, respectively, and can avoid most of the meaningless inferences. In addition, we implement two prototypes and provide services, and the results show our approach is practical and effective.
|
46
|
Bai T, Gong L, Wang Y, Wang Y, Kulikowski CA, Huang L. A method for exploring implicit concept relatedness in biomedical knowledge network. BMC Bioinformatics 2016; 17 Suppl 9:265. [PMID: 27454167 PMCID: PMC4959351 DOI: 10.1186/s12859-016-1131-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 11/10/2022]
Abstract
BACKGROUND Biomedical information and knowledge, structural and non-structural, stored in different repositories can be semantically connected to form a hybrid knowledge network. How to compute relatedness between concepts and discover valuable but implicit information or knowledge from it effectively and efficiently is of paramount importance for precision medicine, and a major challenge facing the biomedical research community. RESULTS In this study, a hybrid biomedical knowledge network is constructed by linking concepts across multiple biomedical ontologies as well as non-structural biomedical knowledge sources. To discover implicit relatedness between concepts in ontologies for which potentially valuable relationships (implicit knowledge) may exist, we developed a Multi-Ontology Relatedness Model (MORM) within the knowledge network, for which a relatedness network (RN) is defined and computed across multiple ontologies using a formal inference mechanism of set-theoretic operations. Semantic constraints are designed and implemented to prune the search space of the relatedness network. CONCLUSIONS Experiments to test examples of several biomedical applications have been carried out, and the evaluation of the results showed an encouraging potential of the proposed approach to biomedical knowledge discovery.
Affiliation(s)
- Tian Bai
  - College of Computer Science and Technology, Jilin University, 2699 Qianjin St, Changchun, China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, 2699 Qianjin St, Changchun, China
- Leiguang Gong
  - College of Computer Science and Technology, Jilin University, 2699 Qianjin St, Changchun, China
  - Yantai Intelligent Information Technologies Ltd., 2699 Qianjin St, Yantai, China
- Ye Wang
  - College of Computer Science and Technology, Jilin University, 2699 Qianjin St, Changchun, China
- Yan Wang
  - College of Computer Science and Technology, Jilin University, 2699 Qianjin St, Changchun, China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, 2699 Qianjin St, Changchun, China
- Casimir A. Kulikowski
  - Department of Computer Science, Rutgers, The State University of New Jersey, 2699 Qianjin St, Piscataway, NJ USA
- Lan Huang
  - College of Computer Science and Technology, Jilin University, 2699 Qianjin St, Changchun, China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, 2699 Qianjin St, Changchun, China
|