1
Wu J, Dong H, Li Z, Wang H, Li R, Patra A, Dai C, Ali W, Scordis P, Wu H. A hybrid framework with large language models for rare disease phenotyping. BMC Med Inform Decis Mak 2024; 24:289. [PMID: 39375687 PMCID: PMC11460004 DOI: 10.1186/s12911-024-02698-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2024] [Accepted: 09/26/2024] [Indexed: 10/09/2024] Open
Abstract
PURPOSE Rare diseases pose significant challenges in diagnosis and treatment due to their low prevalence and heterogeneous clinical presentations. Unstructured clinical notes contain valuable information for identifying rare diseases, but manual curation is time-consuming and prone to subjectivity. This study aims to develop a hybrid approach combining dictionary-based natural language processing (NLP) tools with large language models (LLMs) to improve rare disease identification from unstructured clinical reports.
METHODS We propose a novel hybrid framework that integrates the Orphanet Rare Disease Ontology (ORDO) and the Unified Medical Language System (UMLS) to create a comprehensive rare disease vocabulary. SemEHR, a dictionary-based NLP tool, is employed to extract rare disease mentions from clinical notes. To refine the results and improve accuracy, we leverage various LLMs, including LLaMA3, Phi3-mini, and domain-specific models like OpenBioLLM and BioMistral. Different prompting strategies, such as zero-shot, few-shot, and knowledge-augmented generation, are explored to optimize the LLMs' performance.
RESULTS The proposed hybrid approach demonstrates superior performance compared to traditional NLP systems and standalone LLMs. LLaMA3 and Phi3-mini achieve the highest F1 scores in rare disease identification. Few-shot prompting with 1-3 examples yields the best results, while knowledge-augmented generation shows limited improvement. Notably, the approach uncovers a significant number of potential rare disease cases not documented in structured diagnostic records, highlighting its ability to identify previously unrecognized patients.
CONCLUSION The hybrid approach combining dictionary-based NLP tools with LLMs shows great promise for improving rare disease identification from unstructured clinical reports. By leveraging the strengths of both techniques, the method demonstrates superior performance and the potential to uncover hidden rare disease cases. Further research is needed to address limitations related to ontology mapping and overlapping case identification, and to integrate the approach into clinical practice for early diagnosis and improved patient outcomes.
Affiliation(s)
- Jinge Wu
  - Institute of Health Informatics, University College London, London, UK.
  - UCB Pharma UK, Slough, UK.
- Hang Dong
  - Department of Computer Science, University of Exeter, Exeter, UK
- Zexi Li
  - The Nuffield Department of Surgical Sciences, University of Oxford, Oxford, UK
- Haowei Wang
  - Division of Medicine, University College London, London, UK
- Runci Li
  - EGA Institute for Women's Health, University College London, London, UK
- Honghan Wu
  - Institute of Health Informatics, University College London, London, UK.
  - School of Health and Wellbeing, University of Glasgow, Glasgow, UK.
2
Barakat A, Munro G, Heegaard AM. Finding new analgesics: Computational pharmacology faces drug discovery challenges. Biochem Pharmacol 2024; 222:116091. [PMID: 38412924 DOI: 10.1016/j.bcp.2024.116091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 01/10/2024] [Accepted: 02/23/2024] [Indexed: 02/29/2024]
Abstract
Despite the worldwide prevalence and huge burden of pain, pain is an undertreated phenomenon. Currently used analgesics have several limitations regarding their efficacy and safety. The discovery of analgesics possessing a novel mechanism of action has faced multiple challenges, including a limited understanding of biological processes underpinning pain and analgesia and poor animal-to-human translation. Computational pharmacology is currently employed to face these challenges. In this review, we discuss the theory, methods, and applications of computational pharmacology in pain research. Computational pharmacology encompasses a wide variety of theoretical concepts and practical methodological approaches, with the overall aim of gaining biological insight through data acquisition and analysis. Data are acquired from patients or animal models with pain or analgesic treatment, at different levels of biological organization (molecular, cellular, physiological, and behavioral). Distinct methodological algorithms can then be used to analyze and integrate data. This helps to facilitate the identification of biological molecules and processes associated with pain phenotype, build quantitative models of pain signaling, and extract translatable features between humans and animals. However, computational pharmacology has several limitations, and its predictions can provide false positive and negative findings. Therefore, computational predictions are required to be validated experimentally before drawing solid conclusions. In this review, we discuss several case study examples of combining and integrating computational tools with experimental pain research tools to meet drug discovery challenges.
Affiliation(s)
- Ahmed Barakat
  - Department of Drug Design and Pharmacology, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark; Department of Pharmacology and Toxicology, Faculty of Pharmacy, Assiut University, Assiut, Egypt.
- Anne-Marie Heegaard
  - Department of Drug Design and Pharmacology, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
3
Raveau MP, Goñi JI, Rodríguez JF, Paiva-Mack I, Barriga F, Hermosilla MP, Fuentes-Bravo C, Eyheramendy S. Natural language processing analysis of the psychosocial stressors of mental health disorders during the pandemic. NPJ MENTAL HEALTH RESEARCH 2023; 2:17. [PMID: 38609516 PMCID: PMC10955824 DOI: 10.1038/s44184-023-00039-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Accepted: 09/21/2023] [Indexed: 04/14/2024]
Abstract
Over the past few years, the COVID-19 pandemic has exerted various impacts on the world, notably concerning mental health. Nevertheless, the precise influence of psychosocial stressors on this mental health crisis remains largely unexplored. In this study, we employ natural language processing to examine chat text from a mental health helpline. The data were obtained from a chat helpline called Safe Hour, run by the "It Gets Better" project in Chile. This dataset encompasses 10,986 conversations between trained professional volunteers from the foundation and platform users from 2018 to 2020. Our analysis shows a significant increase in conversations covering issues of self-image and interpersonal relations, as well as a decrease in performance themes. We also observe that conversations involving themes like self-image and emotional crisis played a role in explaining both suicidal behavior and depressive symptoms. However, anxious symptoms can only be explained by emotional crisis themes. These findings shed light on the intricate connections between psychosocial stressors and various mental health aspects in the context of the COVID-19 pandemic.
Affiliation(s)
- Julián I Goñi
  - DILAB, Facultad de Ingeniería, Pontificia Universidad Católica de Chile, Santiago, Chile
  - Science, Technology, and Innovation Studies, The University of Edinburgh, Edinburgh, Scotland
- José F Rodríguez
  - Facultad de Ingeniería y Ciencias, Universidad Adolfo Ibáñez, Santiago, Chile
- Isidora Paiva-Mack
  - Escuela de Psicología, Universidad Adolfo Ibáñez, Santiago, Chile
  - GobLab, Escuela de Gobierno, Universidad Adolfo Ibáñez, Santiago, Chile
- Susana Eyheramendy
  - Facultad de Ingeniería y Ciencias, Universidad Adolfo Ibáñez, Santiago, Chile
4
Feng Z, Shen Z, Li H, Li S. e-TSN: an interactive visual exploration platform for target-disease knowledge mapping from literature. Brief Bioinform 2022; 23:bbac465. [PMID: 36347537 PMCID: PMC9677481 DOI: 10.1093/bib/bbac465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2022] [Revised: 09/20/2022] [Accepted: 09/27/2022] [Indexed: 11/10/2022] Open
Abstract
Target discovery and identification processes are driven by the increasing amount of biomedical data. The vast numbers of unstructured texts of biomedical publications provide a rich source of knowledge for drug target discovery research and demand the development of specific algorithms or tools to facilitate finding disease genes and proteins. Text mining is a method that can automatically mine helpful information related to drug target discovery from massive biomedical literature. However, there is a substantial lag between biomedical publications and the subsequent abstraction of information extracted by text mining to databases. The knowledge graph is introduced to integrate heterogeneous biomedical data. Here, we describe e-TSN (Target significance and novelty explorer, http://www.lilab-ecust.cn/etsn/), a knowledge visualization web server integrating the largest database of associations between targets and diseases from the full scientific literature by constructing significance and novelty scoring methods based on bibliometric statistics. The platform aims to visualize target-disease knowledge graphs to assist in prioritizing candidate disease-related proteins. Approved drugs and associated bioactivities for each target of interest are also provided to facilitate the visualization of drug-target relationships. In summary, e-TSN is a fast and customizable visualization resource for investigating and analyzing intricate target-disease networks, which could help researchers understand the mechanisms underlying complex disease phenotypes and improve the efficiency of drug discovery and development, especially during unexpected outbreaks of infectious disease pandemics like COVID-19.
Affiliation(s)
- Ziyan Feng
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Zihao Shen
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Honglin Li
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
  - Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
  - Lingang Laboratory, Shanghai 200031, China
- Shiliang Li
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
  - Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
5
Singh G, Papoutsoglou EA, Keijts-Lalleman F, Vencheva B, Rice M, Visser RG, Bachem CW, Finkers R. Extracting knowledge networks from plant scientific literature: potato tuber flesh color as an exemplary trait. BMC PLANT BIOLOGY 2021; 21:198. [PMID: 33894758 PMCID: PMC8070292 DOI: 10.1186/s12870-021-02943-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/16/2020] [Accepted: 03/29/2021] [Indexed: 06/12/2023]
Abstract
BACKGROUND Scientific literature carries a wealth of information crucial for research, but only a fraction of it is present as structured information in databases and therefore can be analyzed using traditional data analysis tools. Natural language processing (NLP) is often and successfully employed to support humans by distilling relevant information from large corpora of free text and structuring it in a way that lends itself to further computational analyses. For this pilot, we developed a pipeline that uses NLP on biological literature to produce knowledge networks. We focused on the flesh color of potato, a well-studied trait with known associations, and we investigated whether these knowledge networks can assist us in formulating new hypotheses on the underlying biological processes.
RESULTS We trained an NLP model based on a manually annotated corpus of 34 full-text potato articles, to recognize relevant biological entities and relationships between them in text (genes, proteins, metabolites and traits). This model detected the number of biological entities with a precision of 97.65% and a recall of 88.91% on the training set. We conducted a time series analysis on 4023 PubMed abstracts of plant genetics-based articles that focus on 4 major Solanaceous crops (tomato, potato, eggplant and capsicum), to determine that the networks contained both previously known and contemporaneously unknown leads to subsequently discovered biological phenomena relating to flesh color. A novel time-based analysis of these networks indicates a connection between our trait and a candidate gene (zeaxanthin epoxidase) already two years prior to explicit statements of that connection in the literature.
CONCLUSIONS Our time-based analysis indicates that network-assisted hypothesis generation shows promise for knowledge discovery, data integration and hypothesis generation in scientific research.
Affiliation(s)
- Gurnoor Singh
  - Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700AJ The Netherlands
- Mark Rice
  - IBM Netherlands, Amsterdam, The Netherlands
- Richard G.F. Visser
  - Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700AJ The Netherlands
- Christian W.B. Bachem
  - Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700AJ The Netherlands
- Richard Finkers
  - Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700AJ The Netherlands
6
Ji X, Tan W, Zhang C, Zhai Y, Hsueh Y, Zhang Z, Zhang C, Lu Y, Duan B, Tan G, Na R, Deng G, Niu G. TWIRLS, a knowledge-mining technology, suggests a possible mechanism for the pathological changes in the human host after coronavirus infection via ACE2. Drug Dev Res 2020; 81:1004-1018. [PMID: 32657473 PMCID: PMC7404951 DOI: 10.1002/ddr.21717] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/15/2020] [Revised: 06/05/2020] [Accepted: 06/27/2020] [Indexed: 12/12/2022]
Abstract
Faced with the current large-scale public health emergency, collecting, sorting, and analyzing biomedical information related to the "SARS-CoV-2" should be done as quickly as possible to gain a global perspective, which is a basic requirement for strengthening epidemic control capacity. However, for human researchers studying viruses and hosts, the vast amount of information available cannot be processed effectively and in a timely manner, particularly if our scientific understanding is also limited, which further lowers the information processing efficiency. We present TWIRLS (Topic-wise inference engine of massive biomedical literatures), a method that can deal with various scientific problems, such as liver cancer, acute myeloid leukemia, and so forth, which can automatically acquire, organize, and classify information. Additionally, this information can be combined with independent functional data sources to build an inference system via a machine-based approach, which can provide relevant knowledge to help human researchers quickly establish subject cognition and to make more effective decisions. Using TWIRLS, we automatically analyzed more than three million words in more than 14,000 literature articles in only 4 hr. We found that an important regulatory factor angiotensin-converting enzyme 2 (ACE2) may be involved in host pathological changes on binding to the coronavirus after infection. On triggering functional changes in ACE2/AT2R, the cytokine homeostasis regulation axis becomes imbalanced via the Renin-Angiotensin System and IP-10, leading to a cytokine storm. Through a preliminary analysis of blood indices of COVID-19 patients with a history of hypertension, we found that non-ARB (Angiotensin II receptor blockers) users had more symptoms of severe illness than ARB users. This suggests ARBs could potentially be used to treat acute lung injury caused by coronavirus infection.
Affiliation(s)
- Xiaoyang Ji
  - Key Laboratory of Animal Genetics, Breeding and Reproduction of Inner Mongolia Autonomous Region, College of Animal Science, Inner Mongolia Agricultural University, Hohhot, China
  - Joint Turing-Darwin Laboratory of Phil Rivers Technology Ltd. and Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - Department of Computational Biology, Phil Rivers Technology Ltd, Beijing, China
- Wenting Tan
  - Department of Infectious Diseases, Southwest Hospital, Third Military Medical University (Army Medical University), Chongqing, China
- Chunming Zhang
  - Joint Turing-Darwin Laboratory of Phil Rivers Technology Ltd. and Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - Department of Computational Biology, Phil Rivers Technology Ltd, Beijing, China
  - Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - West Institute of Computing Technology, Chinese Academy of Sciences, Chongqing, China
- Yubo Zhai
  - Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Yiching Hsueh
  - Joint Turing-Darwin Laboratory of Phil Rivers Technology Ltd. and Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - Department of Computational Biology, Phil Rivers Technology Ltd, Beijing, China
- Zhonghai Zhang
  - Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- Chunli Zhang
  - Department of Computational Biology, Phil Rivers Technology Ltd, Beijing, China
- Yanqiu Lu
  - Department of Infectious Diseases, Chongqing Public Health Medical Center, Chongqing, China
- Bo Duan
  - Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - West Institute of Computing Technology, Chinese Academy of Sciences, Chongqing, China
- Guangming Tan
  - Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - West Institute of Computing Technology, Chinese Academy of Sciences, Chongqing, China
- Renhua Na
  - Key Laboratory of Animal Genetics, Breeding and Reproduction of Inner Mongolia Autonomous Region, College of Animal Science, Inner Mongolia Agricultural University, Hohhot, China
- Guohong Deng
  - Department of Infectious Diseases, Southwest Hospital, Third Military Medical University (Army Medical University), Chongqing, China
- Gang Niu
  - Joint Turing-Darwin Laboratory of Phil Rivers Technology Ltd. and Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - Department of Computational Biology, Phil Rivers Technology Ltd, Beijing, China
  - West Institute of Computing Technology, Chinese Academy of Sciences, Chongqing, China
7
Clinical Named Entity Recognition from Chinese Electronic Medical Records Based on Deep Learning Pretraining. JOURNAL OF HEALTHCARE ENGINEERING 2020; 2020:8829219. [PMID: 33299537 PMCID: PMC7707942 DOI: 10.1155/2020/8829219] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 08/14/2020] [Revised: 10/26/2020] [Accepted: 11/02/2020] [Indexed: 12/19/2022]
Abstract
BACKGROUND Clinical named entity recognition is the basic task in mining the text of electronic medical records. It is challenging for Chinese electronic medical records, whose text contains many compound entities, frequently missing sentence components, and unclear entity boundaries. Moreover, corpora of Chinese electronic medical records are difficult to obtain.
METHODS To address these characteristics of Chinese electronic medical records, this study proposes a Chinese clinical entity recognition model based on deep learning pretraining. The model uses word embeddings from a domain corpus and fine-tunes an entity recognition model pretrained on a relevant corpus. BiLSTM and Transformer are then each used as feature extractors to identify four types of clinical entities (diseases, symptoms, drugs, and operations) in the text of Chinese electronic medical records.
RESULTS The model achieved 75.06% Macro-P, 76.40% Macro-R, and 75.72% Macro-F1 on the test dataset. These experiments show that the Chinese clinical entity recognition model based on deep learning pretraining can effectively improve recognition performance.
CONCLUSIONS These experiments show that the proposed Chinese clinical entity recognition model based on deep learning pretraining can effectively improve recognition performance.
8
Levin JM, Oprea TI, Davidovich S, Clozel T, Overington JP, Vanhaelen Q, Cantor CR, Bischof E, Zhavoronkov A. Artificial intelligence, drug repurposing and peer review. Nat Biotechnol 2020; 38:1127-1131. [DOI: 10.1038/s41587-020-0686-x] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Indexed: 12/18/2022]