1
Liu Q, Tian Y, Zhou T, Lyu K, Wang Z, Zheng Y, Liu Y, Ren J, Li J. An Explainable and Personalized Cognitive Reasoning Model Based on Knowledge Graph: Toward Decision Making for General Practice. IEEE J Biomed Health Inform 2024; 28:707-718. [PMID: 37669206 DOI: 10.1109/jbhi.2023.3312154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/07/2023]
Abstract
General practice plays a prominent role in primary health care (PHC). However, evidence has shown that the quality of PHC is still unsatisfactory, and the accuracy of clinical diagnosis and treatment must be improved in China. Decision making tools based on artificial intelligence can help general practitioners diagnose diseases, but most existing research is not sufficiently scalable and explainable. An explainable and personalized cognitive reasoning model based on knowledge graph (CRKG) proposed in this article can provide personalized diagnosis, perform decision making in general practice, and simulate the mode of thinking of human beings utilizing patients' electronic health records (EHRs) and knowledge graph. Taking abdominal diseases as the application point, an abdominal disease knowledge graph is first constructed in a semiautomated manner. Then, the CRKG designed referring to dual process theory in cognitive science involves the update strategy of global graph representations and reasoning on a personal cognitive graph by adopting the idea of graph neural networks and attention mechanisms. For the diagnosis of diseases in general practice, the CRKG outperforms all the baselines with a precision@1 of 0.7873, recall@10 of 0.9020 and hits@10 of 0.9340. Additionally, the visualization of the reasoning process for each visit of a patient based on the knowledge graph enhances clinicians' comprehension and contributes to explainability. This study is of great importance for the exploration and application of decision making based on EHRs and knowledge graph.
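The ranking metrics quoted in this abstract (precision@1, recall@10, hits@10) can be illustrated with a minimal sketch. The definitions below are the common textbook ones and are an assumption here; the abstract does not state the exact formulas the authors used, and the disease names are made up for illustration.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked predictions that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k predictions."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def hits_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant item is in the top-k predictions, else 0.0."""
    if k <= 0:
        raise ValueError("k must be positive")
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


# Hypothetical ranked diagnoses for one visit and the true diagnoses.
ranked = ["gastritis", "appendicitis", "cholecystitis", "pancreatitis"]
relevant = {"appendicitis", "pancreatitis"}

print(precision_at_k(ranked, relevant, 1))
print(recall_at_k(ranked, relevant, 2))
print(hits_at_k(ranked, relevant, 2))
```

In practice these per-visit values would be averaged over all visits in a test set to give figures comparable to those reported above.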
2
Pacheco JA, Rasmussen LV, Wiley K, Person TN, Cronkite DJ, Sohn S, Murphy S, Gundelach JH, Gainer V, Castro VM, Liu C, Mentch F, Lingren T, Sundaresan AS, Eickelberg G, Willis V, Furmanchuk A, Patel R, Carrell DS, Deng Y, Walton N, Satterfield BA, Kullo IJ, Dikilitas O, Smith JC, Peterson JF, Shang N, Kiryluk K, Ni Y, Li Y, Nadkarni GN, Rosenthal EA, Walunas TL, Williams MS, Karlson EW, Linder JE, Luo Y, Weng C, Wei W. Evaluation of the portability of computable phenotypes with natural language processing in the eMERGE network. Sci Rep 2023; 13:1971. [PMID: 36737471 PMCID: PMC9898520 DOI: 10.1038/s41598-023-27481-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/15/2022] [Accepted: 01/03/2023] [Indexed: 02/05/2023] Open
Abstract
The electronic Medical Records and Genomics (eMERGE) Network assessed the feasibility of deploying portable phenotype rule-based algorithms with natural language processing (NLP) components added to improve performance of existing algorithms using electronic health records (EHRs). Based on scientific merit and predicted difficulty, eMERGE selected six existing phenotypes to enhance with NLP. We assessed performance, portability, and ease of use. We summarized lessons learned by: (1) challenges; (2) best practices to address challenges based on existing evidence and/or eMERGE experience; and (3) opportunities for future research. Adding NLP resulted in improved, or the same, precision and/or recall for all but one algorithm. Portability, phenotyping workflow/process, and technology were major themes. With NLP, development and validation took longer. Besides portability of NLP technology and algorithm replicability, factors to ensure success include privacy protection, technical infrastructure setup, intellectual property agreement, and efficient communication. Workflow improvements can improve communication and reduce implementation time. NLP performance varied mainly due to clinical document heterogeneity; therefore, we suggest using semi-structured notes, comprehensive documentation, and customization options. NLP portability is possible with improved phenotype algorithm performance, but careful planning and architecture of the algorithms is essential to support local customizations.
Affiliation(s)
- Ken Wiley
- National Human Genome Research Institute, Bethesda, USA
- David J Cronkite
- Kaiser Permanente Washington Health Research Institute, Seattle, USA
- Cong Liu
- Columbia University, New York, USA
- Frank Mentch
- Children's Hospital of Philadelphia, Philadelphia, USA
- Todd Lingren
- Cincinnati Children's Hospital Medical Center, Cincinnati, USA
- David S Carrell
- Kaiser Permanente Washington Health Research Institute, Seattle, USA
- Yu Deng
- Northwestern University, Evanston, USA
- Yizhao Ni
- Cincinnati Children's Hospital Medical Center, Cincinnati, USA
- Yikuan Li
- Northwestern University, Evanston, USA
- Yuan Luo
- Northwestern University, Evanston, USA
- WeiQi Wei
- Vanderbilt University Medical Center, Nashville, USA
3
Ramgopal S, Sanchez-Pinto LN, Horvat CM, Carroll MS, Luo Y, Florin TA. Artificial intelligence-based clinical decision support in pediatrics. Pediatr Res 2023; 93:334-341. [PMID: 35906317 PMCID: PMC9668209 DOI: 10.1038/s41390-022-02226-1] [Citation(s) in RCA: 20] [Impact Index Per Article: 20.0] [Received: 03/21/2022] [Revised: 06/29/2022] [Accepted: 07/18/2022] [Indexed: 11/24/2022]
Abstract
Machine learning models may be integrated into clinical decision support (CDS) systems to identify children at risk of specific diagnoses or clinical deterioration to provide evidence-based recommendations. This use of artificial intelligence models in clinical decision support (AI-CDS) may have several advantages over traditional "rule-based" CDS models in pediatric care through increased model accuracy, with fewer false alerts and missed patients. AI-CDS tools must be appropriately developed, provide insight into the rationale behind decisions, be seamlessly integrated into care pathways, be intuitive to use, answer clinically relevant questions, respect the content expertise of the healthcare provider, and be scientifically sound. While numerous machine learning models have been reported in pediatric care, their integration into AI-CDS remains incompletely realized to date. Important challenges in the application of AI models in pediatric care include the relatively lower rates of clinically significant outcomes compared to adults, and the lack of sufficiently large datasets available necessary for the development of machine learning models. In this review article, we summarize key concepts related to AI-CDS, its current application to pediatric care, and its potential benefits and risks. IMPACT: The performance of clinical decision support may be enhanced by the utilization of machine learning-based algorithms to improve the predictive performance of underlying models. Artificial intelligence-based clinical decision support (AI-CDS) uses models that are experientially improved through training and are particularly well suited toward high-dimensional data. The application of AI-CDS toward pediatric care remains limited currently but represents an important area of future research.
Affiliation(s)
- Sriram Ramgopal
- Division of Emergency Medicine, Ann & Robert H. Lurie Children's Hospital of Chicago, Department of Pediatrics, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- L. Nelson Sanchez-Pinto
- Division of Critical Care Medicine, Ann & Robert H. Lurie Children’s Hospital of Chicago, Department of Pediatrics, Northwestern University Feinberg School of Medicine, Chicago, IL, USA; Department of Preventive Medicine (Health and Biomedical Informatics), Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Christopher M. Horvat
- Department of Critical Care Medicine, UPMC Children’s Hospital of Pittsburgh, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Michael S. Carroll
- Data Analytics and Reporting, Ann & Robert H. Lurie Children’s Hospital of Chicago, Department of Pediatrics, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Yuan Luo
- Department of Preventive Medicine (Health and Biomedical Informatics), Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Todd A. Florin
- Division of Emergency Medicine, Ann & Robert H. Lurie Children’s Hospital of Chicago, Department of Pediatrics, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
4
French E, McInnes BT. An overview of biomedical entity linking throughout the years. J Biomed Inform 2023; 137:104252. [PMID: 36464228 PMCID: PMC9845184 DOI: 10.1016/j.jbi.2022.104252] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 06/22/2022] [Revised: 09/19/2022] [Accepted: 11/15/2022] [Indexed: 12/04/2022]
Abstract
Biomedical Entity Linking (BEL) is the task of mapping spans of text within biomedical documents to normalized, unique identifiers within an ontology. This is an important task in natural language processing, both for translational information extraction applications and for providing context to downstream tasks like relationship extraction. In this paper, we survey the progression of BEL from its inception in the late 80s to present-day state-of-the-art systems, provide a comprehensive list of datasets available for training BEL systems, reference shared tasks focused on BEL, discuss the technical components that comprise BEL systems, and discuss possible directions for the future of the field.
Affiliation(s)
- Evan French
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Bridget T McInnes
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
5
Kline A, Wang H, Li Y, Dennis S, Hutch M, Xu Z, Wang F, Cheng F, Luo Y. Multimodal machine learning in precision health: A scoping review. NPJ Digit Med 2022; 5:171. [PMID: 36344814 PMCID: PMC9640667 DOI: 10.1038/s41746-022-00712-8] [Citation(s) in RCA: 65] [Impact Index Per Article: 32.5] [Received: 06/20/2022] [Accepted: 10/14/2022] [Indexed: 11/09/2022] Open
Abstract
Machine learning is frequently leveraged to tackle problems in the health sector, including clinical decision support. Its use has historically been focused on single-modal data. Attempts to improve prediction and mimic the multimodal nature of clinical expert decision-making have been made in biomedical machine learning by fusing disparate data. This review was conducted to summarize the current studies in this field and identify topics ripe for future research. We conducted this review in accordance with the PRISMA extension for Scoping Reviews to characterize multi-modal data fusion in health. Search strings were established and used in the databases PubMed, Google Scholar, and IEEEXplore from 2011 to 2021. A final set of 128 articles was included in the analysis. The most common health areas utilizing multi-modal methods were neurology and oncology. Early fusion was the most common data merging strategy. Notably, there was an improvement in predictive performance when using data fusion. Lacking from the papers were clear clinical deployment strategies, FDA approval, and analysis of how using multimodal approaches from diverse sub-populations may improve biases and healthcare disparities. These findings provide a summary of multimodal data fusion as applied to health diagnosis/prognosis problems. Few papers compared the outputs of a multimodal approach with a unimodal prediction. However, those that did achieved an average increase of 6.4% in predictive accuracy. Multi-modal machine learning, while more robust in its estimations than unimodal methods, has drawbacks in its scalability and the time-consuming nature of information concatenation.
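Early fusion, named in this abstract as the most common data merging strategy, is usually understood as concatenating per-modality features into one vector before a single model is trained. The sketch below assumes that reading; the function name, modality names, and feature values are hypothetical and not taken from the review.

```python
from typing import Dict, List, Sequence


def early_fusion(features: Dict[str, Sequence[float]],
                 modality_order: Sequence[str]) -> List[float]:
    """Concatenate per-modality feature vectors in a fixed order."""
    fused: List[float] = []
    for modality in modality_order:
        # A fixed order keeps every feature at the same index for every patient.
        fused.extend(features[modality])
    return fused


# Hypothetical features for one patient from three modalities.
patient = {
    "imaging": [0.2, 0.7],
    "labs": [1.5],
    "notes": [0.0, 1.0, 0.0],
}

fused_vector = early_fusion(patient, ["imaging", "labs", "notes"])
print(fused_vector)
```

The fused vector would then be passed to one downstream model, in contrast to strategies that train one model per modality and combine their outputs afterwards.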
Affiliation(s)
- Adrienne Kline
- Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Hanyin Wang
- Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Yikuan Li
- Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Saya Dennis
- Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Meghan Hutch
- Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Zhenxing Xu
- Department of Population Health Sciences, Cornell University, New York, 10065, NY, USA
- Fei Wang
- Department of Population Health Sciences, Cornell University, New York, 10065, NY, USA
- Feixiong Cheng
- Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, 44195, OH, USA
- Yuan Luo
- Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
6
Lee Y, Son J, Song M. BertSRC: transformer-based semantic relation classification. BMC Med Inform Decis Mak 2022; 22:234. [PMID: 36068535 PMCID: PMC9446816 DOI: 10.1186/s12911-022-01977-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2022] [Accepted: 08/11/2022] [Indexed: 11/13/2022] Open
Abstract
The relationships between biomedical entities are complex, and many of them have not yet been identified. For many biomedical research areas, including drug discovery, it is of paramount importance to identify the relationships that have already been established through a comprehensive literature survey. However, manually searching through the literature is difficult as the number of biomedical publications continues to increase. Therefore, the relation classification task, which automatically mines meaningful relations from the literature, is in the spotlight in the field of biomedical text mining. By applying relation classification techniques to the accumulated biomedical literature, existing semantic relations between biomedical entities that can help to infer previously unknown relationships can be grasped efficiently. To develop semantic relation classification models, which are supervised machine learning models, it is essential to construct a training dataset that is manually annotated by biomedical experts with semantic relations among biomedical entities. Any advanced model must be trained on a dataset of reliable quality and meaningful scale before it can be deployed in the real world and assist biologists in their research. In addition, as the number of such public datasets increases, the performance of machine learning algorithms can be accurately revealed and compared by using those datasets as a benchmark for model development and improvement. In this paper, we aim to build such a dataset. Along with that, to validate the usability of the dataset as training data for relation classification models and to improve the performance of the relation extraction task, we built a relation classification model based on Bidirectional Encoder Representations from Transformers (BERT) trained on our dataset, applying our newly proposed fine-tuning methodology.
In experiments comparing performance among several models based on different deep learning algorithms, our model with the proposed fine-tuning methodology showed the best performance. The experimental results show that the constructed training dataset is an important information resource for the development and evaluation of semantic relation extraction models. Furthermore, relation extraction performance can be improved by integrating our proposed fine-tuning methodology. Therefore, this can lead to the promotion of future text mining research in the biomedical field.
Affiliation(s)
- Yeawon Lee
- Department of Library and Information Science, Yonsei University, Seoul, South Korea
- Jinseok Son
- Department of Digital Analytics, Yonsei University, Seoul, South Korea
- Min Song
- Department of Library and Information Science, Yonsei University, Seoul, South Korea
7
Yue T, He Z, Li C, Hu Z, Li Y. Lightweight fine-grained classification for scientific paper. Journal of Intelligent & Fuzzy Systems 2022. [DOI: 10.3233/jifs-213022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
The number of scientific papers has been increasing ever more rapidly. Researchers have to spend a lot of time classifying papers relevant to their study, especially into fine-grained subfields. However, almost all existing paper classification models are coarse-grained and cannot meet the needs of researchers. Observing this, we propose a lightweight fine-grained classification model for scientific papers. Dynamic weighting coefficients on feature words are incorporated into the model to improve the classification accuracy. The feature word weight is optimized by the Mean Decrease Accuracy (MDA) algorithm. For applicability, lightweight processing is conducted through algorithm pruning and training sample pruning. Comparison with mainstream models shows that our model improves accuracy and time efficiency simultaneously.
Affiliation(s)
- Tan Yue
- School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing, China
- Zihang He
- School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing, China
- Chang Li
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
- Zonghai Hu
- School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing, China
- Yong Li
- School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing, China
8
Deng Z, Sun C, Zhong G, Mao Y. Text Classification with Attention Gated Graph Neural Network. Cognit Comput 2022. [DOI: 10.1007/s12559-022-10017-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
9
Entity understanding with hierarchical graph learning for enhanced text classification. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108576] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
10
Devi R, Mehrotra D, Lamine SBAB. Constituent vs Dependency Parsing-Based RDF Model Generation from Dengue Patients’ Case Sheets. Journal of Information & Knowledge Management 2022. [DOI: 10.1142/s0219649222500137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Electronic Health Record (EHR) systems in healthcare organisations are primarily maintained in isolation from each other, which makes interoperability of the unstructured (text) data stored in these systems challenging. Different applications may describe similar information using different terminologies, a problem that can be avoided by transforming the content into the Resource Description Framework (RDF) model, which is interoperable amongst organisations. RDF requires a document’s contents to be translated into a repository of triplets (subject, predicate, object) known as RDF statements. Natural Language Processing (NLP) techniques can help extract actionable insights from these text data and create triplets for RDF model generation. This paper discusses two NLP-based approaches to generating RDF models from unstructured patients’ documents, namely dependency structure-based and constituent (phrase) structure-based parsing. Models generated by both approaches are evaluated in two aspects: exhaustiveness of the represented knowledge and model generation time. The precision measure is used to compute the models’ exhaustiveness in terms of the number of facts that are transformed into RDF representations.
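The (subject, predicate, object) statements and the precision measure mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `http://example.org/` namespace, the identifiers, and the example facts are hypothetical, objects are treated as plain string literals, and precision is taken as the share of extracted triples that also appear in a hand-checked gold set.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
BASE = "http://example.org/"   # hypothetical namespace, for illustration only


def to_ntriples(triples: Iterable[Triple]) -> str:
    """Serialize triples as N-Triples lines with string-literal objects.

    Subjects and predicates are assumed to be IRI-safe local names.
    """
    lines = []
    for subject, predicate, obj in triples:
        # Escape backslashes and double quotes inside the literal.
        literal = obj.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{BASE}{subject}> <{BASE}{predicate}> "{literal}" .')
    return "\n".join(lines)


def triple_precision(extracted: Set[Triple], gold: Set[Triple]) -> float:
    """Share of extracted triples that are also in the gold set."""
    if not extracted:
        return 0.0
    return len(extracted & gold) / len(extracted)


# Hypothetical output of a parser-based extractor and a gold annotation.
extracted = {
    ("patient1", "hasSymptom", "fever"),
    ("patient1", "hasSymptom", "headache"),
    ("patient1", "hasDiagnosis", "malaria"),
}
gold = {
    ("patient1", "hasSymptom", "fever"),
    ("patient1", "hasSymptom", "headache"),
    ("patient1", "hasDiagnosis", "dengue"),
}

print(to_ntriples(sorted(extracted)))
print(triple_precision(extracted, gold))
```

Plain tuples keep the sketch dependency-free; a real system would more likely build the graph with an RDF library and use controlled vocabularies for the predicates.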
Affiliation(s)
- Runumi Devi
- School of Computing Science and Engineering, Galgotias University, Yamuna Expressway, Greater Noida, Gautam Buddh Nagar, Uttar Pradesh, India
- Amity Institute of Information Technology, Amity University Uttar Pradesh, Noida, India
- Deepti Mehrotra
- Amity School of Engineering and Technology, Amity University Uttar Pradesh, Noida, India
11
Zhao Y, Yu Y, Wang H, Li Y, Deng Y, Jiang G, Luo Y. Machine Learning in Causal Inference: Application in Pharmacovigilance. Drug Saf 2022; 45:459-476. [PMID: 35579811 PMCID: PMC9114053 DOI: 10.1007/s40264-022-01155-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Accepted: 02/09/2022] [Indexed: 01/28/2023]
Abstract
Monitoring adverse drug events or pharmacovigilance has been promoted by the World Health Organization to assure the safety of medicines through a timely and reliable information exchange regarding drug safety issues. We aim to discuss the application of machine learning methods as well as causal inference paradigms in pharmacovigilance. We first reviewed data sources for pharmacovigilance. Then, we examined traditional causal inference paradigms, their applications in pharmacovigilance, and how machine learning methods and causal inference paradigms were integrated to enhance the performance of traditional causal inference paradigms. Finally, we summarized issues with currently mainstream correlation-based machine learning models and how the machine learning community has tried to address these issues by incorporating causal inference paradigms. Our literature search revealed that most existing data sources and tasks for pharmacovigilance were not designed for causal inference. Additionally, pharmacovigilance was lagging in adopting machine learning-causal inference integrated models. We highlight several currently trending directions or gaps to integrate causal inference with machine learning in pharmacovigilance research. Finally, our literature search revealed that the adoption of causal paradigms can mitigate known issues with machine learning models. We foresee that the pharmacovigilance domain can benefit from the progress in the machine learning field.
Affiliation(s)
- Yiqing Zhao
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 750 N Lake Shore Drive, Room 11-189, Chicago, IL, 60611, USA
- Yue Yu
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, 55902, USA
- Hanyin Wang
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 750 N Lake Shore Drive, Room 11-189, Chicago, IL, 60611, USA
- Yikuan Li
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 750 N Lake Shore Drive, Room 11-189, Chicago, IL, 60611, USA
- Yu Deng
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 750 N Lake Shore Drive, Room 11-189, Chicago, IL, 60611, USA
- Guoqian Jiang
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, 55902, USA
- Yuan Luo
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 750 N Lake Shore Drive, Room 11-189, Chicago, IL, 60611, USA
12
Zhang J, Huang W, Ji D, Ren Y. Globally normalized neural model for joint entity and event extraction. Inf Process Manag 2021. [DOI: 10.1016/j.ipm.2021.102636] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/24/2022]
13
Weinzierl MA, Maldonado R, Harabagiu SM. The impact of learning Unified Medical Language System knowledge embeddings in relation extraction from biomedical texts. J Am Med Inform Assoc 2021; 27:1556-1567. [PMID: 33029619 DOI: 10.1093/jamia/ocaa205] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/11/2020] [Revised: 07/23/2020] [Accepted: 08/07/2020] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE We explored how knowledge embeddings (KEs) learned from the Unified Medical Language System (UMLS) Metathesaurus impact the quality of relation extraction on 2 diverse sets of biomedical texts. MATERIALS AND METHODS Two forms of KEs were learned for concepts and relation types from the UMLS Metathesaurus, namely lexicalized knowledge embeddings (LKEs) and unlexicalized KEs. A knowledge embedding encoder (KEE) enabled learning either LKEs or unlexicalized KEs as well as neural models capable of producing LKEs for mentions of biomedical concepts in texts and relation types that are not encoded in the UMLS Metathesaurus. This allowed us to design the relation extraction with knowledge embeddings (REKE) system, which incorporates either LKEs or unlexicalized KEs produced for relation types of interest and their arguments. RESULTS The incorporation of either LKEs or unlexicalized KE in REKE advances the state of the art in relation extraction on 2 relation extraction datasets: the 2010 i2b2/VA dataset and the 2013 Drug-Drug Interaction Extraction Challenge corpus. Moreover, the impact of LKEs is superior, achieving F1 scores of 78.2 and 82.0, respectively. DISCUSSION REKE not only highlights the importance of incorporating knowledge encoded in the UMLS Metathesaurus in a novel way, through 2 possible forms of KEs, but it also showcases the subtleties of incorporating KEs in relation extraction systems. CONCLUSIONS Incorporating LKEs informed by the UMLS Metathesaurus in a relation extraction system operating on biomedical texts shows significant promise. We present the REKE system, which establishes new state-of-the-art results for relation extraction on 2 datasets when using LKEs.
Affiliation(s)
- Maxwell A Weinzierl
- Human Language Technology Research Institute, Department of Computer Science, Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas, USA
- Ramon Maldonado
- Human Language Technology Research Institute, Department of Computer Science, Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas, USA
- Sanda M Harabagiu
- Human Language Technology Research Institute, Department of Computer Science, Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas, USA
14
Tran T, Kavuluru R, Kilicoglu H. Attention-Gated Graph Convolutions for Extracting Drug Interaction Information from Drug Labels. ACM Transactions on Computing for Healthcare 2021; 2:10. [PMID: 34541578 PMCID: PMC8445229 DOI: 10.1145/3423209] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/01/2019] [Accepted: 09/01/2020] [Indexed: 01/02/2023]
Abstract
Preventable adverse events as a result of medical errors present a growing concern in the healthcare system. As drug-drug interactions (DDIs) may lead to preventable adverse events, being able to extract DDIs from drug labels into a machine-processable form is an important step toward effective dissemination of drug safety information. Herein, we tackle the problem of jointly extracting mentions of drugs and their interactions, including interaction outcome, from drug labels. Our deep learning approach entails composing various intermediate representations, including graph-based context derived using graph convolutions (GCs) with a novel attention-based gating mechanism (holistically called GCA), which are combined in meaningful ways to predict on all subtasks jointly. Our model is trained and evaluated on the 2018 TAC DDI corpus. Our GCA model in conjunction with transfer learning performs at 39.20% F1 and 26.09% F1 on entity recognition (ER) and relation extraction (RE), respectively, on the first official test set and at 45.30% F1 and 27.87% F1 on ER and RE, respectively, on the second official test set. These updated results lead to improvements over our prior best by up to 6 absolute F1 points. After controlling for available training data, the proposed model exhibits state-of-the-art performance for this task.
Affiliation(s)
- Tung Tran
- University of Kentucky, United States
15
Shi X, Yi Y, Xiong Y, Tang B, Chen Q, Wang X, Ji Z, Zhang Y, Xu H. Extracting entities with attributes in clinical text via joint deep learning. J Am Med Inform Assoc 2021; 26:1584-1591. [PMID: 31550346 DOI: 10.1093/jamia/ocz158] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 02/03/2019] [Revised: 07/18/2019] [Accepted: 08/15/2019] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVE Extracting clinical entities and their attributes is a fundamental task of natural language processing (NLP) in the medical domain. This task is typically recognized as 2 sequential subtasks in a pipeline, clinical entity or attribute recognition followed by entity-attribute relation extraction. One problem of pipeline methods is that errors from entity recognition are unavoidably passed to relation extraction. We propose a novel joint deep learning method to recognize clinical entities or attributes and extract entity-attribute relations simultaneously. MATERIALS AND METHODS The proposed method integrates 2 state-of-the-art methods for named entity recognition and relation extraction, namely bidirectional long short-term memory with conditional random field and bidirectional long short-term memory, into a unified framework. In this method, relation constraints between clinical entities and attributes and weights of the 2 subtasks are also considered simultaneously. We compare the method with other related methods (ie, pipeline methods and other joint deep learning methods) on an existing English corpus from SemEval-2015 and a newly developed Chinese corpus. RESULTS Our proposed method achieves the best F1 of 74.46% on entity recognition and the best F1 of 50.21% on relation extraction on the English corpus, and 89.32% and 88.13% on the Chinese corpora, respectively, which outperform the other methods on both tasks. CONCLUSIONS The joint deep learning-based method could improve both entity recognition and relation extraction from clinical text in both English and Chinese, indicating that the approach is promising.
Affiliation(s)
- Xue Shi
  - Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
- Yingping Yi
  - Department of Science and Education, The Second Affiliated Hospital of Nanchang University, Nanchang, China
- Ying Xiong
  - Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
- Buzhou Tang
  - Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
  - Peng Cheng Laboratory
- Qingcai Chen
  - Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
- Xiaolong Wang
  - Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
- Zongcheng Ji
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Yaoyun Zhang
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Hua Xu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
|
16
|
Sousa D, Lamurias A, Couto FM. Using Neural Networks for Relation Extraction from Biomedical Literature. Methods Mol Biol 2021; 2190:289-305. [PMID: 32804372 DOI: 10.1007/978-1-0716-0826-5_14] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/11/2023]
Abstract
Using different sources of information to support automated extracting of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, namely, using neural networks algorithms. The use of multichannel architectures composed of multiple data representations, as in deep neural networks, is leading to state-of-the-art results. The right combination of data representations can eventually lead us to even higher evaluation scores in relation extraction tasks. Thus, biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. The incorporation of biomedical ontologies has already been proved to enhance previous state-of-the-art results.
Affiliation(s)
- Diana Sousa
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
- Andre Lamurias
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
- Francisco M Couto
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
|
17
|
Zhang L, Hu J, Xu Q, Li F, Rao G, Tao C. A semantic relationship mining method among disorders, genes, and drugs from different biomedical datasets. BMC Med Inform Decis Mak 2020; 20:283. [PMID: 33317518 PMCID: PMC7734713 DOI: 10.1186/s12911-020-01274-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/21/2020] [Accepted: 09/22/2020] [Indexed: 12/16/2022]
Abstract
BACKGROUND Semantic web technology has been applied widely in the biomedical informatics field. Large numbers of biomedical datasets are available online in the resource description framework (RDF) format. Semantic relationship mining among genes, disorders, and drugs is widely used in, for example, precision medicine and drug repositioning. However, most of the existing studies focused on a single dataset. It is not easy to find the most current relationships among disorder-gene-drug relationships since the relationships are distributed in heterogeneous datasets. How to mine their semantic relationships from different biomedical datasets is an important issue. METHODS First, a variety of biomedical datasets were converted into RDF triple data; then, multisource biomedical datasets were integrated into a storage system using a data integration algorithm. Second, nine query patterns among genes, disorders, and drugs from different biomedical datasets were designed. Third, the gene-disorder-drug semantic relationship mining algorithm is presented. This algorithm can query the relationships among various entities from different datasets. RESULTS AND CONCLUSIONS We focused on mining the putative and the most current disorder-gene-drug relationships about Parkinson's disease (PD). The results demonstrate that our method has significant advantages in mining and integrating multisource heterogeneous biomedical datasets. Twenty-five new relationships among the genes, disorders, and drugs were mined from four different datasets. The query results showed that most of them came from different datasets. The precision of the method increased by 2.51% compared to that of the multisource linked open data fusion method presented in the 4th International Workshop on Semantics-Powered Data Mining and Analytics (SEPDA 2019). Moreover, the number of query results increased by 7.7%, and the number of correct queries increased by 9.5%.
Affiliation(s)
- Li Zhang
  - School of Economics and Management, Tianjin University of Science and Technology, Tianjin, 300457 China
- Jiamei Hu
  - School of Economics and Management, Tianjin University of Science and Technology, Tianjin, 300457 China
- Qianzhi Xu
  - School of Economics and Management, Tianjin University of Science and Technology, Tianjin, 300457 China
- Fang Li
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX 77030 USA
- Guozheng Rao
  - College of Intelligence and Computing, Tianjin University, Tianjin, 300350 China
  - Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, 300350 China
- Cui Tao
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX 77030 USA
|
18
|
Abdulkadhar S, Bhasuran B, Natarajan J. Multiscale Laplacian graph kernel combined with lexico-syntactic patterns for biomedical event extraction from literature. Knowl Inf Syst 2020. [DOI: 10.1007/s10115-020-01514-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/29/2022]
|
19
|
Yan S, Wong KC. Context awareness and embedding for biomedical event extraction. Bioinformatics 2020; 36:637-643. [PMID: 31392318 DOI: 10.1093/bioinformatics/btz607] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 09/23/2018] [Revised: 07/26/2019] [Accepted: 08/06/2019] [Indexed: 11/13/2022]
Abstract
MOTIVATION Biomedical event extraction is fundamental for information extraction in molecular biology and biomedical research. The detected events form the central basis for comprehensive biomedical knowledge fusion, facilitating the digestion of massive information influx from the literature. Limited by the event context, the existing event detection models are mostly applicable for a single task. A general and scalable computational model is desiderated for biomedical knowledge management. RESULTS We consider and propose a bottom-up detection framework to identify the events from recognized arguments. To capture the relations between the arguments, we trained a bidirectional long short-term memory network to model their context embedding. Leveraging the compositional attributes, we further derived the candidate samples for training event classifiers. We built our models on the datasets from BioNLP Shared Task for evaluations. Our method achieved the average F-scores of 0.81 and 0.92 on BioNLPST-BGI and BioNLPST-BB datasets, respectively. Comparing with seven state-of-the-art methods, our method nearly doubled the existing F-score performance (0.92 versus 0.56) on the BioNLPST-BB dataset. Case studies were conducted to reveal the underlying reasons. AVAILABILITY AND IMPLEMENTATION https://github.com/cskyan/evntextrc. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shankai Yan
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR 999077
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR 999077
|
20
|
Perera N, Dehmer M, Emmert-Streib F. Named Entity Recognition and Relation Detection for Biomedical Information Extraction. Front Cell Dev Biol 2020; 8:673. [PMID: 32984300 PMCID: PMC7485218 DOI: 10.3389/fcell.2020.00673] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 12/12/2019] [Accepted: 07/02/2020] [Indexed: 12/29/2022]
Abstract
The number of scientific publications in the literature is steadily growing, containing our knowledge in the biomedical, health, and clinical sciences. Since there is currently no automatic archiving of the obtained results, much of this information remains buried in textual details not readily available for further usage or analysis. For this reason, natural language processing (NLP) and text mining methods are used for information extraction from such publications. In this paper, we review practices for Named Entity Recognition (NER) and Relation Detection (RD), allowing, e.g., to identify interactions between proteins and drugs or genes and diseases. This information can be integrated into networks to summarize large-scale details on a particular biomedical or clinical problem, which is then amenable for easy data management and further analysis. Furthermore, we survey novel deep learning methods that have recently been introduced for such tasks.
Affiliation(s)
- Nadeesha Perera
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Matthias Dehmer
  - Department of Mechatronics and Biomedical Computer Science, University for Health Sciences, Medical Informatics and Technology (UMIT), Hall in Tirol, Austria
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Frank Emmert-Streib
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
  - Faculty of Medicine and Health Technology, Institute of Biosciences and Medical Technology, Tampere University, Tampere, Finland
|
21
|
Peterson KJ, Jiang G, Liu H. A corpus-driven standardization framework for encoding clinical problems with HL7 FHIR. J Biomed Inform 2020; 110:103541. [PMID: 32814201 DOI: 10.1016/j.jbi.2020.103541] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/17/2020] [Revised: 08/09/2020] [Accepted: 08/13/2020] [Indexed: 01/17/2023]
Abstract
Free-text problem descriptions are brief explanations of patient diagnoses and issues, commonly found in problem lists and other prominent areas of the medical record. These compact representations often express complex and nuanced medical conditions, making their semantics challenging to fully capture and standardize. In this study, we describe a framework for transforming free-text problem descriptions into standardized Health Level 7 (HL7) Fast Healthcare Interoperability Resources (FHIR) models. This approach leverages a combination of domain-specific dependency parsers, Bidirectional Encoder Representations from Transformers (BERT) natural language models, and cui2vec Unified Medical Language System (UMLS) concept vectors to align extracted concepts from free-text problem descriptions into structured FHIR models. A neural network classification model is used to classify thirteen relationship types between concepts, facilitating mapping to the FHIR Condition resource. We use data programming, a weak supervision approach, to eliminate the need for a manually annotated training corpus. Shapley values, a mechanism to quantify contribution, are used to interpret the impact of model features. We found that our methods identified the focus concept, or primary clinical concern of the problem description, with an F1 score of 0.95. Relationships from the focus to other modifying concepts were extracted with an F1 score of 0.90. When classifying relationships, our model achieved a 0.89 weighted average F1 score, enabling accurate mapping of attributes into HL7 FHIR models. We also found that the BERT input representation predominantly contributed to the classifier decision as shown by the Shapley values analysis.
Affiliation(s)
- Kevin J Peterson
  - Department of Information Technology, Mayo Clinic, Rochester, MN 55905, United States
  - Bioinformatics and Computational Biology Program, University of Minnesota, Minneapolis, MN 55455, United States
- Guoqian Jiang
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, United States
  - Bioinformatics and Computational Biology Program, University of Minnesota, Minneapolis, MN 55455, United States
- Hongfang Liu
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, United States
  - Bioinformatics and Computational Biology Program, University of Minnesota, Minneapolis, MN 55455, United States
|
22
|
Kilicoglu H, Rosemblat G, Fiszman M, Shin D. Broad-coverage biomedical relation extraction with SemRep. BMC Bioinformatics 2020; 21:188. [PMID: 32410573 PMCID: PMC7222583 DOI: 10.1186/s12859-020-3517-7] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Received: 02/18/2020] [Accepted: 04/29/2020] [Indexed: 11/10/2022]
Abstract
BACKGROUND In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep's performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships. RESULTS A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F1 score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F1 score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F1 score. The recall and the F1 score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level. CONCLUSIONS SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis.
Affiliation(s)
- Halil Kilicoglu
  - Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
  - University of Illinois at Urbana-Champaign, School of Information Sciences, 501 E Daniel Street, Champaign, 61820 IL USA
- Graciela Rosemblat
  - Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
- Dongwook Shin
  - Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
|
23
|
Pesaranghader A, Matwin S, Sokolova M, Pesaranghader A. deepBioWSD: effective deep neural word sense disambiguation of biomedical text data. J Am Med Inform Assoc 2020; 26:438-446. [PMID: 30811548 DOI: 10.1093/jamia/ocy189] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 09/21/2018] [Revised: 12/03/2018] [Accepted: 12/19/2018] [Indexed: 01/05/2023]
Abstract
OBJECTIVE In biomedicine, there is a wealth of information hidden in unstructured narratives such as research articles and clinical reports. To exploit these data properly, a word sense disambiguation (WSD) algorithm prevents downstream difficulties in the natural language processing applications pipeline. Supervised WSD algorithms largely outperform un- or semisupervised and knowledge-based methods; however, they train 1 separate classifier for each ambiguous term, necessitating a large amount of expert-labeled training data, an unattainable goal in medical informatics. To alleviate this need, a single model that shares statistical strength across all instances and scales well with the vocabulary size is desirable. MATERIALS AND METHODS Built on recent advances in deep learning, our deepBioWSD model leverages 1 single bidirectional long short-term memory network that makes sense predictions for any ambiguous term. In the model, the Unified Medical Language System sense embeddings are first computed using their text definitions; then, after the network is initialized with these embeddings, it is trained on all (available) training data collectively. This method also considers a novel technique for automatic collection of training data from PubMed to (pre)train the network in an unsupervised manner. RESULTS We use the MSH WSD dataset to compare WSD algorithms, with macro and micro accuracies employed as evaluation metrics. deepBioWSD outperforms existing models in biomedical text WSD by achieving the state-of-the-art performance of 96.82% for macro accuracy. CONCLUSIONS Apart from the disambiguation improvement and unsupervised training, deepBioWSD depends on considerably less expert-labeled data, as it learns the target and the context terms jointly. These merits make deepBioWSD conveniently deployable in real-time biomedical applications.
Affiliation(s)
- Ahmad Pesaranghader
  - Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada
  - Institute for Big Data Analytics, Dalhousie University, Halifax, NS B3H 4R2, Canada
- Stan Matwin
  - Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada
  - Institute for Big Data Analytics, Dalhousie University, Halifax, NS B3H 4R2, Canada
- Marina Sokolova
  - Institute for Big Data Analytics, Dalhousie University, Halifax, NS B3H 4R2, Canada
  - School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada
  - School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON K1G 5Z3, Canada
- Ali Pesaranghader
  - School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada
|
24
|
Cullen CM, Aneja KK, Beyhan S, Cho CE, Woloszynek S, Convertino M, McCoy SJ, Zhang Y, Anderson MZ, Alvarez-Ponce D, Smirnova E, Karstens L, Dorrestein PC, Li H, Sen Gupta A, Cheung K, Powers JG, Zhao Z, Rosen GL. Emerging Priorities for Microbiome Research. Front Microbiol 2020; 11:136. [PMID: 32140140 PMCID: PMC7042322 DOI: 10.3389/fmicb.2020.00136] [Citation(s) in RCA: 77] [Impact Index Per Article: 19.3] [Received: 08/13/2019] [Accepted: 01/21/2020] [Indexed: 12/12/2022]
Abstract
Microbiome research has increased dramatically in recent years, driven by advances in technology and significant reductions in the cost of analysis. Such research has unlocked a wealth of data, which has yielded tremendous insight into the nature of the microbial communities, including their interactions and effects, both within a host and in an external environment as part of an ecological community. Understanding the role of microbiota, including their dynamic interactions with their hosts and other microbes, can enable the engineering of new diagnostic techniques and interventional strategies that can be used in a diverse spectrum of fields, spanning from ecology and agriculture to medicine and from forensics to exobiology. From June 19-23 in 2017, the NIH and NSF jointly held an Innovation Lab on Quantitative Approaches to Biomedical Data Science Challenges in our Understanding of the Microbiome. This review is inspired by some of the topics that arose as priority areas from this unique, interactive workshop. The goal of this review is to summarize the Innovation Lab's findings by introducing the reader to emerging challenges, exciting potential, and current directions in microbiome research. The review is broken into five key topic areas: (1) interactions between microbes and the human body, (2) evolution and ecology of microbes, including the role played by the environment and microbe-microbe interactions, (3) analytical and mathematical methods currently used in microbiome research, (4) leveraging knowledge of microbial composition and interactions to develop engineering solutions, and (5) interventional approaches and engineered microbiota that may be enabled by selectively altering microbial composition. As such, this review seeks to arm the reader with a broad understanding of the priorities and challenges in microbiome research today and provide inspiration for future investigation and multi-disciplinary collaboration.
Affiliation(s)
- Chad M. Cullen
  - School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA, United States
- Sinem Beyhan
  - Department of Infectious Diseases, J. Craig Venter Institute, La Jolla, CA, United States
- Clara E. Cho
  - Department of Nutrition, Dietetics and Food Sciences, Utah State University, Logan, UT, United States
- Stephen Woloszynek
  - Ecological and Evolutionary Signal-processing and Informatics Laboratory (EESI), Electrical and Computer Engineering, Drexel University, Philadelphia, PA, United States
  - College of Medicine, Drexel University, Philadelphia, PA, United States
- Matteo Convertino
  - Nexus Group, Faculty of Information Science and Technology, Gi-CoRE Station for Big Data & Cybersecurity, Hokkaido University, Sapporo, Japan
- Sophie J. McCoy
  - Department of Biological Science, Florida State University, Tallahassee, FL, United States
- Yanyan Zhang
  - Department of Civil Engineering, New Mexico State University, Las Cruces, NM, United States
- Matthew Z. Anderson
  - Department of Microbiology, The Ohio State University, Columbus, OH, United States
  - Department of Microbial Infection and Immunity, The Ohio State University, Columbus, OH, United States
- Ekaterina Smirnova
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, United States
- Lisa Karstens
  - Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, United States
  - Department of Obstetrics and Gynecology, Oregon Health & Science University, Portland, OR, United States
- Pieter C. Dorrestein
  - Collaborative Mass Spectrometry Innovation Center, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, San Diego, CA, United States
- Hongzhe Li
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Ananya Sen Gupta
  - Department of Electrical and Computer Engineering, The University of Iowa, Iowa City, IA, United States
- Kevin Cheung
  - Department of Dermatology, The University of Iowa, Iowa City, IA, United States
- Zhengqiao Zhao
  - Ecological and Evolutionary Signal-processing and Informatics Laboratory (EESI), Electrical and Computer Engineering, Drexel University, Philadelphia, PA, United States
- Gail L. Rosen
  - School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA, United States
  - Ecological and Evolutionary Signal-processing and Informatics Laboratory (EESI), Electrical and Computer Engineering, Drexel University, Philadelphia, PA, United States
|
25
|
Li Y, Jin R, Luo Y. Classifying relations in clinical narratives using segment graph convolutional and recurrent neural networks (Seg-GCRNs). J Am Med Inform Assoc 2019; 26:262-268. [PMID: 30590613 DOI: 10.1093/jamia/ocy157] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 08/05/2018] [Accepted: 11/03/2018] [Indexed: 01/16/2023]
Abstract
We propose to use segment graph convolutional and recurrent neural networks (Seg-GCRNs), which use only word embedding and sentence syntactic dependencies, to classify relations from clinical notes without manual feature engineering. In this study, the relations between 2 medical concepts are classified by simultaneously learning representations of text segments in the context of sentence syntactic dependency: preceding, concept1, middle, concept2, and succeeding segments. Seg-GCRN was systematically evaluated on the i2b2/VA relation classification challenge datasets. Experiments show that Seg-GCRN attains state-of-the-art micro-averaged F-measure for all 3 relation categories: 0.692 for classifying medical treatment-problem relations, 0.827 for medical test-problem relations, and 0.741 for medical problem-medical problem relations. Comparison with the previous state-of-the-art segment convolutional neural network (Seg-CNN) suggests that adding syntactic dependency information helps refine medical word embedding and improves concept relation classification without manual feature engineering. Seg-GCRN can be trained efficiently for the i2b2/VA dataset on a GPU platform.
Affiliation(s)
- Yifu Li
  - Grado Department of Industrial and Systems Engineering (ISE), Virginia Tech, Blacksburg, Virginia, USA
- Ran Jin
  - Grado Department of Industrial and Systems Engineering (ISE), Virginia Tech, Blacksburg, Virginia, USA
- Yuan Luo
  - Department of Preventive Medicine, Northwestern University, Chicago, Illinois, USA
|
26
|
Rosemblat G, Fiszman M, Shin D, Kilicoglu H. Towards a characterization of apparent contradictions in the biomedical literature using context analysis. J Biomed Inform 2019; 98:103275. [PMID: 31473364 DOI: 10.1016/j.jbi.2019.103275] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 03/07/2019] [Revised: 08/26/2019] [Accepted: 08/28/2019] [Indexed: 11/19/2022]
Abstract
BACKGROUND With the substantial growth in the biomedical research literature, a larger number of claims are published daily, some of which seemingly disagree with or contradict prior claims on the same topics. Resolving such contradictions is critical to advancing our understanding of human disease and developing effective treatments. Automated text analysis techniques can facilitate such analysis by extracting claims from the literature, flagging those that are potentially contradictory, and identifying any study characteristics that may explain such contradictions. METHODS Using SemMedDB, our own PubMed-scale repository of semantic predications (subject-relation-object triples), we identified apparent contradictions in the biomedical research literature and developed a categorization of contextual characteristics that explain such contradictions. Clinically relevant semantic predications relating to 20 diseases and involving opposing predicate pairs (e.g., an intervention treats or causes a disease) were retrieved from SemMedDB. After addressing inference, uncertainty, generic concepts, and NLP errors through automatic and manual filtering steps, a set of apparent contradictions were identified and characterized. RESULTS We retrieved 117,676 predication instances from 62,360 PubMed abstracts (Jan 1980-Dec 2016). From these instances, automatic filtering steps generated 2236 candidate contradictory pairs. Through manual analysis, we determined that 58 of these pairs (2.6%) were apparent contradictions. We identified five main categories of contextual characteristics that explain these contradictions: (a) internal to the patient, (b) external to the patient, (c) endogenous/exogenous, (d) known controversy, and (e) contradictions in literature. Categories (a) and (b) were subcategorized further (e.g., species, dosage) and accounted for the bulk of the contradictory information. 
CONCLUSIONS Semantic predications, by accounting for lexical variability, and SemMedDB, owing to its literature scale, can support identification and elucidation of potentially contradictory claims across the biomedical domain. Further filtering and classification steps are needed to distinguish among them the true contradictory claims. The ability to detect contradictions automatically can facilitate important biomedical knowledge management tasks, such as tracking and verifying scientific claims, summarizing research on a given topic, identifying knowledge gaps, and assessing evidence for systematic reviews, with potential benefits to the scientific community. Future work will focus on automating these steps for fully automatic recognition of contradictions from the biomedical research literature.
Affiliation(s)
- Graciela Rosemblat
  - National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Marcelo Fiszman
  - National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Dongwook Shin
  - National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Halil Kilicoglu
  - National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
|
27
|
Minimalistic Approach to Coreference Resolution in Lithuanian Medical Records. Comput Math Methods Med 2019; 2019:9079840. [PMID: 31015858 PMCID: PMC6446105 DOI: 10.1155/2019/9079840] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/18/2019] [Accepted: 02/26/2019] [Indexed: 12/20/2022]
Abstract
Coreference resolution is a challenging part of natural language processing (NLP) with applications in machine translation, semantic search and other information retrieval, and decision support systems. Coreference resolution requires linguistic preprocessing and rich language resources for automatically identifying and resolving such expressions. Many rarer and under-resourced languages (such as Lithuanian) lack the required language resources and tools. We present a method for coreference resolution in the Lithuanian language and its application to processing e-health records from a hospital reception. Our novelty is the ability to process coreferences with minimal linguistic resources, which is important in linguistic applications for rare and endangered languages. The experimental results show that coreference resolution is applicable to the development of NLP-powered online healthcare services in Lithuania.
|
28
|
Assale M, Dui LG, Cina A, Seveso A, Cabitza F. The Revival of the Notes Field: Leveraging the Unstructured Content in Electronic Health Records. Front Med (Lausanne) 2019; 6:66. [PMID: 31058150 PMCID: PMC6478793 DOI: 10.3389/fmed.2019.00066] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 11/27/2018] [Accepted: 03/18/2019] [Indexed: 01/01/2023]
Abstract
Problem: Clinical practice requires the production of a large volume of notes, which is time- and resource-consuming. These notes contain relevant information, but their secondary use is almost impossible because of their unstructured nature. Researchers are trying to address this problem with both traditional and promising novel techniques. Application in real hospital settings does not yet seem possible, though, both because datasets are relatively small and dirty and because language-specific pre-trained models are lacking. Aim: Our aim is to demonstrate the potential of the above techniques, but also to raise awareness of the still-open challenges that the scientific communities of IT and medical practitioners must jointly address to realize the full potential of the unstructured content that is produced and digitized daily in hospital settings, both to improve its data quality and to leverage the insights from data-driven predictive models. Methods: To this end, we present a narrative literature review of the most recent and relevant contributions on the application of Natural Language Processing techniques to the free-text content of electronic patient records. In particular, we focus on four selected application domains, namely: data quality, information extraction, sentiment analysis and predictive models, and automated patient cohort selection. We then present a few empirical studies that we undertook at a major teaching hospital specializing in musculoskeletal diseases. Results: We provide the reader with some simple and affordable pipelines, which demonstrate the feasibility of reaching literature performance levels with a single-institution, non-English dataset. In this way, we bridge the literature and real-world needs, taking a further step toward the revival of notes fields.
Affiliation(s)
- Michela Assale
- K-tree SRL, Pont-Saint-Martin, Italy
- University of Milano-Bicocca, Milan, Italy
| | - Linda Greta Dui
- Politecnico di Milano, Milan, Italy
- Link-Up Datareg, Cinisello Balsamo, Italy
| | - Andrea Cina
- K-tree SRL, Pont-Saint-Martin, Italy
- University of Milano-Bicocca, Milan, Italy
| | - Andrea Seveso
- University of Milano-Bicocca, Milan, Italy
- Link-Up Datareg, Cinisello Balsamo, Italy
| | - Federico Cabitza
- University of Milano-Bicocca, Milan, Italy
- IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
| |
|
29
|
Luo Y, Cheng Y, Uzuner Ö, Szolovits P, Starren J. Segment convolutional neural networks (Seg-CNNs) for classifying relations in clinical notes. J Am Med Inform Assoc 2019; 25:93-98. [PMID: 29025149 DOI: 10.1093/jamia/ocx090] [Citation(s) in RCA: 49] [Impact Index Per Article: 9.8] [Received: 04/21/2017] [Accepted: 08/05/2017] [Indexed: 11/13/2022]
Abstract
We propose Segment Convolutional Neural Networks (Seg-CNNs) for classifying relations from clinical notes. Seg-CNNs use only word-embedding features without manual feature engineering. Unlike typical CNN models, relations between 2 concepts are identified by simultaneously learning separate representations for text segments in a sentence: preceding, concept1, middle, concept2, and succeeding. We evaluate Seg-CNN on the i2b2/VA relation classification challenge dataset. We show that Seg-CNN achieves a state-of-the-art micro-average F-measure of 0.742 for overall evaluation, 0.686 for classifying medical problem-treatment relations, 0.820 for medical problem-test relations, and 0.702 for medical problem-medical problem relations. We demonstrate the benefits of learning segment-level representations. We show that medical domain word embeddings help improve relation classification. Seg-CNNs can be trained quickly for the i2b2/VA dataset on a graphics processing unit (GPU) platform. These results support the use of CNNs computed over segments of text for classifying medical relations, as they show state-of-the-art performance while requiring no manual feature engineering.
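The five-segment view of a sentence described above can be illustrated with a short sketch. It covers only the segmentation step, not the convolutional model; the function name, the token-span convention (start inclusive, end exclusive), and the example sentence are assumptions made for illustration.

```python
def split_segments(tokens, c1, c2):
    """Split a tokenized sentence into the five segments named in the
    abstract. c1 and c2 are (start, end) token spans, end exclusive,
    with c1 located before c2."""
    (s1, e1), (s2, e2) = c1, c2
    if not (0 <= s1 < e1 <= s2 < e2 <= len(tokens)):
        raise ValueError("expected two ordered, non-overlapping spans")
    return {
        "preceding": tokens[:s1],
        "concept1": tokens[s1:e1],
        "middle": tokens[e1:s2],
        "concept2": tokens[s2:e2],
        "succeeding": tokens[e2:],
    }

# Invented example sentence with a treatment and a problem concept.
tokens = "the patient was given aspirin for chest pain today".split()
segments = split_segments(tokens, (4, 5), (6, 8))
for name, seg in segments.items():
    print(name, seg)
```

In a segment-based model, each of these token lists would be embedded and encoded separately before the representations are combined for classification.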
Affiliation(s)
- Yuan Luo
- Department of Preventive Medicine, Northwestern University, Chicago, IL, USA
| | - Yu Cheng
- AI Foundations, IBM Thomas J Watson Research Center, Yorktown Heights, NY, USA
| | - Özlem Uzuner
- Department of Computer Science, State University of New York at Albany, Albany, NY, USA
| | - Peter Szolovits
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Justin Starren
- Department of Preventive Medicine and Medical Social Science, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
| |
|
30
|
Zeng Z, Deng Y, Li X, Naumann T, Luo Y. Natural Language Processing for EHR-Based Computational Phenotyping. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:139-153. [PMID: 29994486 PMCID: PMC6388621 DOI: 10.1109/tcbb.2018.2849968] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Indexed: 06/08/2023]
Abstract
This article reviews recent advances in applying natural language processing (NLP) to Electronic Health Records (EHRs) for computational phenotyping. NLP-based computational phenotyping has numerous applications including diagnosis categorization, novel phenotype discovery, clinical trial screening, pharmacogenomics, drug-drug interaction (DDI), and adverse drug event (ADE) detection, as well as genome-wide and phenome-wide association studies. Significant progress has been made in algorithm development and resource construction for computational phenotyping. Among the surveyed methods, well-designed keyword search and rule-based systems often achieve good performance. However, the construction of keyword and rule lists requires significant manual effort, which is difficult to scale. Supervised machine learning models have been favored because they are capable of acquiring both classification patterns and structures from data. Recently, deep learning and unsupervised learning have received growing attention, with the former favored for its performance and the latter for its ability to find novel phenotypes. Integrating heterogeneous data sources has become increasingly important and has shown promise in improving model performance. Often, better performance is achieved by combining multiple modalities of information. Despite these many advances, challenges and opportunities remain for NLP-based computational phenotyping, including better model interpretability and generalizability, and proper characterization of feature relations in clinical narratives.
Affiliation(s)
- Zexian Zeng
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611.
| | - Yu Deng
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611.
| | - Xiaoyu Li
- Department of Social and Behavioral Sciences, Harvard T.H. Chan School of Public Health, Boston, MA 02115.
| | - Tristan Naumann
- Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139.
| | - Yuan Luo
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611.
| |
|
31
|
Li Y, Yao L, Mao C, Srivastava A, Jiang X, Luo Y. Early Prediction of Acute Kidney Injury in Critical Care Setting Using Clinical Notes. PROCEEDINGS. IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE 2018; 2018:683-686. [PMID: 33376624 PMCID: PMC7768909 DOI: 10.1109/bibm.2018.8621574] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Indexed: 01/28/2023]
Abstract
Acute kidney injury (AKI) in critically ill patients is associated with significant morbidity and mortality. Development of novel methods to identify patients with AKI earlier will allow for testing of novel strategies to prevent or reduce the complications of AKI. We developed data-driven prediction models to estimate the risk of new AKI onset. We generated models from clinical notes within the first 24 hours following intensive care unit (ICU) admission extracted from Medical Information Mart for Intensive Care III (MIMIC-III). From the clinical notes, we generated clinically meaningful word and concept representations and embeddings, respectively. Five supervised learning classifiers and a knowledge-guided deep learning architecture were used to construct prediction models. The best configuration yielded a competitive AUC of 0.779. Our work suggests that natural language processing of clinical notes can be applied to assist clinicians in identifying the risk of incident AKI onset in critically ill patients upon admission to the ICU.
Affiliation(s)
- Yikuan Li
- Dept. of EECS, Northwestern University, Evanston, IL, U.S.A
| | | | - Chengsheng Mao
- Dept. of Preventive Medicine, Northwestern University, Chicago, IL, U.S.A
| | - Anand Srivastava
- Div. of Nephrology and Hypertension, Northwestern University, Chicago, IL, U.S.A
| | - Xiaoqian Jiang
- School of Biomedical Informatics, Univ. of Texas Health Science Center, Houston, TX, U.S.A
| | - Yuan Luo
- Dept. of Preventive Medicine, Northwestern University, Chicago, IL, U.S.A
| |
|
32
|
Kilicoglu H. Biomedical text mining for research rigor and integrity: tasks, challenges, directions. Brief Bioinform 2018; 19:1400-1414. [PMID: 28633401 PMCID: PMC6291799 DOI: 10.1093/bib/bbx057] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 02/07/2017] [Revised: 04/10/2017] [Indexed: 01/01/2023]
Abstract
An estimated quarter of a trillion US dollars is invested in the biomedical research enterprise annually. There is growing alarm that a significant portion of this investment is wasted because of problems in reproducibility of research findings and in the rigor and integrity of research conduct and reporting. Recent years have seen a flurry of activities focusing on standardization and guideline development to enhance the reproducibility and rigor of biomedical research. Research activity is primarily communicated via textual artifacts, ranging from grant applications to journal publications. These artifacts can be both the source and the manifestation of practices leading to research waste. For example, an article may describe a poorly designed experiment, or the authors may reach conclusions not supported by the evidence presented. In this article, we pose the question of whether biomedical text mining techniques can assist the stakeholders in the biomedical research enterprise in doing their part toward enhancing research integrity and rigor. In particular, we identify four key areas in which text mining techniques can make a significant contribution: plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload and accurate citation/enhanced bibliometrics. We review the existing methods and tools for specific tasks, if they exist, or discuss relevant research that can provide guidance for future work. With the exponential increase in biomedical research output and the ability of text mining approaches to perform automatic tasks at large scale, we propose that such approaches can support tools that promote responsible research practices, providing significant benefits for the biomedical research enterprise.
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, US National Library of Medicine
| |
|
33
|
Xing W, Qi J, Yuan X, Li L, Zhang X, Fu Y, Xiong S, Hu L, Peng J. A gene-phenotype relationship extraction pipeline from the biomedical literature using a representation learning approach. Bioinformatics 2018; 34:i386-i394. [PMID: 29950017 PMCID: PMC6022650 DOI: 10.1093/bioinformatics/bty263] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Indexed: 01/01/2023]
Abstract
Motivation: The fundamental challenge of modern genetic analysis is to establish gene-phenotype correlations, which are often found in large-scale publications. Because the lexical features of genes are relatively regular in text, the main challenge of this relation extraction is phenotype recognition. Because phenotypic descriptions are often study- or author-specific, few lexicons can be used to effectively identify the entire range of phenotypic expressions in text, especially for plants. Results: We propose a pipeline for extracting phenotypes, genes and their relations from the biomedical literature. Combining abbreviation revision and sentence template extraction, we improved the unsupervised word-embedding-to-sentence-embedding cascaded approach as representation learning to recognize the broad variety of phenotypic information in the literature. In addition, a dictionary- and rule-based method was applied for gene recognition. Finally, we integrated OLLIE, a well-known information extraction system, to identify gene-phenotype relations. To demonstrate the applicability of the pipeline, we set up two types of comparison experiments using the model organism Arabidopsis thaliana. In the comparison with state-of-the-art baselines, our approach obtained the best performance (F1-measure of 66.83%). We also applied the pipeline to 481 full articles from the TAIR gene-phenotype manual relationship dataset to test its validity. The results showed that our proposed pipeline can cover 70.94% of the original dataset and add 373 new relations to expand it. Availability and implementation: The source code is available at http://www.wutbiolab.cn:82/Gene-Phenotype-Relation-Extraction-Pipeline.zip. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Wenhui Xing
- School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
| | - Junsheng Qi
- Department of Plant Science, College of Biological Science, China Agricultural University, Beijing, China
| | - Xiaohui Yuan
- School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
| | - Lin Li
- School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
| | - Xiaoyu Zhang
- Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics-Huazhong University of Science and Technology, Wuhan, China
| | - Yuhua Fu
- School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
| | - Shengwu Xiong
- School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
| | - Lun Hu
- School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
| | - Jing Peng
- School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
| |
|
34
|
Bakal G, Talari P, Kakani EV, Kavuluru R. Exploiting semantic patterns over biomedical knowledge graphs for predicting treatment and causative relations. J Biomed Inform 2018; 82:189-199. [PMID: 29763706 PMCID: PMC6070294 DOI: 10.1016/j.jbi.2018.05.003] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 08/09/2017] [Revised: 01/31/2018] [Accepted: 05/09/2018] [Indexed: 01/27/2023]
Abstract
BACKGROUND Identifying new potential treatment options for medical conditions that cause human disease burden is a central task of biomedical research. Since all candidate drugs cannot be tested with animal and clinical trials, in vitro approaches are first attempted to identify promising candidates. Likewise, identifying different causal relations between biomedical entities is also critical to understand biomedical processes. Generally, natural language processing (NLP) and machine learning are used to predict specific relations between any given pair of entities using the distant supervision approach. OBJECTIVE To build high accuracy supervised predictive models to predict previously unknown treatment and causative relations between biomedical entities based only on semantic graph pattern features extracted from biomedical knowledge graphs. METHODS We used 7000 treats and 2918 causes hand-curated relations from the UMLS Metathesaurus to train and test our models. Our graph pattern features are extracted from simple paths connecting biomedical entities in the SemMedDB graph (based on the well-known SemMedDB database made available by the U.S. National Library of Medicine). Using these graph patterns connecting biomedical entities as features of logistic regression and decision tree models, we computed mean performance measures (precision, recall, F-score) over 100 distinct 80-20% train-test splits of the datasets. For all experiments, we used a positive:negative class imbalance of 1:10 in the test set to model relatively more realistic scenarios. RESULTS Our models predict treats and causes relations with high F-scores of 99% and 90% respectively. Logistic regression model coefficients also help us identify highly discriminative patterns that have an intuitive interpretation. We are also able to predict some new plausible relations based on false positives that our models scored highly based on our collaborations with two physician co-authors. 
Finally, our decision tree models are able to retrieve over 50% of treatment relations from a recently created external dataset. CONCLUSIONS We employed semantic graph patterns connecting pairs of candidate biomedical entities in a knowledge graph as features to predict treatment/causative relations between them. We provide what we believe is the first evidence in direct prediction of biomedical relations based on graph features. Our work complements lexical pattern based approaches in that the graph patterns can be used as additional features for weakly supervised relation prediction.
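The idea of graph pattern features drawn from simple paths can be sketched on a toy graph. This is an assumption-laden illustration rather than the authors' feature extractor: edges are treated as directed, the maximum path length is arbitrary, and every entity and predicate name below is invented.

```python
from collections import Counter

def path_patterns(edges, source, target, max_len=3):
    """Count predicate sequences along simple (no repeated node)
    directed paths from source to target, up to max_len edges."""
    adj = {}
    for subj, pred, obj in edges:
        adj.setdefault(subj, []).append((pred, obj))
    patterns = Counter()

    def walk(node, visited, preds):
        if node == target and preds:
            patterns[tuple(preds)] += 1
            return
        if len(preds) == max_len:
            return
        for pred, nxt in adj.get(node, []):
            if nxt not in visited:
                walk(nxt, visited | {nxt}, preds + [pred])

    walk(source, {source}, [])
    return patterns

# Toy knowledge graph (invented triples).
edges = [
    ("drugA", "INHIBITS", "geneX"),
    ("geneX", "ASSOCIATED_WITH", "diseaseY"),
    ("drugA", "INTERACTS_WITH", "drugB"),
    ("drugB", "TREATS", "diseaseY"),
    ("drugB", "INHIBITS", "geneZ"),
]
feats = path_patterns(edges, "drugA", "diseaseY")
print(feats)
```

Each distinct predicate sequence becomes one feature dimension, and the counts for a candidate entity pair form the vector passed to a classifier such as logistic regression.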
Affiliation(s)
- Gokhan Bakal
- Department of Computer Science, University of Kentucky, United States.
| | - Preetham Talari
- Division of Hospital Medicine, Department of Internal Medicine, University of Kentucky, United States.
| | - Elijah V Kakani
- Division of Hospital Medicine, Department of Internal Medicine, University of Kentucky, United States.
| | - Ramakanth Kavuluru
- Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, United States; Department of Computer Science, University of Kentucky, United States.
| |
|
35
|
Maldonado R, Goodwin TR, Harabagiu SM. Memory-Augmented Active Deep Learning for Identifying Relations Between Distant Medical Concepts in Electroencephalography Reports. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2018; 2017:156-165. [PMID: 29888063 PMCID: PMC5961777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2022]
Abstract
The automatic identification of relations between medical concepts in a large corpus of Electroencephalography (EEG) reports is an important step in the development of an EEG-specific patient cohort retrieval system as well as in the acquisition of EEG-specific knowledge from this corpus. EEG-specific relations involve medical concepts that are not typically mentioned in the same sentence or even the same section of a report, thus requiring extraction techniques that can handle such long-distance dependencies. To address this challenge, we present a novel framework which combines the advantages of a deep learning framework employing Dynamic Relational Memory (DRM) with active learning. While DRM enables the prediction of long-distance relations, active learning provides a mechanism for accurately identifying relations with minimal training data, obtaining a 5-fold cross-validation F1 score of 0.7475 on a set of 140 EEG reports selected with active learning. The results obtained with our novel framework show great promise.
|
36
|
Luo Y. Recurrent neural networks for classifying relations in clinical notes. J Biomed Inform 2017; 72:85-95. [PMID: 28694119 PMCID: PMC6657689 DOI: 10.1016/j.jbi.2017.07.006] [Citation(s) in RCA: 85] [Impact Index Per Article: 12.1] [Received: 03/15/2017] [Revised: 06/13/2017] [Accepted: 07/06/2017] [Indexed: 01/16/2023]
Abstract
We proposed the first models based on recurrent neural networks (more specifically Long Short-Term Memory - LSTM) for classifying relations from clinical notes. We tested our models on the i2b2/VA relation classification challenge dataset. We showed that our segment LSTM model, with only word embedding features and no manual feature engineering, achieved a micro-averaged f-measure of 0.661 for classifying medical problem-treatment relations, 0.800 for medical problem-test relations, and 0.683 for medical problem-medical problem relations. These results are comparable to those of the state-of-the-art systems on the i2b2/VA relation classification challenge. We compared the segment LSTM model with the sentence LSTM model, and demonstrated the benefits of exploring the difference between concept text and context text, and between different contextual parts in the sentence. We also evaluated the impact of word embedding on the performance of LSTM models and showed that medical domain word embeddings help improve the relation classification. These results support the use of LSTM models for classifying relations between medical concepts, as they show comparable performance to previously published systems while requiring no manual feature engineering.
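For readers unfamiliar with the micro-averaged f-measure reported here, a minimal sketch follows. The per-class counts are invented and do not come from the i2b2/VA dataset; the function name and the relation labels are illustrative assumptions. Micro-averaging pools true positives, false positives, and false negatives over all classes before computing precision and recall.

```python
def micro_f1(counts):
    """counts maps class label -> (tp, fp, fn). Pool the counts over
    all classes, then compute precision, recall, and F1 once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented counts for two hypothetical relation classes.
counts = {
    "relation_A": (8, 2, 2),
    "relation_B": (1, 1, 3),
}
score = micro_f1(counts)
print(round(score, 4))
```

Because counts are pooled, frequent classes dominate the micro-average, which is why per-category scores are usually reported alongside it.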
Affiliation(s)
- Yuan Luo
- Department of Preventive Medicine, Division of Health and Biomedical Informatics, Northwestern University, Chicago, IL, United States.
| |
|
37
|
Eftimov T, Koroušić Seljak B, Korošec P. A rule-based named-entity recognition method for knowledge extraction of evidence-based dietary recommendations. PLoS One 2017. [PMID: 28644863 PMCID: PMC5482438 DOI: 10.1371/journal.pone.0179488] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Indexed: 02/06/2023]
Abstract
Evidence-based dietary information represented as unstructured text is crucial information that needs to be accessed in order to help dietitians keep up with the new knowledge that arrives daily in newly published scientific reports. Different named-entity recognition (NER) methods have been introduced previously to extract useful information from the biomedical literature. They focus on, for example, extracting gene mentions, protein mentions, relationships between genes and proteins, chemical concepts, and relationships between drugs and diseases. In this paper, we present a novel NER method, called drNER, for knowledge extraction of evidence-based dietary information. To the best of our knowledge, this is the first attempt at extracting dietary concepts. DrNER is a rule-based NER that consists of two phases. The first one involves the detection and determination of the entity mentions, and the second one involves the selection and extraction of the entities. We evaluate the method by using text corpora from heterogeneous sources, including text from several scientifically validated web sites and text from scientific publications. Evaluation of the method showed that drNER gives good results and can be used for knowledge extraction of evidence-based dietary recommendations.
Affiliation(s)
- Tome Eftimov
- Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
- Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
| | | | - Peter Korošec
- Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
- Faculty of Mathematics, Natural Science and Information Technologies, Koper, Slovenia
| |
|
38
|
|
39
|
Zhang Y, Jiang M, Wang J, Xu H. Semantic Role Labeling of Clinical Text: Comparing Syntactic Parsers and Features. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2017; 2016:1283-1292. [PMID: 28269926 PMCID: PMC5333340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
Semantic role labeling (SRL), which extracts shallow semantic relation representation from different surface textual forms of free text sentences, is important for understanding clinical narratives. Since semantic roles are formed by syntactic constituents in the sentence, an effective parser and an effective syntactic feature set are essential to build a practical SRL system. Our study initiates a formal evaluation and comparison of SRL performance on the clinical text corpus MiPACQ, using three state-of-the-art parsers: the Stanford parser, the Berkeley parser, and the Charniak parser. First, the original parsers trained on the open domain syntactic corpus Penn Treebank were employed. Next, those parsers were retrained on the clinical Treebank of MiPACQ for further comparison. Additionally, state-of-the-art syntactic features from open domain SRL were also examined for clinical text. Experimental results showed that retraining the parsers on the clinical Treebank improved the performance significantly, with an optimal F1 measure of 71.41% achieved by the Berkeley parser.
Affiliation(s)
- Yaoyun Zhang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Min Jiang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Jingqi Wang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| |
|
40
|
Luo Y, Szolovits P. Efficient Queries of Stand-off Annotations for Natural Language Processing on Electronic Medical Records. BIOMEDICAL INFORMATICS INSIGHTS 2016; 8:29-38. [PMID: 27478379 PMCID: PMC4954589 DOI: 10.4137/bii.s38916] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 05/04/2016] [Revised: 06/13/2016] [Accepted: 06/22/2016] [Indexed: 11/07/2022]
Abstract
In natural language processing, stand-off annotation uses the starting and ending positions of an annotation to anchor it to the text and stores the annotation content separately from the text. We address the fundamental problem of efficiently storing stand-off annotations when applying natural language processing on narrative clinical notes in electronic medical records (EMRs) and efficiently retrieving such annotations that satisfy position constraints. Efficient storage and retrieval of stand-off annotations can facilitate tasks such as mapping unstructured text to electronic medical record ontologies. We first formulate this problem into the interval query problem, for which optimal query/update time is in general logarithmic. We next perform a tight time complexity analysis on the basic interval tree query algorithm and show its nonoptimality when being applied to a collection of 13 query types from Allen’s interval algebra. We then study two closely related state-of-the-art interval query algorithms, and propose query reformulations and augmentations to the second algorithm. Our proposed algorithm achieves logarithmic time stabbing-max query time complexity and solves the stabbing-interval query tasks on all of Allen’s relations in logarithmic time, attaining the theoretic lower bound. Updating time is kept logarithmic and the space requirement is kept linear at the same time. We also discuss interval management in external memory models and higher dimensions.
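A stand-off annotation and a stabbing query can be made concrete with a small sketch. The class, the labels, and the sentence are invented for illustration, and the query shown is a plain linear scan, not the logarithmic-time algorithm the abstract proposes; it only shows what the query is meant to return.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    start: int  # inclusive character offset into the text
    end: int    # exclusive character offset into the text
    label: str  # annotation content, stored apart from the text

def stabbing_query(annotations, pos):
    """Return all annotations whose span covers character offset pos.
    Linear-time baseline, O(n) per query."""
    return [a for a in annotations if a.start <= pos < a.end]

text = "Patient denies chest pain."
problem = Annotation(15, 25, "PROBLEM")
scope = Annotation(8, 25, "NEGATION_SCOPE")
anns = [problem, scope]

print(text[problem.start:problem.end])
print([a.label for a in stabbing_query(anns, 16)])
print([a.label for a in stabbing_query(anns, 9)])
```

Because the annotations hold only offsets and labels, the note text itself is never modified, which is the property that makes stand-off storage attractive for clinical narratives.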
Affiliation(s)
- Yuan Luo
- Assistant Professor, Department of Preventive Medicine, Northwestern University, Chicago, IL, USA
| | - Peter Szolovits
- Professor, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
| |
|