1
Nastou K, Mehryary F, Ohta T, Luoma J, Pyysalo S, Jensen LJ. RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature. Database (Oxford) 2024; 2024:baae095. [PMID: 39265993 PMCID: PMC11394941 DOI: 10.1093/database/baae095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Revised: 07/31/2024] [Accepted: 08/16/2024] [Indexed: 09/14/2024]
Abstract
In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and allows for achieving noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome's potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems to tackle biomedical tasks. Finally, a run of the trained RE system on all PubMed abstracts and PMC Open Access full-text documents resulted in >18 million relations, extracted from the entire biomedical literature.
Affiliation(s)
- Katerina Nastou
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3, Copenhagen 2200, Denmark
- Farrokh Mehryary
  - TurkuNLP Group, Department of Computing, University of Turku, Vesilinnantie 5, Turku 20014, Finland
- Tomoko Ohta
  - Textimi, 1-37-13 Kitazawa, Tokyo, Setagaya-ku 155-0031, Japan
- Jouni Luoma
  - TurkuNLP Group, Department of Computing, University of Turku, Vesilinnantie 5, Turku 20014, Finland
- Sampo Pyysalo
  - TurkuNLP Group, Department of Computing, University of Turku, Vesilinnantie 5, Turku 20014, Finland
- Lars Juhl Jensen
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3, Copenhagen 2200, Denmark
2
Sikirzhytskaya A, Tyagin I, Sutton SS, Wyatt MD, Safro I, Shtutman M. AI-based mining of biomedical literature: Applications for drug repurposing for the treatment of dementia. RESEARCH SQUARE 2024:rs.3.rs-4750719. [PMID: 39184100 PMCID: PMC11343300 DOI: 10.21203/rs.3.rs-4750719/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/27/2024]
Abstract
Neurodegenerative pathologies such as Alzheimer's disease, Parkinson's disease, Huntington's disease, Amyotrophic lateral sclerosis, Multiple sclerosis, HIV-associated neurocognitive disorder, and others significantly affect individuals, their families, caregivers, and healthcare systems. While there are no cures yet, researchers worldwide are actively working on the development of novel treatments that have the potential to slow disease progression, alleviate symptoms, and ultimately improve the overall health of patients. Huge volumes of new scientific information necessitate new analytical approaches for meaningful hypothesis generation. To enable the automatic analysis of biomedical data we introduced AGATHA, an effective AI-based literature mining tool that can navigate massive scientific literature databases, such as PubMed. The overarching goal of this effort is to adapt AGATHA for drug repurposing by revealing hidden connections between FDA-approved medications and a health condition of interest. Our tool converts the abstracts of peer-reviewed papers from PubMed into multidimensional space where each gene and health condition are represented by specific metrics. We implemented advanced statistical analysis to reveal distinct clusters of scientific terms within the virtual space created using AGATHA-calculated parameters for selected health conditions and genes. Partial Least Squares Discriminant Analysis was employed for categorizing and predicting samples (122 diseases and 20889 genes) fitted to specific classes. Advanced statistics were employed to build a discrimination model and extract lists of genes specific to each disease class. Here we focus on drugs that can be repurposed for dementia treatment as an outcome of neurodegenerative diseases. Therefore, we determined dementia-associated genes statistically highly ranked in other disease classes. Additionally, we report a mechanism for detecting genes common to multiple health conditions. 
These sets of genes were classified based on their presence in biological pathways, aiding in selecting candidates and biological processes that are exploitable with drug repurposing.
3
Sikirzhytskaya A, Tyagin I, Sutton SS, Wyatt MD, Safro I, Shtutman M. AI-based mining of biomedical literature: Applications for drug repurposing for the treatment of dementia. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.06.597745. [PMID: 38895485 PMCID: PMC11185689 DOI: 10.1101/2024.06.06.597745] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
Neurodegenerative pathologies such as Alzheimer's disease, Parkinson's disease, Huntington's disease, Amyotrophic lateral sclerosis, Multiple sclerosis, HIV-associated neurocognitive disorder, and others significantly affect individuals, their families, caregivers, and healthcare systems. While there are no cures yet, researchers worldwide are actively working on the development of novel treatments that have the potential to slow disease progression, alleviate symptoms, and ultimately improve the overall health of patients. Huge volumes of new scientific information necessitate new analytical approaches for meaningful hypothesis generation. To enable the automatic analysis of biomedical data we introduced AGATHA, an effective AI-based literature mining tool that can navigate massive scientific literature databases, such as PubMed. The overarching goal of this effort is to adapt AGATHA for drug repurposing by revealing hidden connections between FDA-approved medications and a health condition of interest. Our tool converts the abstracts of peer-reviewed papers from PubMed into multidimensional space where each gene and health condition are represented by specific metrics. We implemented advanced statistical analysis to reveal distinct clusters of scientific terms within the virtual space created using AGATHA-calculated parameters for selected health conditions and genes. Partial Least Squares Discriminant Analysis was employed for categorizing and predicting samples (122 diseases and 20889 genes) fitted to specific classes. Advanced statistics were employed to build a discrimination model and extract lists of genes specific to each disease class. Here we focus on drugs that can be repurposed for dementia treatment as an outcome of neurodegenerative diseases. Therefore, we determined dementia-associated genes statistically highly ranked in other disease classes. Additionally, we report a mechanism for detecting genes common to multiple health conditions. 
These sets of genes were classified based on their presence in biological pathways, aiding in selecting candidates and biological processes that are exploitable with drug repurposing.
Author Summary
This manuscript outlines our project involving the application of AGATHA, an AI-based literature mining tool, to discover drugs with the potential for repurposing in the context of neurocognitive disorders. The primary objective is to identify connections between approved medications and specific health conditions through advanced statistical analysis, including techniques like Partial Least Squares Discriminant Analysis (PLSDA) and unsupervised clustering. The methodology involves grouping scientific terms related to different health conditions and genes, followed by building discrimination models to extract lists of disease-specific genes. These genes are then analyzed through pathway analysis to select candidates for drug repurposing.
4
Nachtegael C, De Stefani J, Cnudde A, Lenaerts T. DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations. Database (Oxford) 2024; 2024:baae039. [PMID: 38805753 PMCID: PMC11131422 DOI: 10.1093/database/baae039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 04/17/2024] [Accepted: 05/13/2024] [Indexed: 05/30/2024]
Abstract
While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite the reports in literature that epistatic effects between combinations of variants in different loci (or genes) are important to understand disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared to train tools to help in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation, thus getting assistance in finding the most informative subset of samples to label. By pre-annotating 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) with PubTator, text fragments featuring potential digenic variant combinations, i.e. gene-variant-gene-variant, were extracted. The resulting fragments of texts were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, is used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, finally resulting in a dataset with 8442 fragments, 794 of them being positive instances, covering 95% of the original annotated articles. When applied to gene-variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, demonstrating significant improvement compared to the non-fine-tuned model, underlining the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE dataset relevant for biomedical curation applications. 
DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571.
Affiliation(s)
- Charlotte Nachtegael
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Boulevard du Triomphe, CP 263, Brussels 1050, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
- Jacopo De Stefani
  - Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
- Anthony Cnudde
  - Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
  - Pharmacologie, Pharmacothérapie et Suivi Pharmaceutique, Université Libre de Bruxelles, Boulevard du Triomphe, CP 205, Brussels 1050, Belgium
- Tom Lenaerts
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Boulevard du Triomphe, CP 263, Brussels 1050, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium
  - Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Pleinlaan 2, Brussels 1050, Belgium
5
Huang MS, Han JC, Lin PY, You YT, Tsai RTH, Hsu WL. Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource. Brief Bioinform 2024; 25:bbae132. [PMID: 38609331 PMCID: PMC11014787 DOI: 10.1093/bib/bbae132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 11/06/2023] [Accepted: 03/02/2023] [Indexed: 04/14/2024]
Abstract
Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD's compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate several language models' performances on the PEDD. This paper's outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.
Affiliation(s)
- Ming-Siang Huang
  - Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
  - National Institute of Cancer Research, National Health Research Institutes, Tainan, Taiwan
  - Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
- Jen-Chieh Han
  - Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Pei-Yen Lin
  - Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Yu-Ting You
  - Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Richard Tzong-Han Tsai
  - Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
  - Center for Geographic Information Science, Research Center for Humanities and Social Sciences, Academia Sinica, Taipei, Taiwan
- Wen-Lian Hsu
  - Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
  - Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
6
Lai PT, Wei CH, Luo L, Chen Q, Lu Z. BioREx: Improving biomedical relation extraction by leveraging heterogeneous datasets. J Biomed Inform 2023; 146:104487. [PMID: 37673376 DOI: 10.1016/j.jbi.2023.104487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Revised: 08/18/2023] [Accepted: 09/02/2023] [Indexed: 09/08/2023]
Abstract
Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.
Affiliation(s)
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
- Ling Luo
  - School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China
- Qingyu Chen
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894 Bethesda, USA
7
Lai PT, Wei CH, Luo L, Chen Q, Lu Z. BioREx: Improving Biomedical Relation Extraction by Leveraging Heterogeneous Datasets. ARXIV 2023:arXiv:2306.11189v1. [PMID: 37502629 PMCID: PMC10370213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.
Affiliation(s)
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894, Bethesda, USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894, Bethesda, USA
- Ling Luo
  - School of Computer Science and Technology, Dalian University of Technology, 116024, Dalian, China
- Qingyu Chen
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894, Bethesda, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894, Bethesda, USA
8
Tong Y, Tan F, Huang H, Zhang Z, Zong H, Xie Y, Huang D, Cheng S, Wei Z, Fang M, Crabbe MJC, Wang Y, Zhang X. ViMRT: a text-mining tool and search engine for automated virus mutation recognition. Bioinformatics 2022; 39:6808671. [PMID: 36342236 PMCID: PMC9805560 DOI: 10.1093/bioinformatics/btac721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2022] [Revised: 10/24/2022] [Accepted: 11/04/2022] [Indexed: 11/09/2022]
Abstract
MOTIVATION: Virus mutation is one of the most important research issues which plays a critical role in disease progression and has prompted substantial scientific publications. Mutation extraction from published literature has become an increasingly important task, benefiting many downstream applications such as vaccine design and drug usage. However, most existing approaches have low performances in extracting virus mutation due to both lack of precise virus mutation information and their development based on human gene mutations. RESULTS: We developed ViMRT, a text-mining tool and search engine for automated virus mutation recognition using natural language processing. ViMRT mainly developed 8 optimized rules and 12 regular expressions based on a development dataset comprising 830 papers of 5 human severe disease-related viruses. It achieved higher performance than other tools in a test dataset (1662 papers, 99.17% in F1-score) and has been applied well to two other viruses, influenza virus and severe acute respiratory syndrome coronavirus-2 (212 papers, 96.99% in F1-score). These results indicate that ViMRT is a high-performance method for the extraction of virus mutation from the biomedical literature. Besides, we present a search engine for researchers to quickly find and accurately search virus mutation-related information including virus genes and related diseases. AVAILABILITY AND IMPLEMENTATION: ViMRT software is freely available at http://bmtongji.cn:1225/mutation/index.
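The abstract above describes rule- and regular-expression-based mutation recognition without listing the rules themselves. As a rough illustration of the general technique only, the following minimal sketch matches one common notation for amino acid substitutions (wild-type residue, position, mutant residue, such as "A181T"). The pattern `POINT_MUTATION` and the function `find_point_mutations` are hypothetical names chosen for this example and are not taken from ViMRT.

```python
import re

# One-letter codes for the 20 standard amino acids.
AA1 = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical pattern (an assumption for illustration, not a ViMRT rule):
# wild-type residue, a 1-4 digit position, mutant residue, as a whole token.
POINT_MUTATION = re.compile(rf"\b([{AA1}])([1-9][0-9]{{0,3}})([{AA1}])\b")


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples found in the text."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in POINT_MUTATION.finditer(text)
    ]


sample = "The A181T and M204V substitutions were observed in the polymerase."
print(find_point_mutations(sample))
```

A single pattern like this covers only one mention format; real tools typically combine many patterns with filtering rules to handle three-letter codes, prose descriptions, and tokens that merely resemble mutations.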
Affiliation(s)
- Yuantao Tong
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Fanglin Tan
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Honglian Huang
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Zeyu Zhang
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Hui Zong
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Yujia Xie
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Danqi Huang
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Shiyang Cheng
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Ziyi Wei
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Meng Fang
  - Department of Laboratory Medicine, Shanghai Eastern Hepatobiliary Surgery Hospital, Shanghai 200438, China
- M James C Crabbe
  - Wolfson College, Oxford University, Oxford OX2 6UD, UK
  - Institute of Biomedical and Environmental Science & Technology, University of Bedfordshire, Luton LU1 3JU, UK
  - School of Life Sciences, Shanxi University, Taiyuan 030006, China
- Ying Wang
  - To whom correspondence should be addressed.
9
Luo L, Lai PT, Wei CH, Arighi CN, Lu Z. BioRED: a rich biomedical relation extraction dataset. Brief Bioinform 2022; 23:6645993. [PMID: 35849818 PMCID: PMC9487702 DOI: 10.1093/bib/bbac282] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 04/17/2022] [Revised: 06/02/2022] [Accepted: 06/19/2022] [Indexed: 11/13/2022]
Abstract
Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.
Affiliation(s)
- Ling Luo
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
10
Goto A, Rodriguez-Esteban R, Scharf SH, Morris GM. Understanding the genetics of viral drug resistance by integrating clinical data and mining of the scientific literature. Sci Rep 2022; 12:14476. [PMID: 36008431 PMCID: PMC9403226 DOI: 10.1038/s41598-022-17746-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2021] [Accepted: 07/30/2022] [Indexed: 11/16/2022]
Abstract
Drug resistance caused by mutations is a public health threat for existing and emerging viral diseases. A wealth of evidence about these mutations and their clinically associated phenotypes is scattered across the literature, but a comprehensive perspective is usually lacking. This work aimed to produce a clinically relevant view for the case of Hepatitis B virus (HBV) mutations by combining a chronic HBV clinical study with a compendium of genetic mutations systematically gathered from the scientific literature. We enriched clinical mutation data by systematically mining 2,472,725 scientific articles from PubMed Central in order to gather information about the HBV mutational landscape. By performing this analysis, we were able to identify mutational hotspots for each HBV genotype (A-E) and gene (C, X, P, S), as well as the location of disulfide bonds associated with these mutations. Through a modelling study, we also identified a mutation position common in both the clinical data and the literature that is located at the binding pocket for a known anti-HBV drug, namely entecavir. The results of this novel approach show the potential of integrated analyses to assist in the development of new drugs for viral diseases that are more robust to resistance. Such analyses should be of particular interest due to the increasing importance of viral resistance in established and emerging viruses, such as for newly developed drugs against SARS-CoV-2.
Affiliation(s)
- An Goto
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, UK
| | - Garrett M Morris
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, UK.
|
11
|
Becker TE, Jakobsson E. ResidueFinder: extracting individual residue mentions from protein literature. J Biomed Semantics 2021; 12:14. [PMID: 34289903 PMCID: PMC8293528 DOI: 10.1186/s13326-021-00243-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2019] [Accepted: 05/07/2021] [Indexed: 11/10/2022] Open
Abstract
Background The revolution in molecular biology has shown how protein function and structure are based on specific sequences of amino acids. Thus, an important feature in many papers is the mention of the significance of individual amino acids in the context of the entire sequence of the protein. MutationFinder is a widely used program for finding mentions of specific mutations in texts. We report on augmenting the positive attributes of MutationFinder with a more inclusive regular expression list to create ResidueFinder, which finds mentions of native amino acids as well as mutations. We also consider parameter options for both ResidueFinder and MutationFinder to explore trade-offs between precision, recall, and computational efficiency. We test our methods and software in full text as well as abstracts. Results We find there is much more variety of formats for mentioning residues in the entire text of papers than in abstracts alone. Failure to take these multiple formats into account results in many false negatives in the program. Since MutationFinder, like several other programs, was primarily tested on abstracts, we found it necessary to build an expanded regular expression list to achieve acceptable recall in full text searches. We also discovered a number of artifacts arising from PDF to text conversion, which we wrote elements in the regular expression library to address. Taking into account those factors resulted in high recall on randomly selected primary research articles. We also developed a streamlined regular expression (called “cut”) which enables a several hundredfold speedup in both MutationFinder and ResidueFinder with only a modest compromise of recall. All regular expressions were tested using expanded F-measure statistics, i.e., we compute Fβ for various values of β, where the larger the value of β, the more recall is weighted, and the smaller the value of β, the more precision is weighted.
Conclusions ResidueFinder is a simple, effective, and efficient program for finding individual residue mentions in primary literature starting with text files, implemented in Python, and available in SourceForge.net. The most computationally efficient versions of ResidueFinder could enable creation and maintenance of a database of residue mentions encompassing all articles in PubMed. Supplementary Information The online version contains supplementary material available at 10.1186/s13326-021-00243-3.
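The two ideas in this abstract, regular-expression matching of residue mentions and Fβ scoring, can be illustrated with a minimal Python sketch. The pattern below is a deliberately simplified, hypothetical one and is not ResidueFinder's or MutationFinder's actual expression list; the names `RESIDUE_RE`, `find_residues`, and `f_beta` are assumptions introduced only for this illustration.

```python
import re

# Hypothetical, simplified pattern (NOT the ResidueFinder expression list):
# a wild-type residue in three-letter or one-letter code, a position, and
# optionally a mutant residue, e.g. "Ala146Thr", "A146T", or "Ser65".
THREE = "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
ONE = "[ACDEFGHIKLMNPQRSTVWY]"
RESIDUE_RE = re.compile(
    rf"\b(?P<wt>{THREE}|{ONE})(?P<pos>[1-9]\d*)(?P<mut>{THREE}|{ONE})?\b"
)


def find_residues(text):
    """Return (wild_type, position, mutant_or_None) tuples found in text."""
    return [(m["wt"], int(m["pos"]), m["mut"]) for m in RESIDUE_RE.finditer(text)]


def f_beta(precision, recall, beta=1.0):
    """Standard F-beta: beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom
```

A pattern this short will produce false positives on full text (for example on short identifiers that happen to look like a residue plus a number), which is the kind of precision and recall trade-off the abstract describes evaluating with several values of β.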
Affiliation(s)
- Ton E Becker
- Department of Molecular and Integrative Physiology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA
| | - Eric Jakobsson
- Department of Molecular and Integrative Physiology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA.
- Department of Biochemistry, Program in Biophysics and Computational Biology, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA.
|
12
|
Birgmeier J, Haeussler M, Deisseroth CA, Steinberg EH, Jagadeesh KA, Ratner AJ, Guturu H, Wenger AM, Diekhans ME, Stenson PD, Cooper DN, Ré C, Beggs AH, Bernstein JA, Bejerano G. AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature. Sci Transl Med 2021; 12:12/544/eaau9113. [PMID: 32434849 DOI: 10.1126/scitranslmed.aau9113] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2018] [Revised: 08/14/2019] [Accepted: 04/22/2020] [Indexed: 12/21/2022]
Abstract
The diagnosis of Mendelian disorders requires labor-intensive literature research. Trained clinicians can spend hours looking for the right publication(s) supporting a single gene that best explains a patient's disease. AMELIE (Automatic Mendelian Literature Evaluation) greatly accelerates this process. AMELIE parses all 29 million PubMed abstracts and downloads and further parses hundreds of thousands of full-text articles in search of information supporting the causality and associated phenotypes of most published genetic variants. AMELIE then prioritizes patient candidate variants for their likelihood of explaining any patient's given set of phenotypes. Diagnosis of singleton patients (without relatives' exomes) is the most time-consuming scenario, and AMELIE ranked the causative gene at the very top for 66% of 215 diagnosed singleton Mendelian patients from the Deciphering Developmental Disorders project. Evaluating only the top 11 AMELIE-scored genes of 127 (median) candidate genes per patient resulted in a rapid diagnosis in more than 90% of cases. AMELIE-based evaluation of all cases was 3 to 19 times more efficient than hand-curated database-based approaches. We replicated these results on a retrospective cohort of clinical cases from Stanford Children's Health and the Manton Center for Orphan Disease Research. An analysis web portal with our most recent update, programmatic interface, and code is available at AMELIE.stanford.edu.
Affiliation(s)
- Johannes Birgmeier
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Maximilian Haeussler
- Santa Cruz Genomics Institute, MS CBSE, University of California Santa Cruz, Santa Cruz, CA 95064, USA
| | - Cole A Deisseroth
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Ethan H Steinberg
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Karthik A Jagadeesh
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Alexander J Ratner
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Harendra Guturu
- Department of Pediatrics, Stanford School of Medicine, Stanford, CA 94305, USA
| | - Aaron M Wenger
- Department of Pediatrics, Stanford School of Medicine, Stanford, CA 94305, USA
| | - Mark E Diekhans
- Santa Cruz Genomics Institute, MS CBSE, University of California Santa Cruz, Santa Cruz, CA 95064, USA
| | - Peter D Stenson
- Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff, UK
| | - David N Cooper
- Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff, UK
| | - Christopher Ré
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Alan H Beggs
- Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children's Hospital, Harvard Medical School, Boston, MA 02115, USA
| | - Gill Bejerano
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA.
- Department of Pediatrics, Stanford School of Medicine, Stanford, CA 94305, USA.
- Department of Developmental Biology, Stanford University, Stanford, CA 94305, USA.
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA.
|
13
|
Lee K, Wei CH, Lu Z. Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. Brief Bioinform 2021; 22:bbaa142. [PMID: 32770181 PMCID: PMC8138883 DOI: 10.1093/bib/bbaa142] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2020] [Revised: 06/07/2020] [Accepted: 06/25/2020] [Indexed: 12/28/2022] Open
Abstract
MOTIVATION To obtain key information for personalized medicine and cancer research, clinicians and researchers in the biomedical field are in greater need than ever before of searching for genomic variant information in the biomedical literature. Due to the various written forms of genomic variants, however, it is difficult to locate the right information from the literature when using a general literature search system. To address the difficulty of locating genomic variant information from the literature, researchers have suggested various solutions based on automated literature-mining techniques. There is, however, no study summarizing and comparing existing tools for genomic variant literature mining in terms of how to search easily for information in the literature on genomic variants. RESULTS In this article, we systematically compared currently available genomic variant recognition and normalization tools as well as the literature search engines that adopted these literature-mining techniques. First, we explain the problems that are caused by the use of non-standard formats of genomic variants in the PubMed literature by considering examples from the literature and show the prevalence of the problem. Second, we review literature-mining tools that address the problem by recognizing and normalizing the various forms of genomic variants in the literature and systematically compare them. Third, we present and compare existing literature search engines that are designed for a genomic variant search by using the literature-mining techniques. We expect this work to be helpful for researchers who seek information about genomic variants from the literature and for developers who integrate genomic variant information from the literature and beyond.
Affiliation(s)
- Kyubum Lee
- National Center for Biotechnology Information
| | - Zhiyong Lu
- National Center for Biotechnology Information
|
14
|
Sharma B, Willis VC, Huettner CS, Beaty K, Snowdon JL, Xue S, South BR, Jackson GP, Weeraratne D, Michelini V. Predictive article recommendation using natural language processing and machine learning to support evidence updates in domain-specific knowledge graphs. JAMIA Open 2020; 3:332-337. [PMID: 33215067 PMCID: PMC7660962 DOI: 10.1093/jamiaopen/ooaa028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2020] [Revised: 05/26/2020] [Accepted: 06/19/2020] [Indexed: 11/14/2022] Open
Abstract
Objectives Describe an augmented intelligence approach to facilitate the update of evidence for associations in knowledge graphs. Methods New publications are filtered through multiple machine learning study classifiers, and filtered publications are combined with articles already included as evidence in the knowledge graph. The corpus is then subjected to named entity recognition, semantic dictionary mapping, term vector space modeling, pairwise similarity, and focal entity match to identify highly related publications. Subject matter experts review recommended articles to assess inclusion in the knowledge graph; discrepancies are resolved by consensus. Results Study classifiers achieved F-scores from 0.88 to 0.94, and similarity thresholds for each study type were determined by experimentation. Our approach reduces human literature review load by 99%, and over the past 12 months, 41% of recommendations were accepted to update the knowledge graph. Conclusion Integrated search and recommendation exploiting current evidence in a knowledge graph is useful for reducing human cognition load.
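The term vector space modeling and pairwise similarity steps mentioned in the Methods can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses raw term counts and cosine similarity, whereas the abstract does not say which weighting or similarity measure the authors used, and the function names (`term_vector`, `cosine`, `recommend`) and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(candidates, evidence, threshold):
    """Keep candidates whose best similarity to any evidence article meets the threshold."""
    evidence_vectors = [term_vector(e) for e in evidence]
    scored = []
    for doc in candidates:
        vec = term_vector(doc)
        score = max((cosine(vec, e) for e in evidence_vectors), default=0.0)
        if score >= threshold:
            scored.append((doc, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In the workflow described above, articles passing such a threshold would still go to subject matter experts for review; the sketch covers only the automated filtering step.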
Affiliation(s)
| | - Van C Willis
- IBM Watson Health, Cambridge, Massachusetts, USA
| | - Kirk Beaty
- IBM Watson Health, Cambridge, Massachusetts, USA
| | - Shang Xue
- IBM Watson Health, Cambridge, Massachusetts, USA
|
15
|
Saberian N, Shafi A, Peyvandipour A, Draghici S. MAGPEL: an autoMated pipeline for inferring vAriant-driven Gene PanEls from the full-length biomedical literature. Sci Rep 2020; 10:12365. [PMID: 32703994 PMCID: PMC7378213 DOI: 10.1038/s41598-020-68649-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2019] [Accepted: 06/17/2020] [Indexed: 11/09/2022] Open
Abstract
In spite of the efforts in developing and maintaining accurate variant databases, a large number of disease-associated variants are still hidden in the biomedical literature. Curation of the biomedical literature in an effort to extract this information is a challenging task due to: (i) the complexity of natural language processing, (ii) inconsistent use of standard recommendations for variant description, and (iii) the lack of clarity and consistency in describing the variant-genotype-phenotype associations in the biomedical literature. In this article, we employ text mining and word cloud analysis techniques to address these challenges. The proposed framework extracts the variant-gene-disease associations from the full-length biomedical literature and designs an evidence-based variant-driven gene panel for a given condition. We validate the identified genes by showing their diagnostic abilities to predict the patients' clinical outcome on several independent validation cohorts. As representative examples, we present our results for acute myeloid leukemia (AML), breast cancer and prostate cancer. We compare these panels with other variant-driven gene panels obtained from ClinVar, Mastermind and others from literature, as well as with a panel identified with a classical differentially expressed genes (DEGs) approach. The results show that the panels obtained by the proposed framework yield better results than the other gene panels currently available in the literature.
Affiliation(s)
- Nafiseh Saberian
- Department of Computer Science, Wayne State University, Detroit, MI, USA
| | - Adib Shafi
- Department of Computer Science, Wayne State University, Detroit, MI, USA
| | - Azam Peyvandipour
- Department of Computer Science, Wayne State University, Detroit, MI, USA
| | - Sorin Draghici
- Department of Computer Science, Wayne State University, Detroit, MI, USA.
- Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI, USA.
|
16
|
Bugnon LA, Yones C, Raad J, Gerard M, Rubiolo M, Merino G, Pividori M, Di Persia L, Milone DH, Stegmayer G. DL4papers: a deep learning approach for the automatic interpretation of scientific articles. Bioinformatics 2020; 36:3499-3506. [DOI: 10.1093/bioinformatics/btaa111] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2019] [Revised: 12/27/2019] [Accepted: 02/14/2020] [Indexed: 01/26/2023] Open
Abstract
Motivation
In precision medicine, next-generation sequencing and novel preclinical reports have led to an increasingly large number of results published in the scientific literature. However, identifying novel treatments or predicting a drug response in, for example, cancer patients, from the huge number of papers available remains a laborious and challenging task. This task can be considered a text mining problem that requires reading many academic documents to identify a small set of papers describing specific relations between key terms. Due to the infeasibility of the manual curation of these relations, computational methods that can automatically identify them from the available literature are urgently needed.
Results
We present DL4papers, a new method based on deep learning that is capable of analyzing and interpreting papers in order to automatically extract relevant relations between specific keywords. DL4papers receives as input a query with the desired keywords, and it returns a ranked list of papers that contain meaningful associations between the keywords. The comparison against related methods showed that our proposal outperformed them in a cancer corpus. The reliability of the DL4papers output list was also measured, revealing that 100% of the first two documents retrieved for a particular search have relevant relations, on average. This shows that our model can guarantee that in the top-2 papers of the ranked list, the relation can be effectively found. Furthermore, the model is capable of highlighting, within each document, the specific fragments that have the associations of the input keywords. This can be very useful in order to pay attention only to the highlighted text, instead of reading the full paper. We believe that our proposal could be used as an accurate tool for rapidly identifying relationships between genes and their mutations, drug responses and treatments in the context of a certain disease. This new approach can certainly be a very useful and valuable resource for the advancement of the precision medicine field.
Availability and implementation
A web-demo is available at: http://sinc.unl.edu.ar/web-demo/dl4papers/. Full source code and data are available at: https://sourceforge.net/projects/sourcesinc/files/dl4papers/.
Contact
lbugnon@sinc.unl.edu.ar
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- L A Bugnon
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
| | - C Yones
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
| | - J Raad
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
| | - M Gerard
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
| | - M Rubiolo
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
| | - G Merino
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
- Bioengineering and Bioinformatics Research and Development Institute, IBB, FIUNER-CONICET, Ruta Prov 11, Km 10.5, Oro Verde 3100, Argentina
| | - M Pividori
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA, USA
| | - L Di Persia
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
| | - D H Milone
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
| | - G Stegmayer
- Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
|
17
|
Allot A, Peng Y, Wei CH, Lee K, Phan L, Lu Z. LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res 2019; 46:W530-W536. [PMID: 29762787 PMCID: PMC6030971 DOI: 10.1093/nar/gky355] [Citation(s) in RCA: 74] [Impact Index Per Article: 14.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2018] [Accepted: 05/08/2018] [Indexed: 01/10/2023] Open
Abstract
The identification and interpretation of genomic variants play a key role in the diagnosis of genetic diseases and related research. These tasks increasingly rely on accessing relevant manually curated information from domain databases (e.g. SwissProt or ClinVar). However, due to the sheer volume of medical literature and high cost of expert curation, curated variant information in existing databases is often incomplete and out-of-date. In addition, the same genetic variant can be mentioned in publications with various names (e.g. ‘A146T’ versus ‘c.436G>A’ versus ‘rs121913527’). A search in PubMed using only one name usually cannot retrieve all relevant articles for the variant of interest. Hence, to help scientists, healthcare professionals, and database curators find the most up-to-date published variant research, we have developed LitVar for the search and retrieval of standardized variant information. In addition, LitVar uses advanced text mining techniques to compute and extract relationships between variants and other associated entities such as diseases and chemicals/drugs. LitVar is publicly available at https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/LitVar.
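The naming problem described in this abstract can be made concrete with a small Python sketch that recognizes which notation a variant mention uses. It is an assumption-laden illustration, not LitVar's method: the three patterns cover only the forms quoted in the abstract, and the names `PATTERNS` and `variant_format` are hypothetical.

```python
import re

# Hypothetical, deliberately narrow patterns for the three notations quoted above.
PATTERNS = [
    ("rsid", re.compile(r"^rs\d+$", re.IGNORECASE)),            # dbSNP identifier
    ("hgvs_cdna", re.compile(r"^c\.\d+[ACGT]>[ACGT]$")),        # coding DNA substitution
    ("protein_short", re.compile(
        r"^[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY]$"    # one-letter protein change
    )),
]


def variant_format(mention):
    """Return the name of the first notation the mention matches, else 'unknown'."""
    cleaned = mention.strip()
    for name, pattern in PATTERNS:
        if pattern.match(cleaned):
            return name
    return "unknown"
```

Recognizing the notation is only the first step. Deciding that the three strings refer to the same variant requires normalization against a reference resource, which this sketch does not attempt.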
Affiliation(s)
- Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Yifan Peng
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Kyubum Lee
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Lon Phan
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
|
18
|
Xu J, Yang P, Xue S, Sharma B, Sanchez-Martin M, Wang F, Beaty KA, Dehan E, Parikh B. Translating cancer genomics into precision medicine with artificial intelligence: applications, challenges and future perspectives. Hum Genet 2019; 138:109-124. [PMID: 30671672 PMCID: PMC6373233 DOI: 10.1007/s00439-019-01970-5] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2018] [Accepted: 01/02/2019] [Indexed: 02/07/2023]
Abstract
In the field of cancer genomics, the broad availability of genetic information offered by next-generation sequencing technologies and rapid growth in biomedical publication has led to the advent of the big-data era. Integration of artificial intelligence (AI) approaches such as machine learning, deep learning, and natural language processing (NLP) to tackle the challenges of scalability and high dimensionality of data and to transform big data into clinically actionable knowledge is expanding and becoming the foundation of precision medicine. In this paper, we review the current status and future directions of AI application in cancer genomics within the context of workflows to integrate genomic analysis for precision cancer care. The existing solutions of AI and their limitations in cancer genetic testing and diagnostics such as variant calling and interpretation are critically analyzed. Publicly available tools or algorithms for key NLP technologies in the literature mining for evidence-based clinical recommendations are reviewed and compared. In addition, the present paper highlights the challenges to AI adoption in digital healthcare with regard to data requirements, algorithmic transparency, reproducibility, and real-world assessment, and discusses the importance of preparing patients and physicians for modern digitized healthcare. We believe that AI will remain the main driver to healthcare transformation toward precision medicine, yet the unprecedented challenges posed should be addressed to ensure safety and beneficial impact to healthcare.
Affiliation(s)
- Jia Xu
- IBM Watson Health, Cambridge, MA, USA.
| | - Shang Xue
- IBM Watson Health, Cambridge, MA, USA
| | - Fang Wang
- IBM Watson Health, Cambridge, MA, USA
|
19
|
Abstract
Recent advances in technology have led to the exponential growth of scientific literature in biomedical sciences. This rapid increase in information has surpassed the threshold for manual curation efforts, necessitating the use of text mining approaches in the field of life sciences. One such application of text mining is in fostering in silico drug discovery such as drug target screening, pharmacogenomics, adverse drug event detection, etc. This chapter serves as an introduction to the applications of various text mining approaches in drug discovery. It is divided into two parts with the first half as an overview of text mining in the biosciences. The second half of the chapter reviews strategies and methods for four unique applications of text mining in drug discovery.
Affiliation(s)
- Si Zheng
- Institute of Medical Information and Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Shazia Dharssi
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Meng Wu
- Institute of Medical Information and Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Jiao Li
- Institute of Medical Information and Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA.
|
20
|
Lim S, Kang J. Chemical-gene relation extraction using recursive neural network. Database (Oxford) 2018; 2018:5042822. [PMID: 29961818 PMCID: PMC6014134 DOI: 10.1093/database/bay060] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/28/2018] [Accepted: 05/25/2018] [Indexed: 11/14/2022]
Abstract
In this article, we describe our system for the CHEMPROT task of the BioCreative VI challenge. Although considerable research on the named entity recognition of genes and drugs has been conducted, there is limited research on extracting relationships between them. Extracting relations between chemical compounds and genes from the literature is an important element in pharmacological and clinical research. The CHEMPROT task of BioCreative VI aims to promote the development of text mining systems that can be used to automatically extract relationships between chemical compounds and genes. We tested three recursive neural network approaches to improve the performance of relation extraction. In the BioCreative VI challenge, we developed a tree-Long Short-Term Memory network (tree-LSTM) model with several additional features including a position feature and a subtree containment feature, and we also applied an ensemble method. After the challenge, we applied additional pre-processing steps to the tree-LSTM model, and we tested the performance of another recursive neural network model called Stack-augmented Parser Interpreter Neural Network (SPINN). Our tree-LSTM model achieved an F-score of 58.53% in the BioCreative VI challenge. Our tree-LSTM model with additional pre-processing and the SPINN model obtained F-scores of 63.7% and 64.1%, respectively. Database URL: https://github.com/arwhirang/recursive_chemprot
Affiliation(s)
- Sangrak Lim
- Department of Computer Science and Engineering, Korea University, Anam-dong 5-ga, Seongbuk-gu, Seoul, South Korea
| | - Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Anam-dong 5-ga, Seongbuk-gu, Seoul, South Korea
|
21
|
Tawfik NS, Spruit MR. The SNPcurator: literature mining of enriched SNP-disease associations. Database (Oxford) 2018; 2018:4925332. [PMID: 29688369 PMCID: PMC5844215 DOI: 10.1093/database/bay020] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/18/2017] [Accepted: 02/05/2018] [Indexed: 01/08/2023]
Abstract
The uniqueness of each human genetic structure motivated the shift from the current practice of medicine to a more tailored one. This personalized medicine revolution would not be possible today without the genetics data collected from genome-wide association studies (GWASs) that investigate the relation between different phenotypic traits and single-nucleotide polymorphisms (SNPs). The huge increase in the literature publication space imposes a challenge on the conventional manual curation process which is becoming more and more expensive. This research aims at automatically extracting SNP associations of any given disease and their reported statistical significance (P-value) and odds ratio as well as cohort information such as size and ethnicity. Our evaluation illustrates that SNPcurator was able to replicate a large number of SNP-disease associations that were also reported in the NHGRI-EBI Catalog of published GWASs. SNPcurator was also tested by eight external genetics experts, who queried the system to examine diseases of their choice, and was found to be efficient and satisfactory. We conclude that the text-mining-based system has a great potential for helping researchers and scientists, especially in their preliminary genetics research. SNPcurator is publicly available at http://snpcurator.science.uu.nl/. Database URL: http://snpcurator.science.uu.nl/
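As a rough illustration of the kind of extraction this abstract describes (SNP identifier, P-value, and odds ratio from a sentence), a minimal Python sketch follows. It does not reproduce SNPcurator's pipeline; the patterns handle only simple decimal and e-notation forms, and the names `RSID_RE`, `PVALUE_RE`, `OR_RE`, and `extract_association` are hypothetical.

```python
import re

# Hypothetical, simplified patterns; real abstracts use many more notations.
RSID_RE = re.compile(r"\brs\d+\b")
PVALUE_RE = re.compile(
    r"\bp(?:-value)?\s*[=<]\s*(\d+(?:\.\d+)?(?:e-?\d+)?)", re.IGNORECASE
)
# Case-sensitive "OR" so that the ordinary word "or" is not matched.
OR_RE = re.compile(r"\b(?:[Oo]dds ratio|OR)\s*(?:=|of)?\s*(\d+(?:\.\d+)?)")


def extract_association(sentence):
    """Return the rsIDs plus the first P-value and odds ratio found, or None."""
    p_match = PVALUE_RE.search(sentence)
    or_match = OR_RE.search(sentence)
    return {
        "rsids": RSID_RE.findall(sentence),
        "p_value": float(p_match.group(1)) if p_match else None,
        "odds_ratio": float(or_match.group(1)) if or_match else None,
    }
```

For an invented sentence such as "rs123456 was associated with the disease (OR = 1.50, P = 3e-8).", the function returns the rsID together with the two numbers as floats. Cohort size and ethnicity, which the abstract also mentions, are outside the scope of this sketch.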
Affiliation(s)
- Noha S Tawfik
- Computer Engineering Department, College of Engineering, Arab Academy for Science, Technology, and Maritime Transport (AAST), Abukir, 1029 Alexandria, Egypt.
- Department of Information and Computing Sciences, Utrecht University, Princetonplein 5, 3584 CC Utrecht, The Netherlands
| | - Marco R Spruit
- Department of Information and Computing Sciences, Utrecht University, Princetonplein 5, 3584 CC Utrecht, The Netherlands
|
22
|
Wei CH, Phan L, Feltz J, Maiti R, Hefferon T, Lu Z. tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine. Bioinformatics 2018; 34:80-87. [PMID: 28968638 DOI: 10.1093/bioinformatics/btx541] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2017] [Accepted: 08/31/2017] [Indexed: 11/12/2022] Open
Abstract
Motivation Despite significant efforts in expert curation, clinical relevance about most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the variant biological function/disease impact is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques but are of limited use because their mutation extraction results are not standardized or integrated with curated data. Results We propose an automatic method to extract and normalize variant mentions to unique identifiers (dbSNP RSIDs). Our method, in benchmarking results, demonstrates a high F-measure of ∼90% and compared favorably to the state of the art. Next, we applied our approach to the entire PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes which are presumed to be deleterious and not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating/prioritizing variants in genomic research. Availability and implementation The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/. Contact zhiyong.lu@nih.gov.
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
| | - Lon Phan
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
| | - Juliana Feltz
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
| | - Rama Maiti
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
| | - Tim Hefferon
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
|
23
|
Kordopati V, Salhi A, Razali R, Radovanovic A, Tifratene F, Uludag M, Li Y, Bokhari A, AlSaieedi A, Bin Raies A, Van Neste C, Essack M, Bajic VB. DES-Mutation: System for Exploring Links of Mutations and Diseases. Sci Rep 2018; 8:13359. [PMID: 30190574 PMCID: PMC6127254 DOI: 10.1038/s41598-018-31439-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 12/18/2017] [Accepted: 08/17/2018] [Indexed: 12/17/2022]
Abstract
During cellular division DNA replicates and this process is the basis for passing genetic information to the next generation. However, the DNA copy process sometimes produces a copy that is not perfect, that is, one with mutations. The collection of all such mutations in the DNA copy of an organism makes it unique and determines the organism’s phenotype. However, mutations are often the cause of diseases. Thus, it is useful to have the capability to explore links between mutations and disease. We approached this problem by analyzing a vast amount of published information linking mutations to disease states. Based on such information, we developed the DES-Mutation knowledgebase which allows for exploration of not only mutation-disease links, but also links between mutations and concepts from 27 topic-specific dictionaries such as human genes/proteins, toxins, pathogens, etc. This allows for a more detailed insight into mutation-disease links and context. On a sample of 600 mutation-disease associations predicted and curated, our system achieves precision of 72.83%. To demonstrate the utility of DES-Mutation, we provide case studies related to known or potentially novel information involving disease mutations. To our knowledge, this is the first mutation-disease knowledgebase dedicated to the exploration of this topic through text-mining and data-mining of different mutation types and their associations with terms from multiple thematic dictionaries.
Affiliation(s)
- Vasiliki Kordopati
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Adil Salhi
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Rozaimi Razali
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Aleksandar Radovanovic
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Faroug Tifratene
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Mahmut Uludag
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Yu Li
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Ameerah Bokhari
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Ahdab AlSaieedi
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia.,King Abdulaziz University (KAU), Faculty of Applied Medical Sciences (FAMS), Department of Medical Laboratory Technology (MLT), Jeddah, 21589-80324, Saudi Arabia
| | - Arwa Bin Raies
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Christophe Van Neste
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia.,Ghent University, Center for Medical Genetics Ghent (CMGG), B-9000, Ghent, Belgium
| | - Magbubah Essack
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Vladimir B Bajic
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia.
|
24
|
Lee K, Famiglietti ML, McMahon A, Wei CH, MacArthur JAL, Poux S, Breuza L, Bridge A, Cunningham F, Xenarios I, Lu Z. Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLoS Comput Biol 2018; 14:e1006390. [PMID: 30102703 PMCID: PMC6107285 DOI: 10.1371/journal.pcbi.1006390] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 04/05/2018] [Revised: 08/23/2018] [Accepted: 07/24/2018] [Indexed: 11/18/2022]
Abstract
Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, which is also known as document triage, is usually carried out by querying and reading articles in PubMed. However, this query-based method often obtains unsatisfactory precision and recall on the retrieved results, and it is difficult to manually generate optimal queries. To address this, we propose a machine-learning assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation process of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine learning-based method can make the triage process more efficient and is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases.
As the volume of literature on genomic variants continues to grow at an increasing rate, it is becoming more difficult for a curator of a variant knowledge base to keep up with and curate all the published papers. Here, we suggest a deep learning-based literature triage method for genomic variation resources. Our method achieves state-of-the-art performance on the triage task. Moreover, our model does not require any laborious preprocessing or feature engineering steps, which are required for traditional machine learning triage methods. We applied our method to the literature triage process of UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog for genomic variation by collaborating with the database curators. Both manual curation teams confirmed that our method achieved higher precision than their previous query-based triage methods without compromising recall. Both results show that our method is more efficient and can replace the traditional query-based triage methods of manually curated databases. Our method can give human curators more time to focus on more challenging tasks such as actual curation as well as the discovery of novel papers/experimental techniques to consider for inclusion.
Affiliation(s)
- Kyubum Lee
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | | | - Aoife McMahon
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Jacqueline Ann Langdon MacArthur
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
| | - Sylvain Poux
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Lionel Breuza
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Alan Bridge
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Fiona Cunningham
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
| | - Ioannis Xenarios
- Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland.,Department of Chemistry and Biochemistry, University of Geneva, Geneva, Switzerland
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
|
25
|
Lee K, Kim B, Choi Y, Kim S, Shin W, Lee S, Park S, Kim S, Tan AC, Kang J. Deep learning of mutation-gene-drug relations from the literature. BMC Bioinformatics 2018; 19:21. [PMID: 29368597 PMCID: PMC5784504 DOI: 10.1186/s12859-018-2029-1] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 04/25/2017] [Accepted: 01/17/2018] [Indexed: 12/31/2022]
Abstract
Background Molecular biomarkers that can predict drug efficacy in cancer patients are crucial components for the advancement of precision medicine. However, identifying these molecular biomarkers remains a laborious and challenging task. Next-generation sequencing of patients and preclinical models have increasingly led to the identification of novel gene-mutation-drug relations, and these results have been reported and published in the scientific literature. Results Here, we present two new computational methods that utilize all the PubMed articles as domain specific background knowledge to assist in the extraction and curation of gene-mutation-drug relations from the literature. The first method uses the Biomedical Entity Search Tool (BEST) scoring results as some of the features to train the machine learning classifiers. The second method uses not only the BEST scoring results, but also word vectors in a deep convolutional neural network model that are constructed from and trained on numerous documents such as PubMed abstracts and Google News articles. Using the features obtained from both the BEST search engine scores and word vectors, we extract mutation-gene and mutation-drug relations from the literature using machine learning classifiers such as random forest and deep convolutional neural networks. Our methods achieved better results compared with the state-of-the-art methods. We used our proposed features in a simple machine learning model, and obtained F1-scores of 0.96 and 0.82 for mutation-gene and mutation-drug relation classification, respectively. We also developed a deep learning classification model using convolutional neural networks, BEST scores, and the word embeddings that are pre-trained on PubMed or Google News data. Using deep learning, the classification accuracy improved, and F1-scores of 0.96 and 0.86 were obtained for the mutation-gene and mutation-drug relations, respectively. 
Conclusion We believe that our computational methods described in this research could be used as an important tool in identifying molecular biomarkers that predict drug responses in cancer patients. We also built a database of these mutation-gene-drug relations that were extracted from all the PubMed abstracts. We believe that our database can prove to be a valuable resource for precision medicine researchers. Electronic supplementary material The online version of this article (10.1186/s12859-018-2029-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Kyubum Lee
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
| | - Byounggun Kim
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, South Korea
| | - Yonghwa Choi
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
| | - Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
| | - Wonho Shin
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, South Korea
| | - Sunwon Lee
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
| | - Sungjoon Park
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
| | - Seongsoon Kim
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
| | - Aik Choon Tan
- Translational Bioinformatics and Cancer Systems Biology Laboratory, Division of Medical Oncology, Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA.
| | - Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea.,Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, South Korea.
|
26
|
Chen Q, Panyam NC, Elangovan A, Verspoor K. BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics. Database (Oxford) 2018; 2018:5255181. [PMID: 30576491 PMCID: PMC6301335 DOI: 10.1093/database/bay122] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 04/02/2018] [Revised: 09/24/2018] [Accepted: 10/16/2018] [Indexed: 01/01/2023]
Abstract
Precision medicine aims to provide personalized treatments based on individual patient profiles. One critical step towards precision medicine is leveraging knowledge derived from biomedical publications, a tremendous literature resource presenting the latest scientific discoveries on genes, mutations and diseases. Biomedical natural language processing (BioNLP) plays a vital role in supporting automation of this process. BioCreative VI Track 4 brings community effort to the task of automatically identifying and extracting protein-protein interactions (PPI) affected by mutations (PPIm), important in the precision medicine context for capturing individual genotype variation related to disease. We present the READ-BioMed team's approach to identifying PPIm-related publications and to extracting specific PPIm information from those publications in the context of the BioCreative VI PPIm track. We observe that current BioNLP tools are insufficient to recognise entities for these two tasks; the best existing mutation recognition tool achieves only 55% recall in the document triage training set, while relation extraction performance is limited by the low recall performance of gene entity recognition. We develop the models accordingly: for document triage, we develop term lists capturing interactions and mutations to complement BioNLP tools, and select effective features via a feature contribution study, whereas an ensemble of BioNLP tools is employed for relation extraction. Our best document triage model achieves an F-score of 66.77% while our best model for relation extraction achieves an F-score of 35.09% over the final (updated post-task) test set. Impacting the document triage task, the characteristics of mutations are statistically different in the training and testing sets. 
While a vital new direction for biomedical text mining research, this early attempt to tackle the problem of identifying genetic variation of substantial biological significance highlights the importance of representative training data and the cascading impact of tool limitations in a modular system.
Affiliation(s)
- Qingyu Chen
- School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
| | - Nagesh C Panyam
- School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
| | - Aparna Elangovan
- School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
| | - Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
|
27
|
Khordad M, Mercer RE. Identifying genotype-phenotype relationships in biomedical text. J Biomed Semantics 2017; 8:57. [PMID: 29212530 PMCID: PMC5719522 DOI: 10.1186/s13326-017-0163-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 07/12/2016] [Accepted: 10/28/2017] [Indexed: 11/10/2022]
Abstract
BACKGROUND One important type of information contained in biomedical research literature is the newly discovered relationships between phenotypes and genotypes. Because of the large quantity of literature, a reliable automatic system to identify this information for future curation is essential. Such a system provides important and up-to-date data for database construction and updating, and even text summarization. In this paper we present a machine learning method to identify these genotype-phenotype relationships. No large human-annotated corpus of genotype-phenotype relationships currently exists. So, a semi-automatic approach has been used to annotate a small labelled training set and a self-training method is proposed to annotate more sentences and enlarge the training set. RESULTS The resulting machine-learned model was evaluated using a separate test set annotated by an expert. The results show that using only the small training set in a supervised learning method achieves good results (precision: 76.47, recall: 77.61, F-measure: 77.03) which are improved by applying a self-training method (precision: 77.70, recall: 77.84, F-measure: 77.77). CONCLUSIONS Relationships between genotypes and phenotypes are biomedical information pivotal to the understanding of a patient's situation. Our proposed method is the first attempt to make a specialized system to identify genotype-phenotype relationships in biomedical literature. We achieve good results using a small training set. To improve the results other linguistic contexts need to be explored and an appropriately enlarged training set is required.
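The F-measures reported in this abstract can be related to the stated precision and recall values with a short calculation. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` is illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumption: the paper reports the balanced (beta = 1) F-measure.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract (percent).
supervised = f_measure(76.47, 77.61)
self_trained = f_measure(77.70, 77.84)

print(f"supervised:   {supervised:.2f}")
print(f"self-trained: {self_trained:.2f}")
```

Because the published precision and recall are already rounded, the recomputed values may differ from the reported 77.03 and 77.77 in the last decimal place.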
Affiliation(s)
- Maryam Khordad
- Department of Computer Science, University of Western Ontario, 1151 Richmond Street, London, N6A 5B7 Canada
| | - Robert E. Mercer
- Department of Computer Science, University of Western Ontario, 1151 Richmond Street, London, N6A 5B7 Canada
|
28
|
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
| | - Obdulia Rabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
| | - Anália Lourenço
- ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain.,Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain.,CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
| | - Julen Oyarzabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
| | - Alfonso Valencia
- Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain.,Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain.,Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
|
29
|
Abstract
Deciphering gene–disease association is a crucial step in designing therapeutic strategies against diseases. There are experimental methods for identifying gene–disease associations, such as genome-wide association studies and linkage analysis, but these can be expensive and time consuming. As a result, various in silico methods for predicting associations from these and other data have been developed using different approaches. In this article, we review some of the recent approaches to the computational prediction of gene–disease association. We look at recent advancements in algorithms, categorising them into those based on genome variation, networks, text mining, and crowdsourcing. We also look at some of the challenges faced in the computational prediction of gene–disease associations.
Affiliation(s)
- Kenneth Opap
- University of Cape Town, Cape Town, South Africa
|
30
|
Bokharaeian B, Diaz A, Taghizadeh N, Chitsaz H, Chavoshinejad R. SNPPhenA: a corpus for extracting ranked associations of single-nucleotide polymorphisms and phenotypes from literature. J Biomed Semantics 2017; 8:14. [PMID: 28388928 PMCID: PMC5383945 DOI: 10.1186/s13326-017-0116-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 07/05/2016] [Accepted: 01/13/2017] [Indexed: 11/17/2022]
Abstract
Background Single Nucleotide Polymorphisms (SNPs) are among the most important types of genetic variations influencing common diseases and phenotypes. Recently, some corpora and methods have been developed with the purpose of extracting mutations and diseases from texts. However, there is no available corpus, for extracting associations from texts, that is annotated with linguistic-based negation, modality markers, neutral candidates, and confidence level of associations. Method In this research, different steps were presented so as to produce the SNPPhenA corpus. They include automatic Named Entity Recognition (NER) followed by the manual annotation of SNP and phenotype names, annotation of the SNP-phenotype associations and their level of confidence, as well as modality markers. Moreover, the produced corpus was annotated with negation scopes and cues as well as neutral candidates that play crucial role as far as negation and the modality phenomenon in relation to extraction tasks. Result The agreement between annotators was measured by Cohen’s Kappa coefficient where the resulting scores indicated the reliability of the corpus. The Kappa score was 0.79 for annotating the associations and 0.80 for the confidence degree of associations. Further presented were the basic statistics of the annotated features of the corpus in addition to the results of our first experiments related to the extraction of ranked SNP-Phenotype associations. The prepared guideline documents render the corpus more convenient and facile to use. The corpus, guidelines and inter-annotator agreement analysis are available on the website of the corpus: http://nil.fdi.ucm.es/?q=node/639. Conclusion Specifying the confidence degree of SNP-phenotype associations from articles helps identify the strength of associations that could in turn assist genomics scientists in determining phenotypic plasticity and the importance of environmental factors. 
What is more, our first experiments with the corpus show that linguistic-based confidence alongside other non-linguistic features can be utilized in order to estimate the strength of the observed SNP-phenotype associations. Trial Registration: Not Applicable Electronic supplementary material The online version of this article (doi:10.1186/s13326-017-0116-2) contains supplementary material, which is available to authorized users.
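The inter-annotator agreement figures quoted in this abstract are Cohen's Kappa scores. As a rough illustration of how such a score is computed for two annotators, here is a minimal sketch; the label values and the function name `cohens_kappa` are hypothetical and do not come from the SNPPhenA corpus or its tooling.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations (hypothetical, for illustration only).
annotator_1 = ["positive", "positive", "negative", "negative"]
annotator_2 = ["positive", "negative", "negative", "negative"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is commonly preferred over plain percentage agreement when label distributions are skewed.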
Affiliation(s)
- Behrouz Bokharaeian
- Facultad informatica, Complutense University of Madrid, Calle Profesor José García Santesmases, 9, 28040, Madrid, Spain.
| | - Alberto Diaz
- Facultad informatica, Complutense University of Madrid, Calle Profesor José García Santesmases, 9, 28040, Madrid, Spain
| | - Nasrin Taghizadeh
- School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
| | - Hamidreza Chitsaz
- Department of Computer Science, Colorado State University, Fort Collins, CO, 80523, USA
| | - Ramyar Chavoshinejad
- External Collaborator, Reproductive Biomedicine Research Center, Royan Institute for Reproductive Biomedicine, Tehran, Iran
|
31
|
Singhal A, Simmons M, Lu Z. Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine. PLoS Comput Biol 2016; 12:e1005017. [PMID: 27902695 PMCID: PMC5130168 DOI: 10.1371/journal.pcbi.1005017] [Citation(s) in RCA: 66] [Impact Index Per Article: 8.3] [Received: 02/09/2016] [Accepted: 06/04/2016] [Indexed: 11/23/2022]
Abstract
The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting such triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of disease-gene-variant triplets from biomedical literature. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer’s disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors. 
Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships. To provide personalized health care it is important to understand patients’ genomic variations and the effect these variants have in protecting or predisposing patients to disease. Several projects aim at providing this information by manually curating such genotype-phenotype relationships in organized databases using data from clinical trials and biomedical literature. However, the exponentially increasing size of biomedical literature and the limited ability of manual curators to discover the genotype-phenotype relationships “hidden” in text has led to delays in keeping such databases updated with the current findings. The result is a bottleneck in leveraging valuable information that is currently available to develop personalized health care solutions. In the past, a few computational techniques have attempted to speed up the curation efforts by using text mining techniques to automatically mine genotype-phenotype information from biomedical literature. However, such computational approaches have not been able to achieve accuracy levels sufficient to make them appealing for practical use. In this work, we present a highly accurate machine-learning-based text mining approach for mining complete genotype-phenotype relationships from biomedical literature. We test the performance of this approach on ten well-known diseases and demonstrate the validity of our approach and its potential utility for practical purposes. 
We are currently working towards generating genotype-phenotype relationships for all PubMed data with the goal of developing an exhaustive database of all the known diseases in life science. We believe that this work will provide very important and needed support for implementation of personalized health care using genomic data.
Affiliation(s)
- Ayush Singhal
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Michael Simmons
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
|
32
|
Lee HC, Hsu YY, Kao HY. AuDis: an automatic CRF-enhanced disease normalization in biomedical text. Database (Oxford) 2016; 2016:baw091. [PMID: 27278815 PMCID: PMC4897593 DOI: 10.1093/database/baw091] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 12/04/2015] [Accepted: 05/09/2016] [Indexed: 01/22/2023]
Abstract
Diseases play central roles in many areas of biomedical research and healthcare. Consequently, aggregating disease knowledge and treatment research reports has become a critical issue, especially in rapidly growing knowledge bases (e.g. PubMed). We therefore developed a system, AuDis, for disease mention recognition and normalization in biomedical texts. Our system uses a second-order conditional random fields model. To optimize the results, we customize several post-processing steps, including abbreviation resolution, consistency improvement and stopword filtering. In the official evaluation of the CDR task in BioCreative V, AuDis obtained the best performance (F-score of 86.46%) among 40 runs (16 unique teams) on disease normalization in the DNER subtask. These results suggest that AuDis is a high-performance system for disease recognition and normalization from biomedical literature. Database URL: http://ikmlab.csie.ncku.edu.tw/CDR2015/AuDis.html.
Affiliation(s)
- Hsin-Chun Lee
- Institute of Medical Informatics, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Yi-Yu Hsu
- Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Hung-Yu Kao
- Institute of Medical Informatics, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
|
33
|
Matos S, Campos D, Pinho R, Silva RM, Mort M, Cooper DN, Oliveira JL. Mining clinical attributes of genomic variants through assisted literature curation in Egas. Database (Oxford) 2016; 2016:baw096. [PMID: 27278817 PMCID: PMC4897594 DOI: 10.1093/database/baw096] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 12/04/2015] [Accepted: 05/15/2016] [Indexed: 01/08/2023]
Abstract
The veritable deluge of biological data over recent years has led to the establishment of a considerable number of knowledge resources that compile curated information extracted from the literature and store it in structured form, facilitating its use and exploitation. In this article, we focus on the curation of inherited genetic variants and associated clinical attributes, such as zygosity, penetrance or inheritance mode, and describe the use of Egas for this task. Egas is a web-based platform for text-mining assisted literature curation that focuses on usability through modern design solutions and simple user interactions. Egas offers a flexible and customizable tool that allows defining the concept types and relations of interest for a given annotation task, as well as the ontologies used for normalizing each concept type. Further, annotations may be performed on raw documents or on the results of automated concept identification and relation extraction tools. Users can inspect, correct or remove automatic text-mining results, manually add new annotations, and export the results to standard formats. Egas is compatible with the most recent versions of Google Chrome, Mozilla Firefox, Internet Explorer and Safari and is available for use at https://demo.bmd-software.com/egas/. Database URL: https://demo.bmd-software.com/egas/
Affiliation(s)
- Sérgio Matos
- IEETA/DETI, University of Aveiro, Aveiro, 3810-193, Portugal
- Renato Pinho
- IEETA/DETI, University of Aveiro, Aveiro, 3810-193, Portugal
- Raquel M Silva
- IEETA/DETI, University of Aveiro, Aveiro, 3810-193, Portugal
- Department of Medical Sciences, iBiMED, University of Aveiro, Aveiro, 3810-193, Portugal
- David N Cooper
- Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, UK
|
34
|
Thomas P, Rocktäschel T, Hakenberg J, Lichtblau Y, Leser U. SETH detects and normalizes genetic variants in text. Bioinformatics 2016; 32:2883-5. [PMID: 27256315 DOI: 10.1093/bioinformatics/btw234] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 12/28/2015] [Accepted: 04/18/2016] [Indexed: 11/14/2022]
Abstract
Descriptions of genetic variations and their effect are widely spread across the biomedical literature. However, finding all mentions of a specific variation, or all mentions of variations in a specific gene, is difficult to achieve due to the many ways such variations are described. Here, we describe SETH, a tool for the recognition of variations from text and their subsequent normalization to dbSNP or UniProt. SETH achieves high precision and recall on several evaluation corpora of PubMed abstracts. It is freely available and encompasses stand-alone scripts for isolated application and evaluation as well as thorough documentation for integration into other applications. AVAILABILITY AND IMPLEMENTATION: SETH is released under the Apache 2.0 license and can be downloaded from http://rockt.github.io/SETH/. CONTACT: thomas@informatik.hu-berlin.de or leser@informatik.hu-berlin.de.
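To make the recognition step described in this abstract more concrete, the following is a minimal Python sketch of pattern-based variant mention detection. It is not SETH's implementation, which the abstract does not detail; the pattern names, the function `find_variant_mentions`, and the example sentence are all hypothetical and chosen only for illustration. The sketch covers three common surface forms: one-letter protein substitutions (e.g. V600E), three-letter protein substitutions with a `p.` prefix, and dbSNP rs identifiers.

```python
import re

# Hypothetical, deliberately small pattern set; real tools handle many more forms.
PATTERNS = {
    # One-letter amino acid substitution such as V600E
    "protein_1letter": re.compile(
        r"\b[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY]\b"
    ),
    # Three-letter substitution with a "p." prefix such as p.Gly12Asp
    "protein_3letter": re.compile(r"\bp\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}\b"),
    # dbSNP reference SNP identifier such as rs113488022
    "dbsnp_id": re.compile(r"\brs\d+\b"),
}


def find_variant_mentions(text):
    """Return (offset, mention, kind) tuples sorted by position in the text."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(0), kind))
    return sorted(hits)


# Hypothetical example sentence, for illustration only.
sentence = "The V600E mutation in BRAF (rs113488022) and p.Gly12Asp in KRAS"
for offset, mention, kind in find_variant_mentions(sentence):
    print(offset, mention, kind)
```

A pattern set this small illustrates why the task is hard: the one-letter pattern would also match unrelated tokens of the same shape, such as the cell line name T47D, and it does nothing to normalize a mention to a database record.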
Affiliation(s)
- Philippe Thomas
- Language Technology Lab, DFKI Berlin, Germany
- Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität Zu Berlin, Unter Den Linden 6, Berlin 10099, Germany
- Jörg Hakenberg
- Illumina, Inc, 451 El Camino Real, Santa Clara, CA 95050, USA
- Yvonne Lichtblau
- Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität Zu Berlin, Unter Den Linden 6, Berlin 10099, Germany
- Ulf Leser
- Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität Zu Berlin, Unter Den Linden 6, Berlin 10099, Germany
|
35
|
Singhal A, Simmons M, Lu Z. Text mining for precision medicine: automating disease-mutation relationship extraction from biomedical literature. J Am Med Inform Assoc 2016; 23:766-72. [PMID: 27121612 DOI: 10.1093/jamia/ocw041] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.6] [Received: 09/15/2015] [Accepted: 02/19/2016] [Indexed: 11/14/2022]
Abstract
OBJECTIVE: Identifying disease-mutation relationships is a significant challenge in the advancement of precision medicine. The aim of this work is to design a tool that automates the extraction of disease-related mutations from biomedical text to advance database curation for the support of precision medicine. MATERIALS AND METHODS: We developed a machine-learning (ML) based method to automatically identify the mutations mentioned in the biomedical literature related to a particular disease. In order to predict a relationship between the mutation and the target disease, several features, such as statistical features, distance features, and sentiment features, were constructed. Our ML model was trained with a pre-labeled dataset consisting of manually curated information about mutation-disease associations. The model was subsequently used to extract disease-related mutations from larger biomedical literature corpora. RESULTS: The performance of the proposed approach was assessed using a benchmarking dataset. Results show that our proposed approach gains significant improvement over the previous state of the art and obtains F-measures of 0.880 and 0.845 for prostate and breast cancer mutations, respectively. DISCUSSION: To demonstrate its utility, we applied our approach to all abstracts in PubMed for 3 diseases (including a non-cancer disease). The mutations extracted were then manually validated against human-curated databases. The validation results show that the proposed approach is useful in a real-world setting to extract uncurated disease mutations from the biomedical literature. CONCLUSIONS: The proposed approach improves the state of the art for mutation-disease extraction from text. It is scalable and generalizable to identify mutations for any disease at a PubMed scale.
Affiliation(s)
- Ayush Singhal
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health, Bethesda, MD, USA
- Michael Simmons
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health, Bethesda, MD, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health, Bethesda, MD, USA
|
36
|
Mahmood ASMA, Wu TJ, Mazumder R, Vijay-Shanker K. DiMeX: A Text Mining System for Mutation-Disease Association Extraction. PLoS One 2016; 11:e0152725. [PMID: 27073839 PMCID: PMC4830514 DOI: 10.1371/journal.pone.0152725] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Received: 04/17/2015] [Accepted: 03/19/2016] [Indexed: 11/22/2022]
Abstract
The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation to disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has been also evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied on a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases.
Affiliation(s)
- A. S. M. Ashique Mahmood
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
- Tsung-Jung Wu
- Department of Biochemistry and Molecular Medicine, George Washington University, Washington, District of Columbia, United States of America
- Raja Mazumder
- Department of Biochemistry and Molecular Medicine, George Washington University, Washington, District of Columbia, United States of America
- McCormick Genomic and Proteomic Center, George Washington University, Washington, District of Columbia, United States of America
- K. Vijay-Shanker
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
|
37
|
Lee K, Lee S, Park S, Kim S, Kim S, Choi K, Tan AC, Kang J. BRONCO: Biomedical entity Relation ONcology COrpus for extracting gene-variant-disease-drug relations. Database (Oxford) 2016; 2016:baw043. [PMID: 27074804 PMCID: PMC4830473 DOI: 10.1093/database/baw043] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 10/03/2015] [Accepted: 03/09/2016] [Indexed: 12/31/2022]
Abstract
Comprehensive knowledge of genomic variants in a biological context is key for precision medicine. As next-generation sequencing technologies improve, the amount of literature containing genomic variant data, such as new functions or related phenotypes, rapidly increases. Because numerous articles are published every day, it is almost impossible to manually curate all the variant information from the literature. Many researchers focus on creating an improved automated biomedical natural language processing (BioNLP) method that extracts useful variants and their functional information from the literature. However, there is no gold-standard data set that contains texts annotated with variants and their related functions. To overcome these limitations, we introduce a Biomedical entity Relation ONcology COrpus (BRONCO) that contains more than 400 variants and their relations with genes, diseases, drugs and cell lines in the context of cancer and anti-tumor drug screening research. The variants and their relations were manually extracted from 108 full-text articles. BRONCO can be utilized to evaluate and train new methods used for extracting biomedical entity relations from full-text publications, and thus be a valuable resource to the biomedical text mining research community. Using BRONCO, we quantitatively and qualitatively evaluated the performance of three state-of-the-art BioNLP methods. We also identified their shortcomings, and suggested remedies for each method. We implemented post-processing modules for the three BioNLP methods, which improved their performance. Database URL: http://infos.korea.ac.kr/bronco
Affiliation(s)
- Kyubum Lee
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
- Sunwon Lee
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
- Sungjoon Park
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
- Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
- Suhkyung Kim
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
- Kwanghun Choi
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
- Aik Choon Tan
- Translational Bioinformatics and Cancer Systems Biology Laboratory, Division of Medical Oncology, Department of Medicine, University of Colorado Anschutz Medical Campus, 12801 East 17th Avenue Aurora, CO 80045, USA
- Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Korea
|
38
|
Wei CH, Leaman R, Lu Z. Beyond accuracy: creating interoperable and scalable text-mining web services. Bioinformatics 2016; 32:1907-10. [PMID: 26883486 DOI: 10.1093/bioinformatics/btv760] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 08/25/2015] [Accepted: 12/21/2015] [Indexed: 11/13/2022]
Abstract
The biomedical literature is a knowledge-rich resource and an important foundation for future research. With over 24 million articles in PubMed and an increasing growth rate, research in automated text processing is becoming increasingly important. We report here our recently developed web-based text mining services for biomedical concept recognition and normalization. Unlike most text-mining software tools, our web services integrate several state-of-the-art entity tagging systems (DNorm, GNormPlus, SR4GN, tmChem and tmVar) and offer a batch-processing mode able to process arbitrary text input (e.g. scholarly publications, patents and medical records) in multiple formats (e.g. BioC). We support multiple standards to make our service interoperable and allow simpler integration with other text-processing pipelines. To maximize scalability, we have preprocessed all PubMed articles, and use a computer cluster for processing large requests of arbitrary text. AVAILABILITY AND IMPLEMENTATION: Our text-mining web service is freely available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#curl CONTACT: Zhiyong.Lu@nih.gov.
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
|
39
|
Text Mining for Precision Medicine: Bringing Structure to EHRs and Biomedical Literature to Understand Genes and Health. Adv Exp Med Biol 2016; 939:139-166. [PMID: 27807747 DOI: 10.1007/978-981-10-1503-8_7] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Indexed: 01/30/2023]
Abstract
The key question of precision medicine is whether it is possible to find clinically actionable granularity in diagnosing disease and classifying patient risk. The advent of next-generation sequencing and the widespread adoption of electronic health records (EHRs) have provided clinicians and researchers a wealth of data and made possible the precise characterization of individual patient genotypes and phenotypes. Unstructured text, found in biomedical publications and clinical notes, is an important component of genotype and phenotype knowledge. Publications in the biomedical literature provide essential information for interpreting genetic data. Likewise, clinical notes contain the richest source of phenotype information in EHRs. Text mining can render these texts computationally accessible and support information extraction and hypothesis generation. This chapter reviews the mechanics of text mining in precision medicine and discusses several specific use cases, including database curation for personalized cancer medicine, patient outcome prediction from EHR-derived cohorts, and pharmacogenomic research. Taken as a whole, these use cases demonstrate how text mining enables effective utilization of existing knowledge sources and thus promotes increased value for patients and healthcare systems. Text mining is an indispensable tool for translating genotype-phenotype data into effective clinical care that will undoubtedly play an important role in the eventual realization of precision medicine.
|
40
|
Li G, Ross KE, Arighi CN, Peng Y, Wu CH, Vijay-Shanker K. miRTex: A Text Mining System for miRNA-Gene Relation Extraction. PLoS Comput Biol 2015; 11:e1004391. [PMID: 26407127 PMCID: PMC4583433 DOI: 10.1371/journal.pcbi.1004391] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.7] [Received: 01/29/2015] [Accepted: 06/08/2015] [Indexed: 12/27/2022]
Abstract
MicroRNAs (miRNAs) regulate a wide range of cellular and developmental processes through gene expression suppression or mRNA degradation. Experimentally validated miRNA gene targets are often reported in the literature. In this paper, we describe miRTex, a text mining system that extracts miRNA-target relations, as well as miRNA-gene and gene-miRNA regulation relations. The system achieves good precision and recall when evaluated on a literature corpus of 150 abstracts with F-scores close to 0.90 on the three different types of relations. We conducted full-scale text mining using miRTex to process all the Medline abstracts and all the full-length articles in the PubMed Central Open Access Subset. The results for all the Medline abstracts are stored in a database for interactive query and file download via the website at http://proteininformationresource.org/mirtex. Using miRTex, we identified genes potentially regulated by miRNAs in Triple Negative Breast Cancer, as well as miRNA-gene relations that, in conjunction with kinase-substrate relations, regulate the response to abiotic stress in Arabidopsis thaliana. These two use cases demonstrate the usefulness of miRTex text mining in the analysis of miRNA-regulated biological processes. MicroRNAs (miRNAs) are an important class of RNAs that regulate a wide range of biological processes by post-transcriptional regulation of gene expression. The amount of literature describing experimentally validated miRNA targets is increasing rapidly, which poses a challenge to researchers and biocurators to stay up-to-date with the available information. Text mining methods have been used to extract miRNA-gene associated pairs and assist in curation. In this paper, we describe miRTex, a text mining system that extracts miRNA-target, miRNA-gene regulation and gene-miRNA regulation relations. 
We evaluate miRTex performance on two corpora, and show that the elaborate use of lexico-syntactic information and linguistic generalizations enables it to achieve state-of-the-art performance. We have processed all the Medline abstracts and all the full-length articles in the PubMed Central Open Access Subset with miRTex, and provide a website to access the extraction results from all the Medline abstracts. The full-scale text mining results will be a useful resource for miRNA researchers, while the miRTex tool itself can be integrated into literature-based curation pipelines. We present two use cases (for animal and plant miRNAs, respectively) that show how the full-scale text mining can be used in combination with other bioinformatics resources to gain insight into biological processes.
Affiliation(s)
- Gang Li
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
- Karen E. Ross
- Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, United States of America
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware, United States of America
- Cecilia N. Arighi
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware, United States of America
- Yifan Peng
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
- Cathy H. Wu
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
- Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, United States of America
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware, United States of America
- K. Vijay-Shanker
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
|
41
|
GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains. Biomed Res Int 2015; 2015:918710. [PMID: 26380306 PMCID: PMC4561873 DOI: 10.1155/2015/918710] [Citation(s) in RCA: 111] [Impact Index Per Article: 12.3] [Received: 01/15/2015] [Revised: 04/03/2015] [Accepted: 04/04/2015] [Indexed: 02/01/2023]
Abstract
The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are publicly available and these tools are typically restricted to detecting only mention level gene names or only document level gene identifiers. In this work, we report GNormPlus: an end-to-end and open source system that handles both gene mention and identifier detection. We created a new corpus of 694 PubMed articles to support our development of GNormPlus, containing manual annotations for not only gene names and their identifiers, but also closely related concepts useful for gene name disambiguation, such as gene families and protein domains. GNormPlus integrates several advanced text-mining techniques, including SimConcept for resolving composite gene names. As a result, GNormPlus compares favorably to other state-of-the-art methods when evaluated on two widely used public benchmarking datasets, achieving 86.7% F1-score on the BioCreative II Gene Normalization task dataset and 50.1% F1-score on the BioCreative III Gene Normalization task dataset. The GNormPlus source code and its annotated corpus are freely available, and the results of applying GNormPlus to the entire PubMed are freely accessible through our web-based tool PubTator.
|
42
|
Ravikumar KE, Wagholikar KB, Li D, Kocher JP, Liu H. Text mining facilitates database curation - extraction of mutation-disease associations from Bio-medical literature. BMC Bioinformatics 2015; 16:185. [PMID: 26047637 PMCID: PMC4457984 DOI: 10.1186/s12859-015-0609-x] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 07/08/2014] [Accepted: 04/30/2015] [Indexed: 12/03/2022]
Abstract
Background: Advances in next-generation sequencing technology have accelerated the pace of individualized medicine (IM), which aims to incorporate genetic/genomic information into medicine. One immediate need in interpreting sequencing data is the assembly of information about genetic variants and their corresponding associations with other entities (e.g., diseases or medications). Even with dedicated effort to capture such information in biological databases, much of this information remains ‘locked’ in the unstructured text of biomedical publications. There is a substantial lag between the publication and the subsequent abstraction of such information into databases. Multiple text mining systems have been developed, but most of them focus on sentence-level association extraction, with performance evaluation based on gold standard text annotations specifically prepared for text mining systems. Results: We developed and evaluated a text mining system, MutD, which extracts protein mutation-disease associations from MEDLINE abstracts by incorporating discourse-level analysis, using a benchmark data set extracted from curated database records. MutD achieves an F-measure of 64.3% for reconstructing protein mutation-disease associations in curated database records. The discourse-level analysis component of MutD contributed a gain of more than 10% in F-measure when compared against sentence-level association extraction. Our error analysis indicates that 23 of the 64 precision errors are true associations that were not captured by database curators and 68 of the 113 recall errors are caused by the absence of associated disease entities in the abstract. After adjusting for the defects in the curated database, the revised F-measure of MutD in association detection reaches 81.5%. Conclusions: Our quantitative analysis reveals that MutD can effectively extract protein mutation-disease associations when benchmarking based on curated database records. The analysis also demonstrates that incorporating discourse-level analysis significantly improved the performance of extracting the protein-mutation-disease association. Future work includes the extension of MutD for full text articles.
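The F-measure figures quoted in this abstract combine precision and recall: precision errors correspond to false positives, and recall errors to false negatives. The following is a minimal Python sketch of the standard computation, F = 2PR / (P + R). The function name `f_measure` and the example counts are hypothetical; they are not taken from the MutD evaluation, whose true-positive count the abstract does not report.

```python
def f_measure(tp, fp, fn):
    """Return (precision, recall, F1) computed from raw counts.

    tp: true positives, fp: false positives (precision errors),
    fn: false negatives (recall errors).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = f_measure(tp=80, fp=20, fn=20)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Reclassifying some errors after manual review, as the abstract describes, changes these counts (a false positive judged correct becomes a true positive), which is why the revised F-measure can be substantially higher than the one measured against the curated records alone.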
Affiliation(s)
- Komandur Elayavilli Ravikumar
- Department of Health Sciences Research, Mayo Clinic College of Medicine, 200 First St SW, Harvick 3rd, Rochester, MN, 55905, USA.
- Kavishwar B Wagholikar
- Department of Health Sciences Research, Mayo Clinic College of Medicine, 200 First St SW, Harvick 3rd, Rochester, MN, 55905, USA.
- Dingcheng Li
- Department of Health Sciences Research, Mayo Clinic College of Medicine, 200 First St SW, Harvick 3rd, Rochester, MN, 55905, USA.
- Jean-Pierre Kocher
- Department of Health Sciences Research, Mayo Clinic College of Medicine, 200 First St SW, Harvick 3rd, Rochester, MN, 55905, USA.
- Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic College of Medicine, 200 First St SW, Harvick 3rd, Rochester, MN, 55905, USA.
|
43
|
Khare R, Good BM, Leaman R, Su AI, Lu Z. Crowdsourcing in biomedicine: challenges and opportunities. Brief Bioinform 2015; 17:23-32. [PMID: 25888696 DOI: 10.1093/bib/bbv021] [Citation(s) in RCA: 71] [Impact Index Per Article: 7.9] [Indexed: 01/06/2023]
Abstract
The use of crowdsourcing to solve important but complex problems in biomedical and clinical sciences is growing and encompasses a wide variety of approaches. The crowd is diverse and includes online marketplace workers, health information seekers, science enthusiasts and domain experts. In this article, we review and highlight recent studies that use crowdsourcing to advance biomedicine. We classify these studies into two broad categories: (i) mining big data generated from a crowd (e.g. search logs) and (ii) active crowdsourcing via specific technical platforms, e.g. labor markets, wikis, scientific games and community challenges. Through describing each study in detail, we demonstrate the applicability of different methods in a variety of domains in biomedical research, including genomics, biocuration and clinical research. Furthermore, we discuss and highlight the strengths and limitations of different crowdsourcing platforms. Finally, we identify important emerging trends, opportunities and remaining challenges for future crowdsourcing research in biomedicine.
|
44
|
The HIV mutation browser: a resource for human immunodeficiency virus mutagenesis and polymorphism data. PLoS Comput Biol 2014; 10:e1003951. [PMID: 25474213 PMCID: PMC4256008 DOI: 10.1371/journal.pcbi.1003951] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 06/26/2014] [Accepted: 09/29/2014] [Indexed: 01/19/2023]
Abstract
Huge research effort has been invested over many years to determine the phenotypes of natural or artificial mutations in HIV proteins; interpretation of mutation phenotypes is an invaluable source of new knowledge. The results of this research effort are recorded in the scientific literature, but it is difficult for virologists to rapidly find them. Manually locating data on phenotypic variation within the approximately 270,000 available HIV-related research articles, or the further 1,500 articles that are published each month, is a daunting task. Accordingly, the HIV research community would benefit from a resource cataloguing the available HIV mutation literature. We have applied computational text-mining techniques to parse and map mutagenesis and polymorphism information from the HIV literature, have enriched the data with ancillary information and have developed a public, web-based interface through which it can be intuitively explored: the HIV mutation browser. The current release of the HIV mutation browser describes the phenotypes of 7,608 unique mutations at 2,520 sites in the HIV proteome, resulting from the analysis of 120,899 papers. The mutation information for each protein is organised in a residue-centric manner and each residue is linked to the relevant experimental literature. The importance of HIV as a global health burden advocates extensive effort to maximise the efficiency of HIV research. The HIV mutation browser provides a valuable new resource for the research community. The HIV mutation browser is available at: http://hivmut.org.
|
45
|
Macintyre G, Jimeno Yepes A, Ong CS, Verspoor K. Associating disease-related genetic variants in intergenic regions to the genes they impact. PeerJ 2014; 2:e639. [PMID: 25374782 PMCID: PMC4217187 DOI: 10.7717/peerj.639] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 05/23/2014] [Accepted: 10/07/2014] [Indexed: 11/20/2022]
Abstract
We present a method to assist in interpretation of the functional impact of intergenic disease-associated SNPs that is not limited to search strategies proximal to the SNP. The method builds on two sources of external knowledge: the growing understanding of three-dimensional spatial relationships in the genome, and the substantial repository of information about relationships among genetic variants, genes, and diseases captured in the published biomedical literature. We integrate chromatin conformation capture data (HiC) with literature support to rank putative target genes of intergenic disease-associated SNPs. We demonstrate that this hybrid method outperforms a genomic distance baseline on a small test set of expression quantitative trait loci, as well as either method individually. In addition, we show the potential for this method to uncover relationships between intergenic SNPs and target genes across chromosomes. With more extensive chromatin conformation capture data becoming readily available, this method provides a way forward towards functional interpretation of SNPs in the context of the three-dimensional structure of the genome in the nucleus.
Affiliation(s)
- Geoff Macintyre
- Department of Computing and Information Systems, The University of Melbourne, VIC, Australia
- Centre for Neural Engineering, The University of Melbourne, VIC, Australia
| | - Antonio Jimeno Yepes
- Department of Computing and Information Systems, The University of Melbourne, VIC, Australia
| | - Cheng Soon Ong
- Department of Electrical and Electronic Engineering, The University of Melbourne, VIC, Australia
- Machine Learning Group, NICTA Canberra Research Laboratory, Australia
- Research School of Computer Science, Australian National University, Australia
| | - Karin Verspoor
- Department of Computing and Information Systems, The University of Melbourne, VIC, Australia
- Health and Biomedical Informatics Centre, The University of Melbourne, VIC, Australia
| |
|
46
|
Burger JD, Doughty E, Khare R, Wei CH, Mishra R, Aberdeen J, Tresner-Kirsch D, Wellner B, Kann MG, Lu Z, Hirschman L. Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing. Database (Oxford) 2014; 2014:bau094. [PMID: 25246425 PMCID: PMC4170591 DOI: 10.1093/database/bau094] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Indexed: 11/23/2022]
Abstract
Background: This article describes capture of biological information using a hybrid approach that combines natural language processing to extract biological entities and crowdsourcing with annotators recruited via Amazon Mechanical Turk to judge correctness of candidate biological relations. These techniques were applied to extract gene–mutation relations from biomedical abstracts with the goal of supporting production-scale capture of gene–mutation–disease findings as an open source resource for personalized medicine. Results: The hybrid system could be configured to provide good performance for gene–mutation extraction (precision ∼82%; recall ∼70% against an expert-generated gold standard) at a cost of $0.76 per abstract. This demonstrates that crowd labor platforms such as Amazon Mechanical Turk can be used to recruit quality annotators, even in an application requiring subject matter expertise; aggregated Turker judgments for gene–mutation relations exceeded 90% accuracy. Over half of the precision errors were due to mismatches against the gold standard hidden from annotator view (e.g. incorrect EntrezGene identifier or incorrect mutation position extracted), or incomplete task instructions (e.g. the need to exclude nonhuman mutations). Conclusions: The hybrid curation model provides a readily scalable cost-effective approach to curation, particularly if coupled with expert human review to filter precision errors. We plan to generalize the framework and make it available as open source software. Database URL: http://www.mitre.org/publications/technical-papers/hybrid-curation-of-gene-mutation-relations-combining-automated
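The abstract above reports precision and recall separately and does not state an F1-score. As a rough illustration only, the two approximate figures can be combined into their harmonic mean; the helper name `f1_score` is an assumption of this sketch, and the result inherits the rounding of the quoted values.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Approximate figures quoted in the abstract (precision ~82%, recall ~70%).
# The F1 value is derived here for illustration; it is not reported in the source.
approx_f1 = f1_score(0.82, 0.70)
print(round(approx_f1, 3))  # 0.755
```

Because both inputs are rounded, the result should be read as roughly 0.75 and not as a measured value.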
Affiliation(s)
- John D Burger
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Emily Doughty
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Ritu Khare
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Chih-Hsuan Wei
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Rajashree Mishra
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - John Aberdeen
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - David Tresner-Kirsch
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Ben Wellner
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Maricel G Kann
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Zhiyong Lu
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| | - Lynette Hirschman
- The MITRE Corporation, Bedford, MA 01730, USA, Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and The University of Maryland, Baltimore County, Baltimore MD 21250, USA
| |
|
47
|
Klein A, Riazanov A, Hindle MM, Baker CJO. Benchmarking infrastructure for mutation text mining. J Biomed Semantics 2014; 5:11. [PMID: 24568600 PMCID: PMC3939821 DOI: 10.1186/2041-1480-5-11] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/22/2013] [Accepted: 02/05/2014] [Indexed: 01/14/2023]
Abstract
Background: Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. Results: We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents that can support mutation grounding and mutation impact extraction experiments. Conclusion: We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption.
Affiliation(s)
- Artjom Klein
- Computer Science And Applied Statistics Department, University of New Brunswick, Saint John, Canada
| | | | - Matthew M Hindle
- Synthetic and Systems Biology, Edinburgh University, Edinburgh, UK
| | - Christopher JO Baker
- Computer Science And Applied Statistics Department, University of New Brunswick, Saint John, Canada
| |
|
48
|
Jimeno Yepes A, Verspoor K. Literature mining of genetic variants for curation: quantifying the importance of supplementary material. Database (Oxford) 2014; 2014:bau003. [PMID: 24520105 PMCID: PMC3920087 DOI: 10.1093/database/bau003] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Indexed: 11/14/2022]
Abstract
A major focus of modern biological research is the understanding of how genomic variation relates to disease. Although there are significant ongoing efforts to capture this understanding in curated resources, much of the information remains locked in unstructured sources, in particular, the scientific literature. Thus, there have been several text mining systems developed to target extraction of mutations and other genetic variation from the literature. We have performed the first study of the use of text mining for the recovery of genetic variants curated directly from the literature. We consider two curated databases, COSMIC (Catalogue Of Somatic Mutations In Cancer) and InSiGHT (International Society for Gastro-intestinal Hereditary Tumours), that contain explicit links to the source literature for each included mutation. Our analysis shows that the recall of the mutations catalogued in the databases using a text mining tool is very low, despite the well-established good performance of the tool and even when the full text of the associated article is available for processing. We demonstrate that this discrepancy can be explained by considering the supplementary material linked to the published articles, not previously considered by text mining tools. Although it is anecdotally known that supplementary material contains 'all of the information', and some researchers have speculated about the role of supplementary material (Schenck et al. Extraction of genetic mutations associated with cancer from public literature. J Health Med Inform 2012;S2:2.), our analysis substantiates the significant extent to which this material is critical. Our results highlight the need for literature mining tools to consider not only the narrative content of a publication but also the full set of material related to a publication.
Affiliation(s)
- Antonio Jimeno Yepes
- National ICT Australia, Victoria Research Laboratory, Melbourne, Australia and Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
| | | |
|
49
|
Jimeno Yepes A, Verspoor K. Mutation extraction tools can be combined for robust recognition of genetic variants in the literature. F1000Res 2014; 3:18. [PMID: 25285203 PMCID: PMC4176422 DOI: 10.12688/f1000research.3-18.v2] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Accepted: 05/27/2014] [Indexed: 11/20/2022]
Abstract
As the cost of genomic sequencing continues to fall, the amount of data being collected and studied for the purpose of understanding the genetic basis of disease is increasing dramatically. Much of the source information relevant to such efforts is available only from unstructured sources such as the scientific literature, and significant resources are expended in manually curating and structuring the information in the literature. As such, there have been a number of systems developed to target automatic extraction of mutations and other genetic variation from the literature using text mining tools. We have performed a broad survey of the existing publicly available tools for extraction of genetic variants from the scientific literature. We consider not just one tool but a number of different tools, individually and in combination, and apply the tools in two scenarios. First, they are compared in an intrinsic evaluation context, where the tools are tested for their ability to identify specific mentions of genetic variants in a corpus of manually annotated papers, the Variome corpus. Second, they are compared in an extrinsic evaluation context based on our previous study of text mining support for curation of the COSMIC and InSiGHT databases. Our results demonstrate that no single tool covers the full range of genetic variants mentioned in the literature. Rather, several tools have complementary coverage and can be used together effectively. In the intrinsic evaluation on the Variome corpus, the combined performance is above 0.95 in F-measure, while in the extrinsic evaluation the combined recall performance is above 0.71 for COSMIC and above 0.62 for InSiGHT, a substantial improvement over the performance of any individual tool. Based on the analysis of these results, we suggest several directions for the improvement of text mining tools for genetic variant extraction from the literature.
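The idea of combining extractors with complementary coverage can be sketched as a simple union of their outputs, scored against a gold set. Everything in this sketch is hypothetical: the two regex extractors, the example sentence, and the gold set are toy stand-ins, not the tools or the corpora evaluated in the study.

```python
import re

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def extract_protein_substitutions(text: str) -> set[str]:
    """Toy extractor: one-letter protein substitutions such as V600E."""
    return set(re.findall(rf"\b[{AMINO}]\d+[{AMINO}]\b", text))


def extract_rsids(text: str) -> set[str]:
    """Toy extractor: dbSNP-style identifiers such as rs113488022."""
    return set(re.findall(r"\brs\d+\b", text))


def combine(text: str, extractors) -> set[str]:
    """Union of the mentions found by every extractor."""
    found: set[str] = set()
    for extractor in extractors:
        found |= extractor(text)
    return found


def precision_recall_f1(predicted: set[str], gold: set[str]):
    """Set-based precision, recall and F-measure."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Invented example sentence and gold annotations, for illustration only.
text = "The BRAF V600E substitution and rs113488022 were reported; G12D was found in KRAS."
gold = {"V600E", "G12D", "rs113488022"}

for name, tools in [
    ("substitutions only", [extract_protein_substitutions]),
    ("rsIDs only", [extract_rsids]),
    ("combined", [extract_protein_substitutions, extract_rsids]),
]:
    print(name, precision_recall_f1(combine(text, tools), gold))
```

In this toy case each extractor alone misses part of the gold set, while the union recovers all three mentions. A plain union can also accumulate false positives from each tool, so precision should be checked alongside recall.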
Affiliation(s)
- Antonio Jimeno Yepes
- National ICT Australia, Victoria Research Laboratory, Melbourne, Australia; Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
| | - Karin Verspoor
- National ICT Australia, Victoria Research Laboratory, Melbourne, Australia; Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
| |
|
50
|
Peterson TA, Doughty E, Kann MG. Towards precision medicine: advances in computational approaches for the analysis of human variants. J Mol Biol 2013; 425:4047-63. [PMID: 23962656 PMCID: PMC3807015 DOI: 10.1016/j.jmb.2013.08.008] [Citation(s) in RCA: 93] [Impact Index Per Article: 8.5] [Received: 06/04/2013] [Revised: 08/07/2013] [Accepted: 08/08/2013] [Indexed: 12/26/2022]
Abstract
Variations and similarities in our individual genomes are part of our history, our heritage, and our identity. Some human genomic variants are associated with common traits such as hair and eye color, while others are associated with susceptibility to disease or response to drug treatment. Identifying the human variations producing clinically relevant phenotypic changes is critical for providing accurate and personalized diagnosis, prognosis, and treatment for diseases. Furthermore, a better understanding of the molecular underpinning of disease can lead to development of new drug targets for precision medicine. Several resources have been designed for collecting and storing human genomic variations in highly structured, easily accessible databases. Unfortunately, a vast amount of information about these genetic variants and their functional and phenotypic associations is currently buried in the literature, only accessible by manual curation or sophisticated text-mining technology to extract the relevant information. In addition, the low cost of sequencing technologies coupled with increasing computational power has enabled the development of numerous computational methodologies to predict the pathogenicity of human variants. This review provides a detailed comparison of current human variant resources, including HGMD, OMIM, ClinVar, and UniProt/Swiss-Prot, followed by an overview of the computational methods and techniques used to leverage the available data to predict novel deleterious variants. We expect these resources and tools to become the foundation for understanding the molecular details of genomic variants leading to disease, which in turn will enable the promise of precision medicine.
Affiliation(s)
- Thomas A Peterson
- Department of Biological Sciences, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA
| | - Emily Doughty
- Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA
| | - Maricel G Kann
- Department of Biological Sciences, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA
| |
|