1
Almeida T, Jonker RAA, Antunes R, Almeida JR, Matos S. Towards discovery: an end-to-end system for uncovering novel biomedical relations. Database (Oxford) 2024; 2024:baae057. [PMID: 38994795 PMCID: PMC11240158 DOI: 10.1093/database/baae057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 05/20/2024] [Accepted: 06/19/2024] [Indexed: 07/13/2024]
Abstract
Biomedical relation extraction is an ongoing challenge within the natural language processing community. Its application is important for understanding scientific biomedical literature, with many use cases, such as drug discovery, precision medicine, disease diagnosis, treatment optimization and biomedical knowledge graph construction. Therefore, the development of a tool capable of effectively addressing this task holds the potential to improve knowledge discovery by automating the extraction of relations from research manuscripts. The first track in the BioCreative VIII competition extended the scope of this challenge by introducing the detection of novel relations within the literature. This paper describes our participating system, which initially focused on jointly extracting and classifying novel relations between biomedical entities. We then describe our subsequent advancement to an end-to-end model. Specifically, we enhanced our initial system by incorporating it into a cascading pipeline that includes a tagger and linker module. This integration enables the comprehensive extraction of relations and classification of their novelty directly from raw text. Our experiments yielded promising results, and our tagger module managed to attain state-of-the-art named entity recognition performance, with a micro F1-score of 90.24, while our end-to-end system achieved a competitive novelty F1-score of 24.59. The code to run our system is publicly available at https://github.com/ieeta-pt/BioNExt. Database URL: https://github.com/ieeta-pt/BioNExt.
Affiliation(s)
- Tiago Almeida
  - IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Richard A A Jonker
  - IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Rui Antunes
  - IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- João R Almeida
  - IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Sérgio Matos
  - IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal

2
Feng X, Ma Z, Yu C, Xin R. MRNDR: Multihead Attention-Based Recommendation Network for Drug Repurposing. J Chem Inf Model 2024; 64:2654-2669. [PMID: 38373300 DOI: 10.1021/acs.jcim.3c01726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2024]
Abstract
As is well-known, the process of developing new drugs is extremely expensive, whereas drug repurposing represents a promising approach to augment the efficiency of new drug development. While this method can indeed spare us from expensive drug toxicity and safety experiments, it still demands a substantial amount of time to carry out precise efficacy experiments for specific diseases, thereby consuming a significant quantity of resources. Therefore, if we can prescreen potential other indications for selected drugs, it could result in substantial cost savings. In light of this, this paper introduces a drug repurposing recommendation model called MRNDR, which stands for Multi-head attention-based Recommendation Network for Drug Repurposing. This model serves as a prediction tool for drug-disease relationships, leveraging the multihead self-attention mechanism that demonstrates robust generalization capabilities. These capabilities stem not only from our extensive million-level training data set, BioRE (Biology Recommended Entity data), but also from the utilization of the WRDS (Weighted Representation Distance Score) algorithm proposed by us. The MRNDR model has achieved new state-of-the-art results on the GP-KG public data set, with an MRR (Mean Reciprocal Rank) score of 0.308 and a Hits@10 score of 0.628. This represents significant improvements of 4.7% (MRR) and 18.1% (Hits@10) over the current best-performing models. Additionally, to further validate the practical utility of the model, we examined results recommended by MRNDR that were not present in the training data set. Some of these recommendations have undergone clinical trials, as evidenced by their presence on ClinicalTrials.gov and the China Clinical Trials Center, indirectly confirming the applicability of MRNDR. The MRNDR model can predict the reusability of candidate drugs, reducing the need for manual expert assessments and enabling efficient drug repurposing.
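The two ranking metrics quoted in this abstract, MRR and Hits@10, can be illustrated with a minimal sketch. The function names and the example ranks below are hypothetical and are not taken from the MRNDR paper; the sketch assumes each test query yields the 1-based rank of the correct disease among all candidates.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Fraction of queries whose correct answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the true disease for five drug queries.
example_ranks = [1, 3, 12, 2, 50]
mrr = mean_reciprocal_rank(example_ranks)
hits10 = hits_at_k(example_ranks, k=10)
print(f"MRR={mrr:.3f}  Hits@10={hits10:.3f}")
```

MRR rewards placing the correct answer near the very top, while Hits@10 only asks whether it appears anywhere in the first ten results, which is why the two scores can move by different amounts.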
Affiliation(s)
- Xin Feng
  - School of Science, Jilin Institute of Chemical Technology, Jilin 130000, P.R. China
  - State Key Laboratory of Inorganic Synthesis and Preparative Chemistry, College of Chemistry, Jilin University, Changchun 130012, P.R. China
  - Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun 130012, P.R. China
- Zhansen Ma
  - College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 130000, P.R. China
- Cuinan Yu
  - College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, P.R. China
- Ruihao Xin
  - College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 130000, P.R. China
  - College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, P.R. China

3
Yao X, He Z, Liu Y, Wang Y, Ouyang S, Xia J. Cancer-Alterome: a literature-mined resource for regulatory events caused by genetic alterations in cancer. Sci Data 2024; 11:265. [PMID: 38431735 PMCID: PMC10908799 DOI: 10.1038/s41597-024-03083-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Accepted: 02/20/2024] [Indexed: 03/05/2024] Open
Abstract
It is vital to investigate the complex mechanisms underlying tumors to better understand cancer and develop effective treatments. Metabolic abnormalities and clinical phenotypes can serve as essential biomarkers for diagnosing this challenging disease. Additionally, genetic alterations provide profound insights into the fundamental aspects of cancer. This study introduces Cancer-Alterome, a literature-mined dataset that focuses on the regulatory events of an organism's biological processes or clinical phenotypes caused by genetic alterations. By proposing and leveraging a text-mining pipeline, we identify 16,681 thousand regulatory event records encompassing 21K genes, 157K genetic alterations and 154K downstream bio-concepts, extracted from 4,354K pan-cancer articles. The resulting dataset empowers a multifaceted investigation of cancer pathology, enabling the meticulous tracking of relevant literature support. Its potential applications extend to evidence-based medicine and precision medicine, yielding valuable insights for further advancements in cancer research.
Affiliation(s)
- Xinzhi Yao
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Zhihan He
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Yawen Liu
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Yuxing Wang
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, P.R. China
- Sizhuo Ouyang
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Jingbo Xia
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China

4
Preston S, Wei M, Rao R, Tinn R, Usuyama N, Lucas M, Gu Y, Weerasinghe R, Lee S, Piening B, Tittel P, Valluri N, Naumann T, Bifulco C, Poon H. Toward structuring real-world data: Deep learning for extracting oncology information from clinical text with patient-level supervision. Patterns (New York, N.Y.) 2023; 4:100726. [PMID: 37123439 PMCID: PMC10140604 DOI: 10.1016/j.patter.2023.100726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Revised: 11/11/2022] [Accepted: 03/14/2023] [Indexed: 05/02/2023]
Abstract
Most detailed patient information in real-world data (RWD) is only consistently available in free-text clinical documents. Manual curation is expensive and time consuming. Developing natural language processing (NLP) methods for structuring RWD is thus essential for scaling real-world evidence generation. We propose leveraging patient-level supervision from medical registries, which are often readily available and capture key patient information, for general RWD applications. We conduct an extensive study on 135,107 patients from the cancer registry of a large integrated delivery network (IDN) comprising healthcare systems in five western US states. Our deep-learning methods attain test area under the receiver operating characteristic curve (AUROC) values of 94%-99% for key tumor attributes and comparable performance on held-out data from separate health systems and states. Ablation results demonstrate the superiority of these advanced deep-learning methods. Error analysis shows that our NLP system sometimes even corrects errors in registrar labels.
Affiliation(s)
- Mu Wei
  - Microsoft Research, Redmond, WA, USA
- Yu Gu
  - Microsoft Research, Redmond, WA, USA
- Soohee Lee
  - Providence St Joseph’s Health, Portland, OR, USA
- Brian Piening
  - Providence Genomics & Earle A. Chiles Research Institute, Portland, OR, USA
- Paul Tittel
  - Providence Genomics & Earle A. Chiles Research Institute, Portland, OR, USA
- Carlo Bifulco
  - Providence Genomics & Earle A. Chiles Research Institute, Portland, OR, USA
  - Corresponding author
- Hoifung Poon
  - Microsoft Research, Redmond, WA, USA
  - Corresponding author

5
Tinn R, Cheng H, Gu Y, Usuyama N, Liu X, Naumann T, Gao J, Poon H. Fine-tuning large neural language models for biomedical natural language processing. Patterns (New York, N.Y.) 2023; 4:100729. [PMID: 37123444 PMCID: PMC10140607 DOI: 10.1016/j.patter.2023.100729] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2022] [Revised: 12/12/2022] [Accepted: 03/17/2023] [Indexed: 05/02/2023]
Abstract
Large neural language models have transformed modern natural language processing (NLP) applications. However, fine-tuning such models for specific tasks remains challenging as model size increases, especially with small labeled datasets, which are common in biomedical NLP. We conduct a systematic study on fine-tuning stability in biomedical NLP. We show that fine-tuning performance may be sensitive to pretraining settings and conduct an exploration of techniques for addressing fine-tuning instability. We show that these techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications. Specifically, freezing lower layers is helpful for standard BERT-BASE models, while layerwise decay is more effective for BERT-LARGE and ELECTRA models. For low-resource text similarity tasks, such as BIOSSES, reinitializing the top layers is the optimal strategy. Overall, domain-specific vocabulary and pretraining facilitate robust models for fine-tuning. Based on these findings, we establish a new state of the art on a wide range of biomedical NLP applications.
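Two of the stabilization techniques named in this abstract, layerwise decay and freezing lower layers, can be sketched as per-layer learning-rate schedules. The abstract does not give formulas, so the following assumes one common formulation (the top layer keeps the base rate and each layer below is scaled by a constant factor); the function names, layer count and values are hypothetical.

```python
import math
from typing import List


def layerwise_learning_rates(num_layers: int, base_lr: float, decay: float) -> List[float]:
    """Index 0 is the lowest layer; the top layer gets base_lr and each
    layer below it is scaled by one more factor of `decay`."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def freeze_lower_layers(rates: List[float], num_frozen: int) -> List[float]:
    """Set the learning rate of the lowest `num_frozen` layers to zero."""
    return [0.0 if i < num_frozen else lr for i, lr in enumerate(rates)]


# Hypothetical 4-layer encoder for illustration only.
rates = layerwise_learning_rates(num_layers=4, base_lr=1e-4, decay=0.5)
frozen = freeze_lower_layers(rates, num_frozen=2)
print(rates)
print(frozen)
```

In a real training loop these values would be attached to per-layer parameter groups of the optimizer; the sketch only shows how the rates themselves are derived.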
Affiliation(s)
- Hao Cheng
  - Microsoft Research, Redmond, WA, USA
- Yu Gu
  - Microsoft Research, Redmond, WA, USA
- Hoifung Poon
  - Microsoft Research, Redmond, WA, USA
  - Corresponding author

6
Chen Q, Du J, Allot A, Lu Z. LitMC-BERT: Transformer-Based Multi-Label Classification of Biomedical Literature With An Application on COVID-19 Literature Curation. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:2584-2595. [PMID: 35536809 PMCID: PMC9647722 DOI: 10.1109/tcbb.2022.3173562] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/11/2021] [Revised: 04/19/2022] [Accepted: 04/22/2022] [Indexed: 05/20/2023]
Abstract
The rapid growth of biomedical literature poses a significant challenge for curation and interpretation. This has become more evident during the COVID-19 pandemic. LitCovid, a literature database of COVID-19 related papers in PubMed, has accumulated over 200,000 articles with millions of accesses. Approximately 10,000 new articles are added to LitCovid every month. A main curation task in LitCovid is topic annotation, where an article is assigned up to eight topics, e.g., Treatment and Diagnosis. The annotated topics have been widely used both in LitCovid (e.g., accounting for ∼18% of total uses) and in downstream studies such as network generation. However, it has been a primary curation bottleneck due to the nature of the task and the rapid literature growth. This study proposes LITMC-BERT, a transformer-based multi-label classification method for biomedical literature. It uses a shared transformer backbone for all the labels while also capturing label-specific features and the correlations between label pairs. We compare LITMC-BERT with three baseline models on two datasets. Its micro-F1 and instance-based F1 are 5% and 4% higher than the current best results, respectively, and it requires only ∼18% of the inference time of the Binary BERT baseline. The related datasets and models are available via https://github.com/ncbi/ml-transformer.
7
Chen Q, Allot A, Leaman R, Islamaj R, Du J, Fang L, Wang K, Xu S, Zhang Y, Bagherzadeh P, Bergler S, Bhatnagar A, Bhavsar N, Chang YC, Lin SJ, Tang W, Zhang H, Tavchioski I, Pollak S, Tian S, Zhang J, Otmakhova Y, Yepes AJ, Dong H, Wu H, Dufour R, Labrak Y, Chatterjee N, Tandon K, Laleye FAA, Rakotoson L, Chersoni E, Gu J, Friedrich A, Pujari SC, Chizhikova M, Sivadasan N, VG S, Lu Z. Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations. Database (Oxford) 2022; 2022:baac069. [PMID: 36043400 PMCID: PMC9428574 DOI: 10.1093/database/baac069] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 04/20/2022] [Revised: 08/02/2022] [Accepted: 08/13/2022] [Indexed: 05/03/2023]
Abstract
The coronavirus disease 2019 (COVID-19) pandemic has been severely impacting global society since December 2019. The related findings, such as vaccine and drug development, have been reported in biomedical literature at a rate of about 10 000 articles on COVID-19 per month. Such rapid growth significantly challenges manual curation and interpretation. For instance, LitCovid is a literature database of COVID-19-related articles in PubMed, which has accumulated more than 200 000 articles with millions of accesses each month by users worldwide. One primary curation task is to assign up to eight topics (e.g. Diagnosis and Treatment) to the articles in LitCovid. The annotated topics have been widely used for navigating the COVID literature, rapidly locating articles of interest and other downstream studies. However, annotating the topics has been the bottleneck of manual curation. Despite the continuing advances in biomedical text-mining methods, few have been dedicated to topic annotations in COVID-19 literature. To close the gap, we organized the BioCreative LitCovid track to call for a community effort to tackle automated topic annotation for COVID-19 literature. The BioCreative LitCovid dataset, consisting of over 30 000 articles with manually reviewed topics, was created for training and testing. It is one of the largest multi-label classification datasets in biomedical scientific literature. Nineteen teams worldwide participated and made 80 submissions in total. Most teams used hybrid systems based on transformers. The highest performing submissions achieved 0.8875, 0.9181 and 0.9394 for macro-F1-score, micro-F1-score and instance-based F1-score, respectively. Notably, these scores are substantially higher (e.g. 12% higher for macro F1-score) than the corresponding scores of the state-of-the-art multi-label classification method. The level of participation and results demonstrate a successful track and help close the gap between dataset curation and method development. The dataset is publicly available via https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/ for benchmarking and further development. Database URL https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/.
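The three scores reported for this track (macro-, micro- and instance-based F1) differ only in how counts are aggregated. The sketch below uses the usual textbook definitions, which are assumed here rather than taken from the track's evaluation script; the topic sets are made-up examples.

```python
from typing import Dict, List, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    labels = sorted(set().union(*gold, *pred))
    per_label = []
    tot_tp = tot_fp = tot_fn = 0
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if lab in g and lab in p)
        fp = sum(1 for g, p in zip(gold, pred) if lab not in g and lab in p)
        fn = sum(1 for g, p in zip(gold, pred) if lab in g and lab not in p)
        per_label.append(f1(tp, fp, fn))
        tot_tp, tot_fp, tot_fn = tot_tp + tp, tot_fp + fp, tot_fn + fn
    # Macro: average the per-label F1 scores (every label weighs the same).
    macro = sum(per_label) / len(per_label) if per_label else 0.0
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(tot_tp, tot_fp, tot_fn)
    # Instance-based: compute F1 per article, then average over articles.
    instance = sum(f1(len(g & p), len(p - g), len(g - p))
                   for g, p in zip(gold, pred)) / len(gold)
    return {"macro": macro, "micro": micro, "instance": instance}


# Hypothetical topic annotations for three articles.
gold = [{"Treatment", "Diagnosis"}, {"Prevention"}, {"Treatment"}]
pred = [{"Treatment"}, {"Prevention"}, {"Treatment", "Diagnosis"}]
scores = multilabel_f1(gold, pred)
print(scores)
```

Macro F1 is the most sensitive to rare topics because a label that is always missed drags the average down as much as a frequent one, which is consistent with macro F1 being the lowest of the three reported scores.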
Affiliation(s)
- Qingyu Chen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, Bethesda 20892, USA
- Alexis Allot
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, Bethesda 20892, USA
- Robert Leaman
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, Bethesda 20892, USA
- Rezarta Islamaj
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, Bethesda 20892, USA
- Jingcheng Du
  - School of Biomedical Informatics, UT Health, TX, Houston 77030, USA
- Li Fang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Kai Wang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Shuo Xu
  - College of Economics and Management, Beijing University of Technology, Beijing, QC, China
- Yuefu Zhang
  - College of Economics and Management, Beijing University of Technology, Beijing, QC, China
- Yung-Chun Chang
  - Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
- Sheng-Jie Lin
  - Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
- Wentai Tang
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Hongtong Zhang
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Ilija Tavchioski
  - Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
  - Jožef Stefan Institute, Ljubljana, Slovenia
- Shubo Tian
  - Department of Statistics, Florida State University, Tallahassee, FL, USA
- Jinfeng Zhang
  - Department of Statistics, Florida State University, Tallahassee, FL, USA
- Yulia Otmakhova
  - School of Computing and Information Systems, University of Melbourne, Melbourne, AU-VIC, Australia
- Hang Dong
  - Centre for Medical Informatics, Usher Institute, University of Edinburgh, Edinburgh, UK
- Honghan Wu
  - Institute of Health Informatics, University College London, London, UK
- Niladri Chatterjee
  - Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
- Kushagri Tandon
  - Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India
- Emmanuele Chersoni
  - Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
- Jinghang Gu
  - Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
- Subhash Chandra Pujari
  - Institute of Computer Science, Heidelberg University, Heidelberg, Germany
  - Bosch Center for Artificial Intelligence, Renningen, Germany
- Mariia Chizhikova
  - SINAI Group, Department of Computer Science, Advanced Studies Center in ICT (CEATIC), Universidad de Jaén, Jaén, Spain
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD, Bethesda 20892, USA

8
Boosting biomedical document classification through the use of domain entity recognizers and semantic ontologies for document representation: The case of gluten bibliome. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.10.100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
9
Church K, Liu B. Acronyms and Opportunities for Improving Deep Nets. Front Artif Intell 2022; 4:732381. [PMID: 34988434 PMCID: PMC8721666 DOI: 10.3389/frai.2021.732381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2021] [Accepted: 10/21/2021] [Indexed: 11/13/2022] Open
Abstract
Recently, several studies have reported promising results with BERT-like methods on acronym tasks. In this study, we find an older rule-based program, Ab3P, not only performs better, but error analysis suggests why. There is a well-known spelling convention in acronyms where each letter in the short form (SF) refers to “salient” letters in the long form (LF). The error analysis uses decision trees and logistic regression to show that there is an opportunity for many pre-trained models (BERT, T5, BioBert, BART, ERNIE) to take advantage of this spelling convention.
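The spelling convention described in this abstract can be approximated by a simple character check. The sketch below is not Ab3P and not the authors' error-analysis code; it is a hypothetical heuristic that treats "salient" letters loosely, requiring only that the short form starts with the same letter as the long form and that its letters occur in order within the long form.

```python
def matches_spelling_convention(short_form: str, long_form: str) -> bool:
    """Return True if every letter of the short form can be found, in order,
    in the long form, and both start with the same letter (case-insensitive)."""
    sf = [c.lower() for c in short_form if c.isalnum()]
    lf = long_form.lower()
    if not sf or not lf or sf[0] != lf[0]:
        return False
    pos = 0
    for ch in sf:
        pos = lf.find(ch, pos)  # next occurrence at or after the cursor
        if pos == -1:
            return False
        pos += 1
    return True


# Illustrative pairs; the first two come from the abbreviations used above.
print(matches_spelling_convention("SF", "short form"))
print(matches_spelling_convention("LF", "long form"))
print(matches_spelling_convention("BERT", "short form"))
```

A check this permissive accepts many false pairs, so it is better read as a cheap feature that a classifier could use than as a complete acronym matcher.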
Affiliation(s)
- Boxiang Liu
  - Baidu Research, Sunnyvale, CA, United States

10
Yan N, Huang S, Kong C. Extracting Entity Synonymous Relations via Context-Aware Permutation Invariance. International Journal of Information Technology and Web Engineering 2022. [DOI: 10.4018/ijitwe.288039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
Abstract
Discovering entity synonymous relations is important work for many entity-based applications. Existing entity synonymous relation extraction approaches are mainly based on lexical patterns or distributional corpus-level statistics, ignoring the context semantics between entities. For example, the contexts around "apple" determine whether "apple" is a kind of fruit or Apple Inc. In this paper, an entity synonymous relation extraction approach is proposed using context-aware permutation invariance. Specifically, a triplet network is used to obtain the permutation invariance between the entities to learn whether two given entities possess a synonymous relation. To track more synonymous features, the relational context semantics and entity representations are integrated into the triplet network, which can improve the performance of extracting entity synonymous relations. The proposed approach is implemented on three real-world datasets. Experimental results demonstrate that the approach performs better than the other compared approaches on the entity synonymous relation extraction task.
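The abstract does not state the training objective of its triplet network, so the following is only a sketch of the standard triplet margin loss that such networks commonly use. The Euclidean distance, the margin value and the toy two-dimensional vectors are assumptions for illustration, not details from the paper.

```python
import math
from typing import Sequence


def triplet_margin_loss(anchor: Sequence[float],
                        positive: Sequence[float],
                        negative: Sequence[float],
                        margin: float = 1.0) -> float:
    """max(0, d(anchor, positive) - d(anchor, negative) + margin),
    with d the Euclidean distance between embedding vectors."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings: the synonym sits close to the anchor, the non-synonym far away.
anchor = [0.0, 0.0]
synonym = [0.0, 1.0]
non_synonym = [3.0, 4.0]

well_separated = triplet_margin_loss(anchor, synonym, non_synonym)
violated = triplet_margin_loss(anchor, non_synonym, synonym)
print(well_separated, violated)
```

The loss is zero once the synonym is closer to the anchor than the non-synonym by at least the margin, and grows linearly when that ordering is violated.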
Affiliation(s)
- Nan Yan
  - Anhui Polytechnic University, China

11
Zhu T, Qin Y, Xiang Y, Hu B, Chen Q, Peng W. Distantly supervised biomedical relation extraction using piecewise attentive convolutional neural network and reinforcement learning. J Am Med Inform Assoc 2021; 28:2571-2581. [PMID: 34524450 DOI: 10.1093/jamia/ocab176] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/11/2021] [Revised: 07/08/2021] [Accepted: 08/06/2021] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVE There have been various methods to deal with the erroneous training data in distantly supervised relation extraction (RE); however, their performance is still far from satisfactory. We aimed to deal with the insufficient modeling problem on instance-label correlations for predicting biomedical relations using deep learning and reinforcement learning. MATERIALS AND METHODS In this study, a new computational model called piecewise attentive convolutional neural network and reinforcement learning (PACNN+RL) was proposed to perform RE on distantly supervised data generated from the Unified Medical Language System with MEDLINE abstracts and benchmark datasets. In PACNN+RL, PACNN was introduced to encode semantic information of biomedical text, and the RL method with a memory backtracking mechanism was leveraged to alleviate the erroneous data issue. Extensive experiments were conducted on 4 biomedical RE tasks. RESULTS The proposed PACNN+RL model achieved competitive performance on 8 biomedical corpora, outperforming most baseline systems. Specifically, PACNN+RL outperformed all baseline methods with an F1-score of 0.5592 on the may-prevent dataset, 0.6666 on the may-treat dataset, and 0.3838 on the 2011 DDI corpus. For the protein-protein interaction RE task, we obtained new state-of-the-art performance on 4 out of 5 benchmark datasets. CONCLUSIONS The performance on many distantly supervised biomedical RE tasks was substantially improved, primarily owing to the denoising effect of the proposed model. It is anticipated that PACNN+RL will become a useful tool for large-scale RE and other downstream tasks to facilitate biomedical knowledge acquisition. We also made the demonstration program and source code publicly available at http://112.74.48.115:9000/.
Affiliation(s)
- Tiantian Zhu
  - Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
  - Department of Network Intelligence, Peng Cheng Laboratory, Shenzhen, China
- Yang Qin
  - Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
- Yang Xiang
  - Department of Network Intelligence, Peng Cheng Laboratory, Shenzhen, China
- Baotian Hu
  - Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
- Qingcai Chen
  - Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
  - Department of Network Intelligence, Peng Cheng Laboratory, Shenzhen, China
- Weihua Peng
  - Department of Knowledge Graph, Baidu International Technology (Shenzhen), Shenzhen, China

12
Grissette H, Nfaoui EH. Affective Concept-Based Encoding of Patient Narratives via Sentic Computing and Neural Networks. Cognit Comput 2021; 14:274-299. [PMID: 34422122 PMCID: PMC8371039 DOI: 10.1007/s12559-021-09903-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2020] [Accepted: 06/23/2021] [Indexed: 11/30/2022]
Abstract
The automatic generation of features without human intervention is the most critical task for biomedical sentiment analysis. Regarding the high dynamicity of shared patient narrative data, the lack of formal medical language sentiment dictionaries prevents retrieval of the appropriate sentiment, which is unapproachable and can be prone to annotator bias. We propose a novel affective biomedical concept-based encoding via sentic computing and neural networks. The main contributions include four aspects. First, a biomedical embedding, in which a medical entity is defined, normalized, and synthesized from a text, is built using online patient narratives after being combined with label propagation from a widely used comprehensive biomedical vocabulary. Second, considering the dependence on biomedical definitions, drug reaction sample selection based on general matching is suggested. These feature settings are then used to build and recognize affective semantics and sentics based on an extreme learning machine. Finally, a semisupervised LSTM-BiLSTM model for biomedical sentiment analysis is constructed. There was a massive influx of patient self-reports related to the COVID-19 pandemic. A study was conducted in this direction, and we tested the validity, medical language familiarity, and transferability of our approach by analyzing millions of COVID-19 tweets. Comparisons to affective lexicons also indicate that integrating extreme learning machine cognitive capabilities has advantages over biomedical sentiment analysis. By considering sentics vectors on top of the formed embeddings, our semisupervised LSTM-BiLSTM achieved an accuracy of 87.5%. The evaluations of unsupervised learning approximated the results of the previous model when dealing with a serious loss of biomedical data. 
In this paper, we demonstrate the effectiveness of integrating deep-learning-based cognitive capabilities for both enhancing distributed biomedical definitions and inferring sentiment compositions from many patient self-reports on social networks. The relevant encoding of affective information conveyed regarding medication subjects clearly reveals defined roles and expectations that can have a positive impact on public health.
Affiliation(s)
- Hanane Grissette
  - LISAC Laboratory, Faculty of Sciences Dhar EL Mahraz, Sidi Mohamed Ben Abdellah University, Fez, Morocco
- El Habib Nfaoui
  - LISAC Laboratory, Faculty of Sciences Dhar EL Mahraz, Sidi Mohamed Ben Abdellah University, Fez, Morocco

13
Chen Q, Leaman R, Allot A, Luo L, Wei CH, Yan S, Lu Z. Artificial Intelligence in Action: Addressing the COVID-19 Pandemic with Natural Language Processing. Annu Rev Biomed Data Sci 2021; 4:313-339. [PMID: 34465169 DOI: 10.1146/annurev-biodatasci-021821-061045] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 11/09/2022]
Abstract
The COVID-19 (coronavirus disease 2019) pandemic has had a significant impact on society, both because of the serious health effects of COVID-19 and because of public health measures implemented to slow its spread. Many of these difficulties are fundamentally information needs; attempts to address these needs have caused an information overload for both researchers and the public. Natural language processing (NLP), the branch of artificial intelligence that interprets human language, can be applied to address many of the information needs made urgent by the COVID-19 pandemic. This review surveys approximately 150 NLP studies and more than 50 systems and datasets addressing the COVID-19 pandemic. We detail work on four core NLP tasks: information retrieval, named entity recognition, literature-based discovery, and question answering. We also describe work that directly addresses aspects of the pandemic through four additional tasks: topic modeling, sentiment and emotion analysis, caseload forecasting, and misinformation detection. We conclude by discussing observable trends and remaining challenges.
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Robert Leaman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Alexis Allot
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Ling Luo
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Shankai Yan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| |
|
14
|
Turina P, Fariselli P, Capriotti E. ThermoScan: Semi-automatic Identification of Protein Stability Data From PubMed. Front Mol Biosci 2021; 8:620475. [PMID: 33842537 PMCID: PMC8027235 DOI: 10.3389/fmolb.2021.620475] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/23/2020] [Accepted: 02/18/2021] [Indexed: 11/13/2022] Open
Abstract
In recent years, the increasing number of DNA sequencing and protein mutagenesis studies has generated a large amount of variation data published in the biomedical literature. The collection of such data has been essential for the development and assessment of tools predicting the impact of protein variants at functional and structural levels. Nevertheless, the collection of manually curated data from literature is a highly time-consuming and costly process that requires domain experts. In particular, the development of methods for predicting the effect of amino acid variants on protein stability relies on the thermodynamic data extracted from literature. In the past, such data were deposited in the ProTherm database, which, however, has not been maintained since 2013. To facilitate the collection of protein thermodynamic data from literature, we developed the semi-automatic tool ThermoScan. ThermoScan is a text mining approach for the identification of relevant thermodynamic data on protein stability from full-text articles. The method relies on a regular expression searching for groups of words, including the most common conceptual words appearing in experimental studies on protein stability, several thermodynamic variables, and their units of measure. ThermoScan analyzes full-text articles from the PubMed Central Open Access subset and calculates an empirical score that allows the identification of manuscripts reporting thermodynamic data on protein stability. The method was optimized on a set of publications included in the ProTherm database, and tested on a new curated set of articles, manually selected for presence of thermodynamic data. The results show that ThermoScan returns accurate predictions and outperforms recently developed text-mining algorithms based on the analysis of publication abstracts. Availability: The ThermoScan server is freely accessible online at https://folding.biofold.org/thermoscan.
The ThermoScan Python code and the Google Chrome extension for submitting visualized PMC web pages to the ThermoScan server are available at https://github.com/biofold/ThermoScan.
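As a rough illustration of the kind of regular-expression scoring the ThermoScan abstract describes (groups of conceptual words, thermodynamic variables, and units of measure combined into an empirical score), a minimal sketch might look like the following. The term lists, the weights, and the function name `score_text` are assumptions made for illustration; they are not taken from the ThermoScan code, whose actual vocabulary and scoring formula are not given in the abstract.

```python
import re

# Hypothetical vocabularies: illustrative stand-ins, not ThermoScan's real term lists.
CONCEPT_WORDS = re.compile(
    r"\b(?:unfolding|denaturation|thermal stability|melting temperature)\b",
    re.IGNORECASE,
)
VARIABLES = re.compile(r"ΔΔG|ΔG|ΔH|\bTm\b")
UNITS = re.compile(r"kcal/mol|kJ/mol|°C")


def score_text(text, weights=(1.0, 2.0, 2.0)):
    """Return a weighted count of matches from the three term groups.

    The weights are arbitrary assumptions; a real tool would tune them
    on a curated set of articles.
    """
    counts = (
        len(CONCEPT_WORDS.findall(text)),
        len(VARIABLES.findall(text)),
        len(UNITS.findall(text)),
    )
    return sum(w * c for w, c in zip(weights, counts))


sentence = "The ΔΔG of unfolding was 1.2 kcal/mol and Tm decreased by 3 °C."
print(score_text(sentence))
```

Articles whose full text scores above some threshold would then be flagged for manual review; choosing that threshold is the part that requires a curated optimization set such as the one the abstract mentions.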
Affiliation(s)
- Paola Turina
- Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Bologna, Italy
| | - Piero Fariselli
- Department of Medical Sciences, University of Torino, Torino, Italy
| | - Emidio Capriotti
- Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Bologna, Italy
| |
|
15
|
Guo S, Huang L, Yao G, Wang Y, Guan H, Bai T. Extracting Biomedical Entity Relations using Biological Interaction Knowledge. Interdiscip Sci 2021; 13:312-320. [PMID: 33730356 DOI: 10.1007/s12539-021-00425-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/11/2021] [Revised: 02/24/2021] [Accepted: 03/05/2021] [Indexed: 10/21/2022]
Abstract
Discovering relations between cross-type biomedical entities is crucial for biology research. A large number of potential or indirectly connected biological relations are hidden in millions of biomedical publications and biological databases. Previous rule-based and deep learning approaches rely on large numbers of manual annotations, which makes them laborious, time-consuming and unsatisfactory. It is necessary to be able to combine available annotated gene databases, chemical, genomic, clinical and other types of data repositories as domain knowledge to assist the extraction of biological entity relations from the literature. Under this scenario, this paper proposes the BioGraphSAGE model, a Siamese graph neural network that uses structured databases as domain knowledge to extract biological entity relations from the literature. Our model combines both biological semantic features and positional features to improve the recognition of relations between distant entities in the same document. The experimental results show that BioGraphSAGE achieves the best F1 score among the compared relation extraction models on smaller annotated samples. Moreover, the proposed model can still maintain an F1 score of 0.526 without using annotated training samples.
Affiliation(s)
- Shuyu Guo
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
| | - Lan Huang
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
| | - Gang Yao
- Department of Neurology, the Second Hospital of Jilin University, Changchun, 130012, China
| | - Ye Wang
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
| | - Haotian Guan
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
| | - Tian Bai
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
| |
|
16
|
Kaushik V, Plazzer J, Macrae F. Evaluation of literature searching tools for curation of mismatch repair gene variants in hereditary colon cancer. Advanced Genetics (Hoboken, N.J.) 2021; 2:e10039. [PMID: 36618447 PMCID: PMC9744508 DOI: 10.1002/ggn2.10039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/28/2020] [Revised: 01/12/2021] [Accepted: 01/14/2021] [Indexed: 01/11/2023]
Abstract
Pathogenic constitutional genomic variants in the mismatch repair (MMR) genes are the drivers of Lynch syndrome; optimal variant interpretation is required for the management of suspected and confirmed cases. The International Society for Hereditary Gastrointestinal Tumours (InSiGHT) provides expert classifications for MMR variants for the US National Human Genome Research Institute's (NHGRI) ClinGen initiative and interprets variants with discordant classifications and those of uncertain significance (VUSs). Given the onerous nature of extracting information related to variants, literature searching tools which harness artificial intelligence may aid in retrieving information to allow optimum variant classification. In this study, we described the nature of discordance in a sample of 80 variants from a list of variants requiring updating by InSiGHT for ClinGen by comparing their existing InSiGHT classifications with the various submissions for each variant on the US National Centre for Biotechnology Information's (NCBI) ClinVar database. To identify the potential value of a literature searching tool in extracting information related to classification, all variants were searched for using a traditional method (Google Scholar) and a literature searching tool (Mastermind) independently. Descriptive statistics were used to compare the number of articles before and after screening for relevance and the number of relevant articles unique to either method. Relevance was defined as containing the variant in question as well as data informing variant interpretation. A total of 916 articles were returned by both methods, and Mastermind averaged four relevant articles per search compared to Google Scholar's three. Of relevant Mastermind articles, 193/308 (62.7%) were unique to it, compared to 87/202 (43.0%) for Google Scholar. For 24 variants, either or both methods found no information.
All 6/80 (20%) variants with pathogenic or likely pathogenic InSiGHT classifications have newer VUS assertions on ClinVar. Our study demonstrated that, for a sample of variants with varying discordant interpretations, Mastermind was able to return, on average, a more relevant and unique literature search. Google Scholar was able to retrieve information that Mastermind did not, which supports a conclusion that Mastermind could play a complementary role in literature searching for classification. This work will aid InSiGHT in its role of classifying MMR variants.
Affiliation(s)
- Varun Kaushik
- Melbourne Medical School, The University of Melbourne, Parkville, Victoria, Australia; Department of Colorectal Medicine and Genetics, The Royal Melbourne Hospital, Parkville, Victoria, Australia
| | - John‐Paul Plazzer
- Department of Colorectal Medicine and Genetics, The Royal Melbourne Hospital, Parkville, Victoria, Australia
| | - Finlay Macrae
- Department of Colorectal Medicine and Genetics, The Royal Melbourne Hospital, Parkville, Victoria, Australia; Department of Medicine, The Royal Melbourne Hospital, The University of Melbourne, Parkville, Victoria, Australia
| |
|
17
|
Tworowski D, Gorohovski A, Mukherjee S, Carmi G, Levy E, Detroja R, Mukherjee SB, Frenkel-Morgenstern M. COVID19 Drug Repository: text-mining the literature in search of putative COVID19 therapeutics. Nucleic Acids Res 2021; 49:D1113-D1121. [PMID: 33166390 PMCID: PMC7778969 DOI: 10.1093/nar/gkaa969] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 08/25/2020] [Revised: 10/07/2020] [Accepted: 11/04/2020] [Indexed: 12/12/2022] Open
Abstract
The recent outbreak of COVID-19 has generated an enormous amount of Big Data. To date, the COVID-19 Open Research Dataset (CORD-19) lists ∼130,000 articles from the WHO COVID-19 database, PubMed Central, medRxiv, and bioRxiv, as collected by Semantic Scholar. According to LitCovid (11 August 2020), ∼40,300 COVID19-related articles are currently listed in PubMed. It has been shown in clinical settings that the analysis of past research results and the mining of available data can provide novel opportunities for the successful application of currently approved therapeutics and their combinations for the treatment of conditions caused by a novel SARS-CoV-2 infection. As such, effective responses to the pandemic require the development of efficient applications, methods and algorithms for data navigation, text-mining, clustering, classification, analysis, and reasoning. Thus, our COVID19 Drug Repository represents a modular platform for drug data navigation and analysis, with an emphasis on COVID-19-related information currently being reported. The COVID19 Drug Repository enables users to focus on different levels of complexity, starting from general information about (FDA-) approved drugs, PubMed references, clinical trials, recipes as well as the descriptions of molecular mechanisms of drugs' action. Our COVID19 Drug Repository provides an up-to-date worldwide collection of drugs that have been repurposed for COVID19 treatment.
Affiliation(s)
- Dmitry Tworowski
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
| | - Alessandro Gorohovski
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
| | - Sumit Mukherjee
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
| | - Gon Carmi
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
| | - Eliad Levy
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
| | - Rajesh Detroja
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
| | - Sunanda Biswas Mukherjee
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
| | - Milana Frenkel-Morgenstern
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
| |
|
18
|
Arango-Argoty GA, Guron GKP, Garner E, Riquelme MV, Heath LS, Pruden A, Vikesland PJ, Zhang L. ARGminer: a web platform for the crowdsourcing-based curation of antibiotic resistance genes. Bioinformatics 2020; 36:2966-2973. [PMID: 32058567 DOI: 10.1093/bioinformatics/btaa095] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 03/21/2019] [Revised: 01/31/2020] [Accepted: 02/08/2020] [Indexed: 12/20/2022] Open
Affiliation(s)
| | - G K P Guron
- Department of Civil and Environmental Engineering; Department of Food Science and Technology, Virginia Tech, Blacksburg, VA 24061-0217, USA
| | - E Garner
- Department of Civil and Environmental Engineering
| | - M V Riquelme
- Department of Civil and Environmental Engineering
| | - A Pruden
- Department of Civil and Environmental Engineering
| | - L Zhang
- Department of Computer Science
| |
|
19
|
Méar L, Herr M, Fauconnier A, Pineau C, Vialard F. Polymorphisms and endometriosis: a systematic review and meta-analyses. Hum Reprod Update 2020; 26:73-102. [PMID: 31821471 DOI: 10.1093/humupd/dmz034] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 04/10/2019] [Revised: 08/20/2019] [Accepted: 08/28/2019] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Endometriosis is an estrogen-dependent gynecological disorder that affects at least 10% of women of reproductive age. It may lead to infertility and non-specific symptoms such as chronic pelvic pain. Endometriosis screening and diagnosis are difficult and time-consuming. Late diagnosis (with a delay ranging from 3.3 to 10.7 years) is a major problem and may contribute to disease progression and a worse response to treatment once initiated. Efficient screening tests might reduce this diagnostic delay. As endometriosis is presumed to be a complex disease with several genetic and non-genetic pathogenic factors, many researchers have sought to identify polymorphisms that predispose to this condition. OBJECTIVE AND RATIONALE We performed a systematic review and meta-analysis of the most regularly reported polymorphisms in order to identify those that might predispose to endometriosis and might thus be of value in screening. SEARCH METHODS The MEDLINE database was searched for English-language publications on DNA polymorphisms in endometriosis, with no date restriction. The PubTator text mining tool was used to extract gene names from the selected publications' abstracts. We only selected polymorphisms reported by at least three studies, having applied strict inclusion and exclusion criteria to their control populations. No stratification based on ethnicity was performed. All steps were carried out according to PRISMA guidelines. OUTCOMES The initial selection of 395 publications cited 242 different genes. Sixty-two genes (corresponding to 265 different polymorphisms) were cited at least in three publications. 
After the application of our other selection criteria (an original case-control study of endometriosis, a reported association between endometriosis and at least one polymorphism, data on women of reproductive age and a diagnosis of endometriosis in the cases established by surgery and/or MRI and confirmed by histology), 28 polymorphisms were eligible for meta-analysis. Only five of the 28 polymorphisms were found to be significantly associated with endometriosis: interferon gamma (IFNG) (CA) repeat, glutathione S-transferase mu 1 (GSTM1) null genotype, glutathione S-transferase pi 1 (GSTP1) rs1695 and wingless-type MMTV integration site family member 4 (WNT4) rs16826658 and rs2235529. Six others showed a significant trend towards an association: progesterone receptor (PGR) PROGINS, intercellular adhesion molecule 1 (ICAM1) rs1799969, aryl-hydrocarbon receptor repressor (AHRR) rs2292596, cytochrome family 17 subfamily A polypeptide 1 (CYP17A1) rs743572, CYP2C19 rs4244285 and peroxisome proliferator-activated receptor gamma (PPARG) rs1801282, and 12 showed a significant trend towards the lack of an association: tumor necrosis factor (TNF) rs1799964, interleukin 6 (IL6) rs1800796, transforming growth factor beta 1 (TGFB1) rs1800469, estrogen receptor 1 (ESR1) rs2234693, PGR rs10895068, FSH receptor (FSHR) rs6166, ICAM1 rs5498, CYP1A1 rs4646903, CYP19A1 rs10046, tumor protein 53 (TP53) rs1042522, X-ray repair complementing defective repair in Chinese hamster cells 1 (XRCC1) rs25487 and serpin peptidase inhibitor clade E member 1 (SERPINE1) rs1799889; however, for the 18 polymorphisms identified in the latter two groups, further studies of the potential association with the endometriosis risk are needed. The remaining five of the 28 polymorphisms were not associated with endometriosis: glutathione S-transferase theta 1 (GSTT1) null genotype, vascular endothelial growth factor alpha (VEGFA) rs699947, rs833061, rs2010963 and rs3025039.
WIDER IMPLICATIONS By carefully taking account of how the control populations were defined, we identified polymorphisms that might be candidates for use in endometriosis screening and polymorphisms not associated with endometriosis. This might constitute the first step towards identifying polymorphism combinations that predispose to endometriosis (IFNG (CA) repeat, GSTM1 null genotype, GSTP1 rs1695, WNT4 rs16826658 and WNT4 rs2235529) in a large cohort of patients with well-defined inclusion criteria. In turn, these results might improve the diagnosis of endometriosis in primary care. Lastly, our present findings may enable a better understanding of endometriosis and improve the management of patients with this disease.
Affiliation(s)
- Loren Méar
- EA7404-GIG, UFR des Sciences de la Santé Simone Veil, UVSQ, F-78180 Montigny le Bretonneux, France; Univ Rennes, Inserm, EHESP, Irset, UMR_S 1085, F-35042 Rennes cedex, France; Protim, Univ Rennes, F-35042 Rennes cedex, France
| | - Marie Herr
- INSERM, U1168, VIMA: Aging and Chronic Diseases, Epidemiological and Public Health Approaches, F-94807 Villejuif, France; UMR-S 1168, UFR des Sciences de la Santé Simone Veil, UVSQ, F-78180 Montigny le Bretonneux, France; Département Hospitalier d'Epidémiologie et Santé Publique, Hôpitaux Universitaires Paris Ile-de-France Ouest, Assistance Publique-Hôpitaux de Paris, F-75000 Paris, France
| | - Arnaud Fauconnier
- EA7325-RISQ, UFR des Sciences de la Santé Simone Veil, UVSQ, F-78180 Montigny le Bretonneux, France; Department of Gynaecology and Obstetrics, CHI de Poissy St Germain en Laye, F-78303 Poissy, France
| | - Charles Pineau
- Univ Rennes, Inserm, EHESP, Irset, UMR_S 1085, F-35042 Rennes cedex, France; Protim, Univ Rennes, F-35042 Rennes cedex, France
| | - François Vialard
- EA7404-GIG, UFR des Sciences de la Santé Simone Veil, UVSQ, F-78180 Montigny le Bretonneux, France; Genetics Federation, CHI de Poissy St Germain en Laye, F-78303 Poissy, France
| |
|
20
|
Hansson LK, Hansen RB, Pletscher-Frankild S, Berzins R, Hansen DH, Madsen D, Christensen SB, Christiansen MR, Boulund U, Wolf XA, Kjærulff SK, van de Bunt M, Tulin S, Jensen TS, Wernersson R, Jensen JN. Semantic text mining in early drug discovery for type 2 diabetes. PLoS One 2020; 15:e0233956. [PMID: 32542027 PMCID: PMC7295186 DOI: 10.1371/journal.pone.0233956] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 08/19/2019] [Accepted: 05/15/2020] [Indexed: 11/18/2022] Open
Abstract
Background Surveying the scientific literature is an important part of early drug discovery, and with the ever-increasing number of biomedical publications it is imperative to focus on the most interesting articles. Here we present a project that highlights new understanding (e.g. recently discovered modes of action) and identifies potential drug targets, via a novel, data-driven text mining approach to score type 2 diabetes (T2D) relevance. We focused on monitoring trends and jumps in T2D relevance to help us stay informed of important breakthroughs in a timely manner. Methods We extracted over 7 million n-grams from PubMed abstracts and then clustered around 240,000 linked to T2D into almost 50,000 T2D relevant ‘semantic concepts’. To score papers, we weighted the concepts based on co-mentioning with core T2D proteins. A protein’s T2D relevance was determined by combining the scores of the papers mentioning it in the five preceding years. Each week all proteins were ranked according to their T2D relevance. Furthermore, the historical distribution of changes in rank from one week to the next was used to calculate the significance of a change in rank by T2D relevance for each protein. Results We show that T2D relevant papers, even those not mentioning T2D explicitly, were prioritised by relevant semantic concepts. Well known T2D proteins were therefore enriched among the top scoring proteins. Our ‘high jumpers’ identified important past developments in the understanding of how certain key proteins relate to T2D, indicating that our method will make us aware of future breakthroughs. In summary, this project facilitated keeping up with current T2D research by repeatedly providing short lists of potential novel targets into our early drug discovery pipeline.
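The abstract above says that the historical distribution of week-to-week rank changes was used to assess the significance of a given change in rank, without stating the exact procedure. One plausible reading is an empirical p-value, sketched below under that assumption; the function name `empirical_p_value`, the add-one smoothing, and the toy numbers are all hypothetical and are not taken from the paper.

```python
from bisect import bisect_left


def empirical_p_value(historical_jumps, observed_jump):
    """Share of historical rank improvements at least as large as the observed one.

    Assumption: a larger positive value means a bigger upward move in the
    ranking. Add-one smoothing keeps the estimate strictly above zero.
    """
    ordered = sorted(historical_jumps)
    n_at_least = len(ordered) - bisect_left(ordered, observed_jump)
    return (n_at_least + 1) / (len(ordered) + 1)


# Toy history of weekly rank changes for illustration only.
history = [0, 1, -2, 3, 0, 5, -1, 2, 1, 0]
print(empirical_p_value(history, 5))
print(empirical_p_value(history, 10))
```

A protein whose weekly jump yields a small value under this scheme would be a candidate "high jumper"; how small is small enough is a threshold choice the abstract does not specify.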
Affiliation(s)
- Lena K. Hansson
- Novo Nordisk Research Centre Oxford, Novo Nordisk Ltd., Oxford, United Kingdom
| | - Rasmus Wernersson
- Intomics A/S, Kgs. Lyngby, Denmark
- DTU Health Tech, Technical University of Denmark, Kgs. Lyngby, Denmark
| | - Jan Nygaard Jensen
- Novo Nordisk Research Centre Oxford, Novo Nordisk Ltd., Oxford, United Kingdom
| |
|
21
|
Döring K, Qaseem A, Becer M, Li J, Mishra P, Gao M, Kirchner P, Sauter F, Telukunta KK, Moumbock AFA, Thomas P, Günther S. Automated recognition of functional compound-protein relationships in literature. PLoS One 2020; 15:e0220925. [PMID: 32126064 PMCID: PMC7053725 DOI: 10.1371/journal.pone.0220925] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/24/2019] [Accepted: 01/29/2020] [Indexed: 11/18/2022] Open
Abstract
MOTIVATION Much effort has been invested in the identification of protein-protein interactions using text mining and machine learning methods. The extraction of functional relationships between chemical compounds and proteins from literature has received much less attention, and no ready-to-use open-source software is so far available for this task. METHOD We created a new benchmark dataset of 2,613 sentences from abstracts containing annotations of proteins, small molecules, and their relationships. Two kernel methods, the shallow linguistic kernel and the all-paths graph kernel, were applied to classify these relationships as functional or non-functional. Furthermore, the benefit of interaction verbs in sentences was evaluated. RESULTS The cross-validation of the all-paths graph kernel (AUC value: 84.6%, F1 score: 79.0%) shows slightly better results than the shallow linguistic kernel (AUC value: 82.5%, F1 score: 77.2%) on our benchmark dataset. Both models achieve state-of-the-art performance in the research area of relation extraction. Furthermore, the combination of the shallow linguistic and all-paths graph kernels could further increase the overall performance slightly. We used each of the two kernels to identify functional relationships in all PubMed abstracts (29 million) and provide the results, including recorded processing time. AVAILABILITY The software for the tested kernels, the benchmark, the processed 29 million PubMed abstracts, all evaluation scripts, as well as the scripts for processing the complete PubMed database are freely available at https://github.com/KerstenDoering/CPI-Pipeline.
Affiliation(s)
- Kersten Döring
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Ammar Qaseem
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Michael Becer
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Jianyu Li
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Pankaj Mishra
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Mingjie Gao
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Pascal Kirchner
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Florian Sauter
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Kiran K. Telukunta
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Aurélien F. A. Moumbock
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| | - Stefan Günther
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
| |
|
22
|
Boland MR, Kashyap A, Xiong J, Holmes J, Lorch S. Development and validation of the PEPPER framework (Prenatal Exposure PubMed ParsER) with applications to food additives. J Am Med Inform Assoc 2019; 25:1432-1443. [PMID: 30371821 PMCID: PMC6213088 DOI: 10.1093/jamia/ocy119] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/18/2018] [Accepted: 08/13/2018] [Indexed: 11/14/2022] Open
Abstract
Background Globally, 36% of deaths among children can be attributed to environmental factors. However, no comprehensive list of environmental exposures exists. We seek to address this gap by developing a literature-mining algorithm to catalog prenatal environmental exposures. Methods We designed a framework called PEPPER (Prenatal Exposure PubMed ParsER) to a) catalog prenatal exposures studied in the literature and b) identify study type. Using PubMed Central, PEPPER classifies article type (methodology, systematic review) and catalogs prenatal exposures. We coupled PEPPER with the FDA's food additive database to form a master set of exposures. Results We found that of 31 764 prenatal exposure studies only 53.0% were methodology studies. PEPPER consists of 219 prenatal exposures, including a common set of 43 exposures. PEPPER captured prenatal exposures from 56.4% of methodology studies (9492/16 832 studies). Two raters independently reviewed 50 randomly selected articles and annotated presence of exposures and study methodology type. Error rates for PEPPER's exposure assignment ranged from 0.56% to 1.30% depending on the rater. Evaluation of the study type assignment showed agreement ranging from 96% to 100% (kappa = 0.909, p < .001). Using a gold-standard set of relevant prenatal exposure studies, PEPPER achieved a recall of 94.4%. Conclusions Using curated exposures and food additives, PEPPER provides the first comprehensive list of 219 prenatal exposures studied in methodology papers. On average, 1.45 exposures were investigated per study. PEPPER successfully distinguished article type for all prenatal studies, allowing literature gaps to be easily identified.
Affiliation(s)
- Mary Regina Boland
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA; Center for Excellence in Environmental Toxicology, University of Pennsylvania, Philadelphia, PA, USA; Department of Biomedical and Health Informatics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - Aditya Kashyap
- Data Science Masters Program, University of Pennsylvania, Philadelphia, PA, USA
| | - Jiadi Xiong
- Data Science Masters Program, University of Pennsylvania, Philadelphia, PA, USA
| | - John Holmes
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA
| | - Scott Lorch
- Division of Neonatology, Department of Pediatrics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
| |
|
23
|
Sun Y, Hou L, Qin L, Liu Y, Li J, Qian Q. RCorp: a resource for chemical disease semantic extraction in Chinese. BMC Med Inform Decis Mak 2019; 19:234. [PMID: 31801523 PMCID: PMC6894109 DOI: 10.1186/s12911-019-0936-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 12/01/2022] Open
Abstract
BACKGROUND To robustly identify synergistic combinations of drugs, high-throughput screenings are desirable. It will be of great help to automatically identify the relations in published papers with machine learning based tools. To support chemical-disease semantic relation extraction, especially for chronic diseases, we manually annotated a chronic disease specific corpus for combination therapy discovery in Chinese (RCorp). METHODS In this study, we extracted abstracts from a Chinese medical literature server and followed the annotation framework of the BioCreative CDR corpus, with the guidelines modified to make the combination therapy related relations available. An annotation tool was incorporated into the standard annotation process. RESULTS The resulting RCorp consists of 339 Chinese biomedical articles with 2367 annotated chemicals, 2113 diseases, 237 symptoms, 164 chemical-induce-disease relations, 163 chemical-induce-symptom relations, and 805 chemical-treat-disease relations. Each annotation includes both the mention text spans and normalized concept identifiers. The corpus achieves an inter-annotator agreement, measured by F score, of 0.883 for chemical entities and 0.791 for disease entities; the F score for chemical-treat-disease relations is 0.788 after unifying the entity mentions. CONCLUSIONS We extracted and manually annotated a chronic disease specific corpus for combination therapy discovery in Chinese. The analysis of the corpus demonstrates its quality for the combination therapy related knowledge discovery task. Our annotated corpus would be a useful resource for the modelling of entity recognition and relation extraction tools. In the future, an evaluation based on the corpus will be held.
Affiliation(s)
- Yueping Sun
  - Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, 100020 China
- Li Hou
  - Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, 100020 China
- Lu Qin
  - Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, 100020 China
- Yan Liu
  - Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, 100020 China
- Jiao Li
  - Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, 100020 China
- Qing Qian
  - Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, 100020 China
|
24
|
Couch D, Yu Z, Nam JH, Allen C, Ramos PS, da Silveira WA, Hunt KJ, Hazard ES, Hardiman G, Lawson A, Chung D. GAIL: An interactive webserver for inference and dynamic visualization of gene-gene associations based on gene ontology guided mining of biomedical literature. PLoS One 2019; 14:e0219195. [PMID: 31260503 PMCID: PMC6602258 DOI: 10.1371/journal.pone.0219195] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/23/2019] [Accepted: 06/18/2019] [Indexed: 01/08/2023] Open
Abstract
In systems biology, inference of functional associations among genes is compelling because the construction of functional association networks facilitates biomarker discovery. Specifically, such gene associations in human can help identify putative biomarkers that can be used as diagnostic tools in treating patients. Although biomedical literature is considered a valuable data source for this task, currently only a limited number of webservers are available for mining gene-gene associations from the vast amount of biomedical literature using text mining techniques. Moreover, these webservers often have limited coverage of biomedical literature and also lack efficient and user-friendly tools to interpret and visualize mined relationships among genes. To address these limitations, we developed GAIL (Gene-gene Association Inference based on biomedical Literature), an interactive webserver that infers human gene-gene associations from Gene Ontology (GO) guided biomedical literature mining and provides dynamic visualization of the resulting association networks and various gene set enrichment analysis tools. We evaluate the utility and performance of GAIL with applications to gene signatures associated with systemic lupus erythematosus and breast cancer. Results show that GAIL allows effective interrogation and visualization of gene-gene networks and their subnetworks, which facilitates biological understanding of gene-gene associations. GAIL is available at http://chunglab.io/GAIL/.
Affiliation(s)
- Daniel Couch
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
- Zhenning Yu
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
- Jin Hyun Nam
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
- Carter Allen
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
- Paula S. Ramos
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
  - Department of Medicine, Medical University of South Carolina, Charleston, SC, United States of America
- Willian A. da Silveira
  - Department of Pathology and Laboratory Medicine, Medical University of South Carolina, Charleston, SC, United States of America
  - Center for Genomic Medicine, Medical University of South Carolina, Charleston, SC, United States of America
- Kelly J. Hunt
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
- Edward S. Hazard
  - Center for Genomic Medicine, Medical University of South Carolina, Charleston, SC, United States of America
- Gary Hardiman
  - Department of Medicine, Medical University of South Carolina, Charleston, SC, United States of America
  - Center for Genomic Medicine, Medical University of South Carolina, Charleston, SC, United States of America
- Andrew Lawson
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
- Dongjun Chung
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States of America
|
25
|
Jiang X, Ringwald M, Blake JA, Arighi C, Zhang G, Shatkay H. An effective biomedical document classification scheme in support of biocuration: addressing class imbalance. Database (Oxford) 2019; 2019:baz045. [PMID: 31032839 PMCID: PMC6482935 DOI: 10.1093/database/baz045] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/31/2018] [Revised: 02/26/2019] [Accepted: 03/18/2019] [Indexed: 01/01/2023]
Abstract
Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory's Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.
Affiliation(s)
- Xiangying Jiang
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Judith A Blake
  - The Jackson Laboratory, 600 Main St., Bar Harbor, ME, USA
- Cecilia Arighi
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
  - Center of Bioinformatics and Computational Biology, Delaware Biotechnology Institute, Newark, DE, USA
- Gongbo Zhang
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Hagit Shatkay
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
  - Center of Bioinformatics and Computational Biology, Delaware Biotechnology Institute, Newark, DE, USA
|
26
|
Urda D, Aragón F, Bautista R, Franco L, Veredas FJ, Claros MG, Jerez JM. BLASSO: integration of biological knowledge into a regularized linear model. BMC Syst Biol 2018; 12:94. [PMID: 30458775 PMCID: PMC6245593 DOI: 10.1186/s12918-018-0612-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/21/2022]
Abstract
Background In RNA-Seq gene expression analysis, a genetic signature or biomarker is defined as a subset of genes that is probably involved in a given complex human trait and usually provides predictive capabilities for that trait. The discovery of new genetic signatures is challenging, as it entails the analysis of complex information encoded at the gene level. Moreover, biomarker selection becomes unstable, since high correlation usually exists among the thousands of genes included in each sample, which leads to very low overlap between the genetic signatures proposed by different authors. In this sense, this paper proposes BLASSO, a simple and highly interpretable linear model with l1-regularization that incorporates prior biological knowledge into the prediction of breast cancer outcomes. Two different approaches to integrate biological knowledge in BLASSO, Gene-specific and Gene-disease, are proposed to test their predictive performance and biomarker stability on a public RNA-Seq gene expression dataset for breast cancer. The relevance of the genetic signature for the model is inspected by a functional analysis. Results BLASSO has been compared with a baseline LASSO model. Using 10-fold cross-validation with 100 repetitions for model assessment, average AUC values of 0.7 and 0.69 were obtained for the Gene-specific and the Gene-disease approaches, respectively. These values outperform the average AUC of 0.65 obtained with the LASSO. With respect to the stability of the genetic signatures found, BLASSO outperformed the baseline model in terms of the robustness index (RI). The Gene-specific approach gave an RI of 0.15±0.03, compared to an RI of 0.09±0.03 given by LASSO, thus being about 66% more robust. The functional analysis performed on the genetic signature obtained with the Gene-disease approach showed a significant presence of genes related to cancer, as well as one gene (IFNK) and one pseudogene (PCNAP1) which a priori had not been described as related to cancer. Conclusions BLASSO has been shown to be a good choice both in terms of predictive efficacy and biomarker stability, when compared to other similar approaches. Further functional analyses of the genetic signatures obtained with BLASSO have not only revealed genes with important roles in cancer, but also genes that could play an unknown or collateral role in the studied disease.
Affiliation(s)
- Daniel Urda
  - Universidad de Cádiz, Departamento de Ingeniería Informática, Avda. de la Universidad de Cádiz n°10, Puerto Real, Cádiz, 11519, Spain
- Francisco Aragón
  - Universidad de Málaga, Departamento de Lenguajes y Ciencias de la Computación, Bulevar Louis Pasteur, 35. Campus de Teatinos, Málaga, 29071, Spain
- Rocío Bautista
  - Universidad de Málaga, Plataforma Andaluza de Bioinformática, Parque Tecnológico de Andalucía, Calle Severo Ochoa 34, Málaga, 29590, Spain
- Leonardo Franco
  - Instituto de Investigación Biomédica de Málaga (IBIMA), Inteligencia Computacional en Biomedicina, Avda. Jorge Luis Borges n°15 Bl.3 Pl.3, Málaga, 29010, Spain
  - Universidad de Málaga, Departamento de Lenguajes y Ciencias de la Computación, Bulevar Louis Pasteur, 35. Campus de Teatinos, Málaga, 29071, Spain
- Francisco J Veredas
  - Instituto de Investigación Biomédica de Málaga (IBIMA), Inteligencia Computacional en Biomedicina, Avda. Jorge Luis Borges n°15 Bl.3 Pl.3, Málaga, 29010, Spain
  - Universidad de Málaga, Departamento de Lenguajes y Ciencias de la Computación, Bulevar Louis Pasteur, 35. Campus de Teatinos, Málaga, 29071, Spain
- Manuel Gonzalo Claros
  - Universidad de Málaga, Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Campus Universitario de Teatinos, Málaga, 29071, Spain
- José Manuel Jerez
  - Instituto de Investigación Biomédica de Málaga (IBIMA), Inteligencia Computacional en Biomedicina, Avda. Jorge Luis Borges n°15 Bl.3 Pl.3, Málaga, 29010, Spain
  - Universidad de Málaga, Departamento de Lenguajes y Ciencias de la Computación, Bulevar Louis Pasteur, 35. Campus de Teatinos, Málaga, 29071, Spain
|
27
|
Abstract
Pork accounts for more than one-third of meat produced worldwide and is an important component of global food security, agricultural economies, and trade. Infectious diseases are among the primary constraints to swine production, and the globalization of the swine industry has contributed to the emergence and spread of pathogens. Despite the importance of infectious diseases to animal health and the stability and productivity of the global swine industry, pathogens of swine have never been reviewed at a global scale. Here, we build a holistic global picture of research on swine pathogens to enhance preparedness and understand patterns of emergence and spread. By conducting a scoping review of more than 57,000 publications across 50 years, we identify priority pathogens globally and regionally, and characterize geographic and temporal trends in research priorities. Of the 40 identified pathogens, publication rates for eight pathogens increased faster than overall trends, suggesting that these pathogens may be emerging or constitute an increasing threat. We also compared regional patterns of pathogen prioritization in the context of policy differences, history of outbreaks, and differing swine health challenges faced in regions where swine production has become more industrialized. We documented a general increasing trend in importance of zoonotic pathogens and show that structural changes in the industry related to intensive swine production shift pathogen prioritization. Multinational collaboration networks were strongly shaped by region, colonial ties, and pig trade networks. This review represents the most comprehensive overview of research on swine infectious diseases to date.
|
28
|
Wei CH, Phan L, Feltz J, Maiti R, Hefferon T, Lu Z. tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine. Bioinformatics 2018; 34:80-87. [PMID: 28968638 DOI: 10.1093/bioinformatics/btx541] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 05/17/2017] [Accepted: 08/31/2017] [Indexed: 11/12/2022] Open
Abstract
Motivation Despite significant efforts in expert curation, clinical relevance about most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the variant biological function/disease impact is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques but are of limited use because their mutation extraction results are not standardized or integrated with curated data. Results We propose an automatic method to extract and normalize variant mentions to unique identifiers (dbSNP RSIDs). Our method, in benchmarking results, demonstrates a high F-measure of ∼90% and compared favorably to the state of the art. Next, we applied our approach to the entire PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes which are presumed to be deleterious and not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating/prioritizing variants in genomic research. Availability and implementation The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/. Contact zhiyong.lu@nih.gov.
Affiliation(s)
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Lon Phan
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Juliana Feltz
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Rama Maiti
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Tim Hefferon
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
|
29
|
Fergadis A, Baziotis C, Pappas D, Papageorgiou H, Potamianos A. Hierarchical bi-directional attention-based RNNs for supporting document classification on protein-protein interactions affected by genetic mutations. Database (Oxford) 2018; 2018:5077305. [PMID: 30137284 PMCID: PMC6105093 DOI: 10.1093/database/bay076] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/15/2018] [Accepted: 06/22/2018] [Indexed: 02/03/2023]
Abstract
In this paper, we describe a hierarchical bi-directional attention-based Recurrent Neural Network (RNN) as a reusable sequence encoder architecture, which is used as sentence and document encoder for document classification. The sequence encoder is composed of two bi-directional RNNs equipped with an attention mechanism that identifies and captures the most important elements, words or sentences, in a document, followed by a dense layer for the classification task. Our approach exploits the hierarchical nature of documents, which are composed of sequences of sentences, which in turn are composed of sequences of words. In our model, we use word embeddings to project the words to a low-dimensional vector space. We leverage word embeddings trained on PubMed for initializing the embedding layer of our network. We apply this model to biomedical literature, specifically to paper abstracts published in PubMed. We argue that the title of the paper itself usually contains important information more salient than a typical sentence in the abstract. For this reason, we propose a shortcut connection that integrates the title vector representation directly into the final feature representation of the document. We concatenate the sentence vector that represents the title and the vectors of the abstract to form the document feature vector used as input to the task classifier. With this system we participated in the Document Triage Task of the BioCreative VI Precision Medicine Track and achieved 0.6289 Precision, 0.7656 Recall and 0.6906 F1-score, with the Precision and F1-score ranking first among the participating systems. Database URL: https://github.com/afergadis/BC6PM-HRNN
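As a quick consistency check on the figures reported in this abstract, the F1-score can be recomputed as the harmonic mean of precision and recall. The sketch below is illustrative only: the function name `f1_score` is our own and is not taken from the authors' repository, and the inputs are the rounded values quoted above, so agreement with the reported 0.6906 is expected only to within rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Convention: F1 is defined as 0 when both inputs are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the abstract (assumed inputs, not raw system output).
reported_precision = 0.6289
reported_recall = 0.7656
reported_f1 = 0.6906

recomputed_f1 = f1_score(reported_precision, reported_recall)

# The recomputed value should agree with the reported one up to rounding error.
print(f"recomputed F1 = {recomputed_f1:.4f}, reported F1 = {reported_f1:.4f}")
print("consistent within 0.001:", abs(recomputed_f1 - reported_f1) < 1e-3)
```

The same check can be applied to any precision, recall and F-measure triple quoted in this list, bearing in mind that published values are usually rounded.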
Affiliation(s)
- Aris Fergadis
  - School of Electrical and Computer Engineering, National Technical University of Athens, Zografou Campus 9, Iroon Polytechniou str, Athens, Greece
  - Institute for Language and Speech Processing, "Athena" Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi, Athens, Greece
- Christos Baziotis
  - School of Electrical and Computer Engineering, National Technical University of Athens, Zografou Campus 9, Iroon Polytechniou str, Athens, Greece
  - Department of Informatics, Athens University of Economics and Business, 76 Patission Str., Athens, Greece
- Dimitris Pappas
  - Institute for Language and Speech Processing, "Athena" Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi, Athens, Greece
  - Department of Informatics, Athens University of Economics and Business, 76 Patission Str., Athens, Greece
- Haris Papageorgiou
  - Institute for Language and Speech Processing, "Athena" Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi, Athens, Greece
- Alexandros Potamianos
  - School of Electrical and Computer Engineering, National Technical University of Athens, Zografou Campus 9, Iroon Polytechniou str, Athens, Greece
  - Institute for Language and Speech Processing, "Athena" Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi, Athens, Greece
|
30
|
Chen Q, Panyam NC, Elangovan A, Verspoor K. BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics. Database (Oxford) 2018; 2018:5255181. [PMID: 30576491 PMCID: PMC6301335 DOI: 10.1093/database/bay122] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 04/02/2018] [Revised: 09/24/2018] [Accepted: 10/16/2018] [Indexed: 01/01/2023]
Abstract
Precision medicine aims to provide personalized treatments based on individual patient profiles. One critical step towards precision medicine is leveraging knowledge derived from biomedical publications, a tremendous literature resource presenting the latest scientific discoveries on genes, mutations and diseases. Biomedical natural language processing (BioNLP) plays a vital role in supporting automation of this process. BioCreative VI Track 4 brings community effort to the task of automatically identifying and extracting protein-protein interactions (PPI) affected by mutations (PPIm), important in the precision medicine context for capturing individual genotype variation related to disease. We present the READ-BioMed team's approach to identifying PPIm-related publications and to extracting specific PPIm information from those publications in the context of the BioCreative VI PPIm track. We observe that current BioNLP tools are insufficient to recognise entities for these two tasks; the best existing mutation recognition tool achieves only 55% recall in the document triage training set, while relation extraction performance is limited by the low recall performance of gene entity recognition. We develop the models accordingly: for document triage, we develop term lists capturing interactions and mutations to complement BioNLP tools, and select effective features via a feature contribution study, whereas an ensemble of BioNLP tools is employed for relation extraction. Our best document triage model achieves an F-score of 66.77%, while our best model for relation extraction achieves an F-score of 35.09% over the final (updated post-task) test set. Impacting the document triage task, the characteristics of mutations are statistically different in the training and testing sets. While a vital new direction for biomedical text mining research, this early attempt to tackle the problem of identifying genetic variation of substantial biological significance highlights the importance of representative training data and the cascading impact of tool limitations in a modular system.
Affiliation(s)
- Qingyu Chen
  - School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
- Nagesh C Panyam
  - School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
- Aparna Elangovan
  - School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
|
31
|
Differential gene expression in disease: a comparison between high-throughput studies and the literature. BMC Med Genomics 2017; 10:59. [PMID: 29020950 PMCID: PMC5637346 DOI: 10.1186/s12920-017-0293-y] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Received: 03/31/2017] [Accepted: 10/02/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Differential gene expression is important to understand the biological differences between healthy and diseased states. Two common sources of differential gene expression data are microarray studies and the biomedical literature. METHODS With the aid of text mining and gene expression analysis we have examined the comparative properties of these two sources of differential gene expression data. RESULTS The literature shows a preference for reporting genes associated to higher fold changes in microarray data, rather than genes that are simply significantly differentially expressed. Thus, the resemblance between the literature and microarray data increases when the fold-change threshold for microarray data is increased. Moreover, the literature has a reporting preference for differentially expressed genes that (1) are overexpressed rather than underexpressed; (2) are overexpressed in multiple diseases; and (3) are popular in the biomedical literature at large. Additionally, the degree to which diseases are similar depends on whether microarray data or the literature is used to compare them. Finally, vaguely-qualified reports of differential expression magnitudes in the literature have only small correlation with microarray fold-change data. CONCLUSIONS Reporting biases of differential gene expression in the literature can be affecting our appreciation of disease biology and of the degree of similarity that actually exists between different diseases.
|
32
|
Camel V, Galeano E, Carrer H. Red de coexpresión de 320 genes de Tectona grandis relacionados con procesos de estrés abiótico y xilogénesis [Coexpression network of 320 Tectona grandis genes related to abiotic stress and xylogenesis processes]. TIP Revista Especializada en Ciencias Químico-Biológicas 2017. [DOI: 10.1016/j.recqb.2017.04.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022] Open
|
33
|
Liu X, Yang Z, Lin H, Simmons M, Lu Z. DIGNiFI: Discovering causative genes for orphan diseases using protein-protein interaction networks. BMC Syst Biol 2017; 11:23. [PMID: 28361678 PMCID: PMC5374555 DOI: 10.1186/s12918-017-0402-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 01/25/2023]
Abstract
BACKGROUND An orphan disease is any disease that affects a small percentage of the population. Orphan diseases are a great burden to patients and society, and most of them are genetic in origin. Unfortunately, our current understanding of the genes responsible for inherited orphan diseases is still quite limited. Developing effective computational algorithms to discover disease-causing genes would help unveil disease mechanisms and may enable better diagnosis and treatment. RESULTS We have developed a novel method, named DIGNiFI (Disease causIng GeNe FInder), which uses Protein-Protein Interaction (PPI) network-based features to discover and rank candidate disease-causing genes. Specifically, our approach computes topologically similar genes by taking into account both local and global connected paths in PPI networks via Direct Neighbors and Local Random Walks, respectively. Furthermore, since genes with similar phenotypes tend to be functionally related, we have integrated PPI data with gene ontology (GO) annotations and protein complex data to further improve the performance of this approach. Results on 128 orphan diseases with 1184 known disease genes collected from Orphanet show that our proposed methods outperform existing state-of-the-art methods for discovering candidate disease-causing genes. We also show that further performance improvement can be achieved when enriching the human-curated PPI network data with text-mined interactions from the biomedical literature. Finally, we demonstrate the utility of our approach by applying our method to identifying novel candidate genes for a set of four inherited retinal dystrophies. In this study, we found the top predictions for these retinal dystrophies consistent with literature reports and online databases of other retinal dystrophies. CONCLUSIONS Our method successfully prioritizes orphan-disease-causative genes. This method has great potential to benefit the field of orphan disease research, where resources are scarce and greatly needed.
Affiliation(s)
- Xiaoxia Liu
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health, Bethesda, 20894, MD, USA
- Zhihao Yang
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China
- Hongfei Lin
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China
- Michael Simmons
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health, Bethesda, 20894, MD, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health, Bethesda, 20894, MD, USA
|
34
|
Singhal A, Leaman R, Catlett N, Lemberger T, McEntyre J, Polson S, Xenarios I, Arighi C, Lu Z. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database (Oxford) 2016; 2016:baw161. [PMID: 28025348 PMCID: PMC5199160 DOI: 10.1093/database/baw161] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 04/19/2016] [Revised: 11/10/2016] [Accepted: 11/11/2016] [Indexed: 12/24/2022]
Abstract
Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system 'accuracy' remains a challenge and identify several additional common difficulties and potential research directions including (i) the 'scalability' issue due to the increasing need of mining information from millions of full-text articles, (ii) the 'interoperability' issue of integrating various text-mining systems into existing curation workflows and (iii) the 'reusability' issue on the difficulty of applying trained systems to text genres that are not seen previously during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. Finally, in order to sustain the curation ecosystem and have text-mining systems adopted for practical benefits, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators.
Affiliation(s)
- Ayush Singhal
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Robert Leaman
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Johanna McEntyre
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Shawn Polson
  - Center for Bioinformatics and Computational Biology and Department of Computer and Information Sciences, Delaware Biotechnology Institute, University of Delaware, Newark, DE 19711, USA
- Cecilia Arighi
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
  - Center for Bioinformatics and Computational Biology and Department of Computer and Information Sciences, Delaware Biotechnology Institute, University of Delaware, Newark, DE 19711, USA
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
35
|
Gore R, Diallo S, Padilla J. Classifying modeling and simulation as a scientific discipline. Scientometrics 2016. [DOI: 10.1007/s11192-016-2050-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022]
|
36
|
Liu JL, Zhao M. A PubMed-wide study of endometriosis. Genomics 2016; 108:151-157. [DOI: 10.1016/j.ygeno.2016.10.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 06/14/2016] [Revised: 09/30/2016] [Accepted: 10/12/2016] [Indexed: 12/18/2022]
|
37
|
Evaluation and Verification of the Global Rapid Identification of Threats System for Infectious Diseases in Textual Data Sources. Interdiscip Perspect Infect Dis 2016; 2016:5080746. [PMID: 27698665 PMCID: PMC5028852 DOI: 10.1155/2016/5080746] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 04/20/2016] [Revised: 08/06/2016] [Accepted: 08/15/2016] [Indexed: 11/17/2022]
Abstract
The Global Rapid Identification of Threats System (GRITS) is a biosurveillance application that enables infectious disease analysts to monitor nontraditional information sources (e.g., social media, online news outlets, ProMED-mail reports, and blogs) for infectious disease threats. GRITS analyzes these textual data sources by identifying, extracting, and succinctly visualizing epidemiologic information and suggests potentially associated infectious diseases. This manuscript evaluates and verifies the diagnoses that GRITS performs and discusses novel aspects of the software package. Via GRITS' web interface, infectious disease analysts can examine dynamic visualizations of GRITS' analyses and explore historical infectious disease emergence events. The GRITS API can be used to continuously analyze information feeds, and the API enables GRITS technology to be easily incorporated into other biosurveillance systems. GRITS is a flexible tool that can be modified to conduct sophisticated medical report triaging, expanded to include customized alert systems, and tailored to address other biosurveillance needs.
|
38
|
Fluck J, Madan S, Ansari S, Kodamullil AT, Karki R, Rastegar-Mojarad M, Catlett NL, Hayes W, Szostak J, Hoeng J, Peitsch M. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw113. [PMID: 27554092 PMCID: PMC4995071 DOI: 10.1093/database/baw113] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 12/23/2015] [Accepted: 07/07/2016] [Indexed: 01/21/2023]
Abstract
Success in extracting biological relationships is mainly dependent on the complexity of the task as well as the availability of high-quality training data. Here, we describe the new corpora in the systems biology modeling language BEL for training and testing biological relationship extraction systems that we prepared for the BioCreative V BEL track. BEL was designed to capture relationships not only between proteins or chemicals, but also complex events such as biological processes or disease states. A BEL nanopub is the smallest unit of information and represents a biological relationship with its provenance. In BEL relationships (called BEL statements), the entities are normalized to defined namespaces mainly derived from public repositories, such as sequence databases, MeSH or publicly available ontologies. In the BEL nanopubs, the BEL statements are associated with citation information and supportive evidence such as a text excerpt. To enable the training of extraction tools, we prepared BEL resources and made them available to the community. We selected a subset of these resources focusing on a reduced set of namespaces, namely, human and mouse genes, ChEBI chemicals, MeSH diseases and GO biological processes, as well as relationship types ‘increases’ and ‘decreases’. The published training corpus contains 11 000 BEL statements from over 6000 supportive text excerpts. For method evaluation, we selected and re-annotated two smaller subcorpora containing 100 text excerpts. For this re-annotation, the inter-annotator agreement was measured by the BEL track evaluation environment and resulted in a maximal F-score of 91.18% for full statement agreement. In addition, for a set of 100 BEL statements, we do not only provide the gold standard expert annotations, but also text excerpts pre-selected by two automated systems. Those text excerpts were evaluated and manually annotated as true or false supportive in the course of the BioCreative V BEL track task. 
Database URL: http://wiki.openbel.org/display/BIOC/Datasets
Affiliation(s)
- Juliane Fluck
  - Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- Sumit Madan
  - Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- Sam Ansari
  - Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
- Alpha T Kodamullil
  - Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- Reagon Karki
  - Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- William Hayes
  - Selventa, One Alewife Center, Cambridge, MA 02140, USA
- Justyna Szostak
  - Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
- Julia Hoeng
  - Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
- Manuel Peitsch
  - Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
|
39
|
Li J, Sun Y, Johnson RJ, Sciaky D, Wei CH, Leaman R, Davis AP, Mattingly CJ, Wiegers TC, Lu Z. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw068. [PMID: 27161011 PMCID: PMC4860626 DOI: 10.1093/database/baw068] [Citation(s) in RCA: 138] [Impact Index Per Article: 17.3] [Received: 12/25/2015] [Accepted: 04/11/2016] [Indexed: 11/14/2022]
Abstract
Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection is required to contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus called BC5CDR during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators followed by a consensus annotation: The average inter-annotator agreement (IAA) scores were 87.49% and 96.05% for the disease and chemicals, respectively, in the test set according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community. Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/.
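As a minimal illustration of the agreement measure named in this abstract, the sketch below computes the Jaccard similarity coefficient between two annotators' entity sets. The annotation tuples and concept identifiers are hypothetical and are not drawn from the BC5CDR corpus. The abstract does not say how spans were matched, so treating two annotations as equal only when start offset, end offset and identifier all match is an assumption.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; taken as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical annotations: (start offset, end offset, concept identifier).
annotator_1 = {(0, 8, "D001241"), (25, 33, "D006261"), (40, 52, "D014570")}
annotator_2 = {(0, 8, "D001241"), (25, 33, "D006261"), (60, 70, "D003920")}

# Two shared annotations out of four distinct ones.
agreement = jaccard(annotator_1, annotator_2)
print(f"IAA (Jaccard): {agreement:.2%}")  # prints "IAA (Jaccard): 50.00%"
```

A corpus-level score such as the 87.49% quoted above would presumably be an average of per-document values like this one, although the averaging scheme is not given in the abstract.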
Affiliation(s)
- Jiao Li
  - Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China
- Yueping Sun
  - Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China
- Robin J Johnson
  - Department of Biological Sciences and the Center for Human Health and the Environment, North Carolina State University, Raleigh, NC 27695, USA
- Daniela Sciaky
  - Department of Biological Sciences and the Center for Human Health and the Environment, North Carolina State University, Raleigh, NC 27695, USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information, Bethesda, MD 20894, USA
- Robert Leaman
  - National Center for Biotechnology Information, Bethesda, MD 20894, USA
- Allan Peter Davis
  - Department of Biological Sciences and the Center for Human Health and the Environment, North Carolina State University, Raleigh, NC 27695, USA
- Carolyn J Mattingly
  - Department of Biological Sciences and the Center for Human Health and the Environment, North Carolina State University, Raleigh, NC 27695, USA
- Thomas C Wiegers
  - Department of Biological Sciences and the Center for Human Health and the Environment, North Carolina State University, Raleigh, NC 27695, USA
- Zhiyong Lu
  - National Center for Biotechnology Information, Bethesda, MD 20894, USA
|
40
|
Gu J, Qian L, Zhou G. Chemical-induced disease relation extraction with various linguistic features. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw042. [PMID: 27052618 PMCID: PMC4822558 DOI: 10.1093/database/baw042] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 12/04/2015] [Accepted: 03/04/2016] [Indexed: 01/06/2023]
Abstract
Understanding the relations between chemicals and diseases is crucial in various biomedical tasks such as new drug discoveries and new therapy developments. Manually mining these relations from the biomedical literature is costly and time-consuming, and such a procedure is often difficult to keep up-to-date. To address these issues, the BioCreative-V community proposed a challenging task of automatic extraction of chemical-induced disease (CID) relations in order to benefit biocuration. This article describes our work on the CID relation extraction task of BioCreative-V. We built a machine learning based system that utilized simple yet effective linguistic features to extract relations with maximum entropy models. In addition to leveraging various features, the hypernym relations between entity concepts derived from the Medical Subject Headings (MeSH)-controlled vocabulary were also employed during both training and testing stages to obtain more accurate classification models and better extraction performance, respectively. We reduced relation extraction between entities in documents to relation extraction between entity mentions. In our system, pairs of chemical and disease mentions at both intra- and inter-sentence levels were first constructed as relation instances for training and testing, then two classification models at both levels were trained from the training examples and applied to the testing examples. Finally, we merged the classification results from mention level to document level to acquire final relations between chemicals and diseases. Our system achieved promising F-scores of 60.4% on the development dataset and 58.3% on the test dataset, both using gold-standard entity annotations. Database URL: https://github.com/JHnlp/BC5CIDTask
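The final merging step described in this abstract can be sketched as follows. The abstract does not state the merging rule, so the code assumes one common choice: a chemical-disease pair is reported for a document if at least one of its mention pairs was classified as positive. The function name, the input layout and all identifiers are hypothetical and are not taken from the authors' repository.

```python
from collections import defaultdict


def merge_to_document_level(mention_predictions):
    """Collapse mention-pair predictions into document-level relations.

    mention_predictions: iterable of
        (doc_id, chemical_id, disease_id, is_positive) tuples,
        one per classified mention pair (intra- or inter-sentence).
    Returns a dict mapping doc_id to a set of (chemical_id, disease_id).
    Assumed rule: any positive mention pair yields a document-level relation.
    """
    relations = defaultdict(set)
    for doc_id, chemical_id, disease_id, is_positive in mention_predictions:
        if is_positive:
            relations[doc_id].add((chemical_id, disease_id))
    return dict(relations)


# Hypothetical classifier output for one abstract.
predictions = [
    ("doc1", "C1", "D1", False),  # first mention pair: rejected
    ("doc1", "C1", "D1", True),   # second mention pair of the same concepts: accepted
    ("doc1", "C2", "D1", False),  # no positive mention pair, so no relation
]
for doc_id, pairs in merge_to_document_level(predictions).items():
    print(doc_id, sorted(pairs))
```

An "any positive" rule favours recall; a stricter alternative, such as requiring a minimum number of positive mention pairs, would trade recall for precision.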
Affiliation(s)
- Jinghang Gu
  - Natural Language Processing Lab, School of Computer Science and Technology, Soochow University, 1 Shizi Street, Suzhou, China, 215006
- Longhua Qian
  - Natural Language Processing Lab, School of Computer Science and Technology, Soochow University, 1 Shizi Street, Suzhou, China, 215006
- Guodong Zhou
  - Natural Language Processing Lab, School of Computer Science and Technology, Soochow University, 1 Shizi Street, Suzhou, China, 215006
|
41
|
Pafilis E, Buttigieg PL, Ferrell B, Pereira E, Schnetzer J, Arvanitidis C, Jensen LJ. EXTRACT: interactive extraction of environment metadata and term suggestion for metagenomic sample annotation. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw005. [PMID: 26896844 PMCID: PMC4761108 DOI: 10.1093/database/baw005] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 11/02/2015] [Accepted: 01/11/2016] [Indexed: 12/11/2022]
Abstract
The microbial and molecular ecology research communities have made substantial progress on developing standards for annotating samples with environment metadata. However, manual sample annotation is a highly labor-intensive process and requires familiarity with the terminologies used. We have therefore developed an interactive annotation tool, EXTRACT, which helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples. Behind its web-based user interface, the system combines published methods for named entity recognition of environment, organism, tissue and disease terms. The evaluators in the BioCreative V Interactive Annotation Task found the system to be intuitive, useful, well documented and sufficiently accurate to be helpful in spotting relevant text passages and extracting organism and environment terms. Comparison of fully manual and text-mining-assisted curation revealed that EXTRACT speeds up annotation by 15-25% and helps curators to detect terms that would otherwise have been missed. Database URL: https://extract.hcmr.gr/.
Affiliation(s)
- Evangelos Pafilis
  - Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, P.O. Box 2214, 71003 Heraklion, Crete, Greece
- Pier Luigi Buttigieg
  - Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Am Handelshafen 12, D-27570 Bremerhaven, Germany
- Barbra Ferrell
  - Delaware Biotechnology Institute, Newark, DE 19711, Delaware, USA
- Emiliano Pereira
  - Max Planck Institute for Marine Microbiology, Celsiusstr. 1, 28359, Bremen, Germany
- Julia Schnetzer
  - Max Planck Institute for Marine Microbiology, Celsiusstr. 1, 28359, Bremen, Germany
  - Jacobs University gGmbH, School of Engineering and Sciences, Campus Ring 1, 28759, Bremen, Germany
- Christos Arvanitidis
  - Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, P.O. Box 2214, 71003 Heraklion, Crete, Greece
- Lars Juhl Jensen
  - Disease Systems Biology Program, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3B, DK-2200, Copenhagen, Denmark
|
42
|
Serin EAR, Nijveen H, Hilhorst HWM, Ligterink W. Learning from Co-expression Networks: Possibilities and Challenges. FRONTIERS IN PLANT SCIENCE 2016; 7:444. [PMID: 27092161 PMCID: PMC4825623 DOI: 10.3389/fpls.2016.00444] [Citation(s) in RCA: 186] [Impact Index Per Article: 23.3] [Received: 01/14/2016] [Accepted: 03/21/2016] [Indexed: 05/18/2023]
Abstract
Plants are fascinating and complex organisms. A comprehensive understanding of the organization, function and evolution of plant genes is essential to disentangle important biological processes and to advance crop engineering and breeding strategies. The ultimate aim in deciphering complex biological processes is the discovery of causal genes and regulatory mechanisms controlling these processes. The recent surge of omics data has opened the door to a system-wide understanding of the flow of biological information underlying complex traits. However, dealing with the corresponding large data sets represents a challenging endeavor that calls for the development of powerful bioinformatics methods. A popular approach is the construction and analysis of gene networks. Such networks are often used for genome-wide representation of the complex functional organization of biological systems. Networks based on similarity in gene expression are called (gene) co-expression networks. One of the major applications of gene co-expression networks is the functional annotation of unknown genes. Constructing co-expression networks is generally straightforward. In contrast, the resulting network of connected genes can become very complex, which limits its biological interpretation. Several strategies can be employed to enhance the interpretation of the networks. A strategy in coherence with the biological question addressed needs to be established to infer reliable networks. Additional benefits can be gained from network-based strategies using prior knowledge and data integration to further enhance the elucidation of gene regulatory relationships. As a result, biological networks provide many more applications beyond the simple visualization of co-expressed genes. In this study we review the different approaches for co-expression network inference in plants. We analyse integrative genomics strategies used in recent studies that successfully identified candidate genes taking advantage of gene co-expression networks. Additionally, we discuss promising bioinformatics approaches that predict networks for specific purposes.
Affiliation(s)
- Elise A. R. Serin
  - Wageningen Seed Lab, Laboratory of Plant Physiology, Wageningen University, Wageningen, Netherlands
- Harm Nijveen
  - Wageningen Seed Lab, Laboratory of Plant Physiology, Wageningen University, Wageningen, Netherlands
  - Laboratory of Bioinformatics, Wageningen University, Wageningen, Netherlands
- Henk W. M. Hilhorst
  - Wageningen Seed Lab, Laboratory of Plant Physiology, Wageningen University, Wageningen, Netherlands
- Wilco Ligterink
  - Wageningen Seed Lab, Laboratory of Plant Physiology, Wageningen University, Wageningen, Netherlands
  - Correspondence: Wilco Ligterink
|
43
|
Rodriguez-Esteban R. Biocuration with insufficient resources and fixed timelines. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2015; 2015:bav116. [PMID: 26708987 PMCID: PMC4691339 DOI: 10.1093/database/bav116] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 08/11/2015] [Accepted: 11/17/2015] [Indexed: 11/14/2022]
Abstract
Biological curation, or biocuration, is often studied from the perspective of creating and maintaining databases that have the goal of mapping and tracking certain areas of biology. However, much biocuration is, in fact, dedicated to finite and time-limited projects in which insufficient resources demand trade-offs. This typically more ephemeral type of curation is nonetheless of importance in biomedical research. Here, I propose a framework to understand such restricted curation projects from the point of view of return on curation (ROC), value, efficiency and productivity. Moreover, I suggest general strategies to optimize these curation efforts, such as the ‘multiple strategies’ approach, as well as a metric called overhead that can be used in the context of managing curation resources.
Affiliation(s)
- Raul Rodriguez-Esteban
  - Roche Pharmaceutical Research and Early Development, pRED Informatics, Roche Innovation Center Basel, Basel 4070, Switzerland
|
44
|
GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains. BIOMED RESEARCH INTERNATIONAL 2015; 2015:918710. [PMID: 26380306 PMCID: PMC4561873 DOI: 10.1155/2015/918710] [Citation(s) in RCA: 111] [Impact Index Per Article: 12.3] [Received: 01/15/2015] [Revised: 04/03/2015] [Accepted: 04/04/2015] [Indexed: 02/01/2023]
Abstract
The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are publicly available and these tools are typically restricted to detecting only mention level gene names or only document level gene identifiers. In this work, we report GNormPlus: an end-to-end and open source system that handles both gene mention and identifier detection. We created a new corpus of 694 PubMed articles to support our development of GNormPlus, containing manual annotations for not only gene names and their identifiers, but also closely related concepts useful for gene name disambiguation, such as gene families and protein domains. GNormPlus integrates several advanced text-mining techniques, including SimConcept for resolving composite gene names. As a result, GNormPlus compares favorably to other state-of-the-art methods when evaluated on two widely used public benchmarking datasets, achieving 86.7% F1-score on the BioCreative II Gene Normalization task dataset and 50.1% F1-score on the BioCreative III Gene Normalization task dataset. The GNormPlus source code and its annotated corpus are freely available, and the results of applying GNormPlus to the entire PubMed are freely accessible through our web-based tool PubTator.
|
45
|
Wei CH, Leaman R, Lu Z. SimConcept: a hybrid approach for simplifying composite named entities in biomedical text. IEEE J Biomed Health Inform 2015; 19:1385-91. [PMID: 25879978 PMCID: PMC4543296 DOI: 10.1109/jbhi.2015.2422651] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 11/05/2022]
Abstract
One particular challenge in biomedical named entity recognition (NER) and normalization is the identification and resolution of composite named entities, where a single span refers to more than one concept (e.g., BRCA1/2). Previous NER and normalization studies have either ignored composite mentions, used simple ad hoc rules, or only handled coordination ellipsis, making a robust approach for handling multitype composite mentions greatly needed. To this end, we propose a hybrid method integrating a machine-learning model with a pattern identification strategy to identify the individual components of each composite mention. Our method, which we have named SimConcept, is the first to systematically handle many types of composite mentions. The technique achieves high performance in identifying and resolving composite mentions for three key biological entities: genes (90.42% in F-measure), diseases (86.47% in F-measure), and chemicals (86.05% in F-measure). Furthermore, our results show that using our SimConcept method can subsequently improve the performance of gene and disease concept recognition and normalization. SimConcept is available for download at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/SimConcept/.
|
46
|
Hsu YY, Kao HY. Curatable Named-Entity Recognition Using Semantic Relations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:785-792. [PMID: 26357317 DOI: 10.1109/tcbb.2014.2366770] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 06/05/2023]
Abstract
Named-entity recognition (NER) plays an important role in the development of biomedical databases. However, the existing NER tools produce multifarious named-entities which may result in both curatable and non-curatable markers. To facilitate biocuration with a straightforward approach, classifying curatable named-entities is helpful with regard to accelerating the biocuration workflow. Co-occurrence Interaction Nexus with Named-entity Recognition (CoINNER) is a web-based tool that allows users to identify genes, chemicals, diseases, and action term mentions in the Comparative Toxicogenomics Database (CTD). To further discover interactions, CoINNER uses multiple advanced algorithms to recognize the mentions in the BioCreative IV CTD Track. CoINNER is developed based on a prototype system that annotated gene, chemical, and disease mentions in PubMed abstracts at BioCreative 2012 Track I (literature triage). We extended our previous system in developing CoINNER. The pre-tagging results of CoINNER were developed based on the state-of-the-art named entity recognition tools in BioCreative III. Next, a method based on conditional random fields (CRFs) is proposed to predict chemical and disease mentions in the articles. Finally, action term mentions were collected by latent Dirichlet allocation (LDA). At the BioCreative IV CTD Track, the best F-measures reached for gene/protein, chemical/drug and disease NER were 54 percent while CoINNER achieved a 61.5 percent F-measure. System URL: http://ikmbio.csie.ncku.edu.tw/coinner/introduction.htm.
|
47
|
Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2015; 17:132-44. [PMID: 25935162 DOI: 10.1093/bib/bbv024] [Citation(s) in RCA: 97] [Impact Index Per Article: 10.8] [Received: 01/29/2015] [Indexed: 11/13/2022]
Abstract
One effective way to improve the state of the art is through competitions. Following the success of the Critical Assessment of protein Structure Prediction (CASP) in bioinformatics research, a number of challenge evaluations have been organized by the text-mining research community to assess and advance natural language processing (NLP) research for biomedicine. In this article, we review the different community challenge evaluations held from 2002 to 2014 and their respective tasks. Furthermore, we examine these challenge tasks through their targeted problems in NLP research and biomedical applications, respectively. Next, we describe the general workflow of organizing a Biomedical NLP (BioNLP) challenge and involved stakeholders (task organizers, task data producers, task participants and end users). Finally, we summarize the impact and contributions by taking into account different BioNLP challenges as a whole, followed by a discussion of their limitations and difficulties. We conclude with future trends in BioNLP challenge evaluations.
|
48
|
Khare R, Good BM, Leaman R, Su AI, Lu Z. Crowdsourcing in biomedicine: challenges and opportunities. Brief Bioinform 2015; 17:23-32. [PMID: 25888696 DOI: 10.1093/bib/bbv021] [Citation(s) in RCA: 71] [Impact Index Per Article: 7.9] [Indexed: 01/06/2023]
Abstract
The use of crowdsourcing to solve important but complex problems in biomedical and clinical sciences is growing and encompasses a wide variety of approaches. The crowd is diverse and includes online marketplace workers, health information seekers, science enthusiasts and domain experts. In this article, we review and highlight recent studies that use crowdsourcing to advance biomedicine. We classify these studies into two broad categories: (i) mining big data generated from a crowd (e.g. search logs) and (ii) active crowdsourcing via specific technical platforms, e.g. labor markets, wikis, scientific games and community challenges. Through describing each study in detail, we demonstrate the applicability of different methods in a variety of domains in biomedical research, including genomics, biocuration and clinical research. Furthermore, we discuss and highlight the strengths and limitations of different crowdsourcing platforms. Finally, we identify important emerging trends, opportunities and remaining challenges for future crowdsourcing research in biomedicine.
|
49
|
Khare R, Burger JD, Aberdeen JS, Tresner-Kirsch DW, Corrales TJ, Hirchman L, Lu Z. Scaling drug indication curation through crowdsourcing. Database (Oxford) 2015; 2015:bav016. [PMID: 25797061 PMCID: PMC4369375 DOI: 10.1093/database/bav016] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 12/06/2014] [Revised: 02/04/2015] [Accepted: 02/09/2015] [Indexed: 01/24/2023]
Abstract
Motivated by the high cost of human curation of biological databases, there is an increasing interest in using computational approaches to assist human curators and accelerate the manual curation process. Towards the goal of cataloging drug indications from FDA drug labels, we recently developed LabeledIn, a human-curated drug indication resource for 250 clinical drugs. Its development required over 40 h of human effort across 20 weeks, despite using well-defined annotation guidelines. In this study, we aim to investigate the feasibility of scaling drug indication annotation through a crowdsourcing technique where an unknown network of workers can be recruited through the technical environment of Amazon Mechanical Turk (MTurk). To translate the expert-curation task of cataloging indications into human intelligence tasks (HITs) suitable for the average workers on MTurk, we first simplify the complex task such that each HIT only involves a worker making a binary judgment of whether a highlighted disease, in context of a given drug label, is an indication. In addition, this study is novel in the crowdsourcing interface design where the annotation guidelines are encoded into user options. For evaluation, we assess the ability of our proposed method to achieve high-quality annotations in a time-efficient and cost-effective manner. We posted over 3000 HITs drawn from 706 drug labels on MTurk. Within 8 h of posting, we collected 18 775 judgments from 74 workers, and achieved an aggregated accuracy of 96% on 450 control HITs (where gold-standard answers are known), at a cost of $1.75 per drug label. On the basis of these results, we conclude that our crowdsourcing approach not only results in significant cost and time saving, but also leads to accuracy comparable to that of domain experts.
Affiliation(s)
- Ritu Khare
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
- John D Burger
- The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA
- John S Aberdeen
- The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA
- Theodore J Corrales
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA; Montgomery Blair High School, 57 University Blvd E., Silver Spring, MD 20901, USA
- Lynette Hirschman
- The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA.
|
50
|
Leaman R, Wei CH, Lu Z. tmChem: a high performance approach for chemical named entity recognition and normalization. J Cheminform 2015; 7:S3. [PMID: 25810774 PMCID: PMC4331693 DOI: 10.1186/1758-2946-7-s1-s3] [Citation(s) in RCA: 126] [Impact Index Per Article: 14.0] [Indexed: 11/30/2022]
Abstract
Chemical compounds and drugs are an important class of entities in biomedical research with great potential in a wide range of applications, including clinical medicine. Locating chemical named entities in the literature is a useful step in chemical text mining pipelines for identifying the chemical mentions, their properties, and their relationships as discussed in the literature. We introduce the tmChem system, a chemical named entity recognizer created by combining two independent machine learning models in an ensemble. We use the corpus released as part of the recent CHEMDNER task to develop and evaluate tmChem, achieving a micro-averaged f-measure of 0.8739 on the CEM subtask (mention-level evaluation) and 0.8745 f-measure on the CDI subtask (abstract-level evaluation). We also report a high-recall combination (0.9212 for CEM and 0.9224 for CDI). tmChem achieved the highest f-measure reported in the CHEMDNER task for the CEM subtask, and the high recall variant achieved the highest recall on both the CEM and CDI tasks. We report that tmChem is a state-of-the-art tool for chemical named entity recognition and that performance for chemical named entity recognition has now tied (or exceeded) the performance previously reported for genes and diseases. Future research should focus on tighter integration between the named entity recognition and normalization steps for improved performance. The source code and a trained model for both models of tmChem are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmChem. The results of running tmChem (Model 2) on PubMed are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator
Affiliation(s)
- Robert Leaman
- National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, Maryland 20894, USA
- Chih-Hsuan Wei
- National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, Maryland 20894, USA
- Zhiyong Lu
- National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, Maryland 20894, USA
|