1
|
Nastou K, Koutrouli M, Pyysalo S, Jensen LJ. Improving dictionary-based named entity recognition with deep learning. Bioinformatics 2024; 40:ii45-ii52. [PMID: 39230709 PMCID: PMC11373323 DOI: 10.1093/bioinformatics/btae402] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/05/2024] Open
Abstract
MOTIVATION Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter are so far created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly. RESULTS In this work, we aim to improve block list s by automatically identifying names to block, based on the context in which they appear. By comparing results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of entity tagged. These were used to generate positive and negative examples of contexts for four entity types (genes, diseases, species, and chemicals), which were used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ∼5.5% on average, and over 8.5% for chemical and 7.5% for gene names, positively affecting several biological databases utilizing this NER system, like the STRING database, with only a minor drop in recall (0.6%). AVAILABILITY AND IMPLEMENTATION All resources are available through Zenodo https://doi.org/10.5281/zenodo.11243139 and GitHub https://doi.org/10.5281/zenodo.10289360.
Collapse
Affiliation(s)
- Katerina Nastou
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3, Copenhagen, 2200, Denmark
| | - Mikaela Koutrouli
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3, Copenhagen, 2200, Denmark
| | - Sampo Pyysalo
- TurkuNLP Group, Department of Computing, University of Turku, Turku, 20014, Finland
| | - Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3, Copenhagen, 2200, Denmark
| |
Collapse
|
2
|
Yang TH, Yu YH, Wu SH, Chang FY, Tsai HC, Yang YC. DMLS: an automated pipeline to extract the Drosophila modular transcription regulators and targets from massive literature articles. Database (Oxford) 2024; 2024:0. [PMID: 38900628 PMCID: PMC11188685 DOI: 10.1093/database/baae049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2023] [Revised: 05/10/2024] [Accepted: 05/31/2024] [Indexed: 06/22/2024]
Abstract
Transcription regulation in multicellular species is mediated by modular transcription factor (TF) binding site combinations termed cis-regulatory modules (CRMs). Such CRM-mediated transcription regulation determines the gene expression patterns during development. Biologists frequently investigate CRM transcription regulation on gene expressions. However, the knowledge of the target genes and regulatory TFs participating in the CRMs under study is mostly fragmentary throughout the literature. Researchers need to afford tremendous human resources to fully surf through the articles deposited in biomedical literature databases in order to obtain the information. Although several novel text-mining systems are now available for literature triaging, these tools do not specifically focus on CRM-related literature prescreening, failing to correctly extract the information of the CRM target genes and regulatory TFs from the literature. For this reason, we constructed a supportive auto-literature prescreener called Drosophila Modular transcription-regulation Literature Screener (DMLS) that achieves the following: (i) prescreens articles describing experiments on modular transcription regulation, (ii) identifies the described target genes and TFs of the CRMs under study for each modular transcription-regulation-describing article and (iii) features an automated and extendable pipeline to perform the task. We demonstrated that the final performance of DMLS in extracting the described target gene and regulatory TF lists of CRMs under study for given articles achieved test macro area under the ROC curve (auROC) = 89.7% and area under the precision-recall curve (auPRC) = 77.6%, outperforming the intuitive gene name-occurrence-counting method by at least 19.9% in auROC and 30.5% in auPRC. The web service and the command line versions of DMLS are available at https://cobis.bme.ncku.edu.tw/DMLS/ and https://github.com/cobisLab/DMLS/, respectively. Database Tool URL: https://cobis.bme.ncku.edu.tw/DMLS/.
Collapse
Affiliation(s)
- Tzu-Hsien Yang
- Department of Biomedical Engineering, National Cheng Kung University, No.1, University Road, Tainan 701, Taiwan
- Medical Device Innovation Center, National Cheng Kung University, No.1, University Road, Tainan 701, Taiwan
| | - Yu-Huai Yu
- Department of Biomedical Engineering, National Cheng Kung University, No.1, University Road, Tainan 701, Taiwan
| | - Sheng-Hang Wu
- Department of Information Management, National Central University, No. 300, Zhongda RD., Zhongli District, Taoyuan 320, Taiwan
| | - Fang-Yuan Chang
- Department of Information Management, National Central University, No. 300, Zhongda RD., Zhongli District, Taoyuan 320, Taiwan
| | - Hsiu-Chun Tsai
- Department of Information Management, National Central University, No. 300, Zhongda RD., Zhongli District, Taoyuan 320, Taiwan
| | - Ya-Chiao Yang
- Institute of Information Management, National Yang Ming Chiao Tung University, No. 1001, Daxue Rd., East Dist., Hsinchu 300093,Taiwan
| |
Collapse
|
3
|
Zhang Q, Weng L, Li J. The evolution of intracranial aneurysm research from 2012 to 2021: Global productivity and publication trends. Front Neurol 2022; 13:953285. [PMID: 36247771 PMCID: PMC9554263 DOI: 10.3389/fneur.2022.953285] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2022] [Accepted: 07/01/2022] [Indexed: 11/13/2022] Open
Abstract
Background This study aimed to analyze the global research trends and map the knowledge network of intracranial aneurysm (IA) research in the last 10 years. Methods Publications related to IA from 2012 to 2021 were retrieved from the Web of Science core collection. Microsoft Excel 2010 and VOSviewer were used to characterize the largest contributors, including authors, journals, institutions, and countries. CiteSpace and VOSviewer were adopted to analyze the trends and knowledge network of IA. Results A total of 5,406 publications related to IA from 2012 to 2021 were identified, increasing from 344 in 2012 to 762 in 2021. Siddiqui, AH from the USA contributed the most publications. Papers published in the journal World Neurosurgery ranked first in quantity, while Stroke ranked first for total citations and citations per publication. The top three prolific institutions were Capital Medical University, Mayo Clinic, and the University Department of Neurology Utrecht from 2012 to 2021. Moreover, the USA held the greatest share in the field, and China was almost on par with the USA due to its rapid growth. Specifically, the most frequently covered topics over the recent decade were subarachnoid hemorrhage, endovascular treatment (EVT), clipping, vascular disorders, flow diverter, stent, delayed cerebral ischemia, inflammation, and hemodynamics. Conclusion The contribution made by different countries, institutions, journals, and authors for IA research over the past decade was demonstrated in the paper. The main topics include the choice of EVT or surgical clipping, particularly the application of flow diverter and associated complications, while themes such as the etiopathogenetic features of IA (e.g., inflammation and hemodynamics) deserve more attention.
Collapse
Affiliation(s)
- Qian Zhang
- Department of Neurosurgery, Xiangya Hospital, Central South University, Changsha, China
| | - Ling Weng
- Department of Neurology, Xiangya Hospital, Central South University, Changsha, China
- National Clinical Research Center for Geriatric Disorders, Central South University, Changsha, China
| | - Jian Li
- Department of Neurosurgery, Xiangya Hospital, Central South University, Changsha, China
- Hydrocephalus Center, Xiangya Hospital, Central South University, Changsha, China
- *Correspondence: Jian Li
| |
Collapse
|
4
|
Yang TH, Wang CY, Tsai HC, Yang YC, Liu CT. YTLR: Extracting yeast transcription factor-gene associations from the literature using automated literature readers. Comput Struct Biotechnol J 2022; 20:4636-4644. [PMID: 36090812 PMCID: PMC9449546 DOI: 10.1016/j.csbj.2022.08.041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2022] [Revised: 08/16/2022] [Accepted: 08/18/2022] [Indexed: 11/19/2022] Open
Abstract
Cells adapt to environmental stresses mainly via transcription reprogramming. Correct transcription control is mediated by the interactions between transcription factors (TF) and their target genes. These TF-gene associations can be probed by chromatin immunoprecipitation techniques and knockout experiments, revealing TF binding (TFB) and regulatory (TFR) evidence, respectively. Nevertheless, most evidence is still fragmentary in the literature and requires tremendous human resources to curate. We developed the first pipeline called YTLR (Yeast Transcription-regulation Literature Reader) to automate TF-gene relation extraction from the literature. YTLR first identifies articles with TFB and TFR information. Then TF-gene binding pairs are extracted from the TFB articles, and TF-gene regulatory associations are recognized from the TFR papers. On gathered test sets, YTLR achieves an AUC value of 98.8% in identifying articles with TFB evidence and AUC = 83.4% in extracting the detailed TF-gene binding pairs. And similarly, YTLR also obtains an AUC value of 98.2% in identifying TFR articles and AUC = 80.4% in extracting the detailed TF-gene regulatory associations. Furthermore, YTLR outperforms previous methods in both tasks. To facilitate researchers in extracting TF-gene transcriptional relations from large-scale queried articles, an automated and easy-to-use software tool based on the YTLR pipeline is constructed. In summary, YTLR aims to provide easier literature pre-screening for curators and help researchers gather yeast TF-gene transcriptional relation conclusions from articles in a high-throughput fashion. The YTLR pipeline software tool can be downloaded at https://github.com/cobisLab/YTLR/.
Collapse
Affiliation(s)
- Tzu-Hsien Yang
- Department of Biomedical Engineering, National Cheng Kung University, No.1, University Road, Tainan 701, Taiwan
- Corresponding author.
| | - Chung-Yu Wang
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan
| | - Hsiu-Chun Tsai
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan
| | - Ya-Chiao Yang
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan
| | - Cheng-Tse Liu
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan
| |
Collapse
|
5
|
Arumugam K, Shanker RR. Text Mining and Machine Learning Protocol for Extracting Human-Related Protein Phosphorylation Information from PubMed. Methods Mol Biol 2022; 2496:159-177. [PMID: 35713864 DOI: 10.1007/978-1-0716-2305-3_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
In the modern health care research, protein phosphorylation has gained an enormous attention from the researchers across the globe and requires automated approaches to process a huge volume of data on proteins and their modifications at the cellular level. The data generated at the cellular level is unique as well as arbitrary, and an accumulation of massive volume of information is inevitable. Biological research has revealed that a huge array of cellular communication aided by protein phosphorylation and other similar mechanisms imply different and diverse meanings. This led to a collection of huge volume of data to understand the biological functions of human evolution, especially for combating diseases in a better way. Text mining, an automated approach to mine the information from an unstructured data, finds its application in extracting protein phosphorylation information from the biomedical literature databases such as PubMed. This chapter outlines a recent text mining protocol that applies natural language parsing (NLP) for named entity recognition and text processing, and support vector machines (SVM), a machine learning algorithm for classifying the processed text related human protein phosphorylation. We discuss on evaluating the text mining system which is the outcome of the protocol on three corpora, namely, human Protein Phosphorylation (hPP) corpus, Integrated Protein Literature Information and Knowledge corpus (iProLink), and Phosphorylation Literature corpus (PLC). We also present a basic understanding on the chemistry and biology that drive the protein phosphorylation process in a human body. We believe that this basic understanding will be useful to advance the existing text mining systems for extracting protein phosphorylation information from PubMed.
Collapse
Affiliation(s)
- Krishnamurthy Arumugam
- Department of Management Studies, Coimbatore Institute of Engineering and Technology, Coimbatore, Tamilnadu, India.
| | - Raja Ravi Shanker
- International Business Unit, Alembic Pharmaceuticals Limited, Vadodara, Gujarat, India
| |
Collapse
|
6
|
Frisoni G, Moro G, Carlassare G, Carbonaro A. Unsupervised Event Graph Representation and Similarity Learning on Biomedical Literature. SENSORS (BASEL, SWITZERLAND) 2021; 22:3. [PMID: 35009544 PMCID: PMC8747118 DOI: 10.3390/s22010003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/08/2021] [Revised: 12/08/2021] [Accepted: 12/09/2021] [Indexed: 06/14/2023]
Abstract
The automatic extraction of biomedical events from the scientific literature has drawn keen interest in the last several years, recognizing complex and semantically rich graphical interactions otherwise buried in texts. However, very few works revolve around learning embeddings or similarity metrics for event graphs. This gap leaves biological relations unlinked and prevents the application of machine learning techniques to promote discoveries. Taking advantage of recent deep graph kernel solutions and pre-trained language models, we propose Deep Divergence Event Graph Kernels (DDEGK), an unsupervised inductive method to map events into low-dimensional vectors, preserving their structural and semantic similarities. Unlike most other systems, DDEGK operates at a graph level and does not require task-specific labels, feature engineering, or known correspondences between nodes. To this end, our solution compares events against a small set of anchor ones, trains cross-graph attention networks for drawing pairwise alignments (bolstering interpretability), and employs transformer-based models to encode continuous attributes. Extensive experiments have been done on nine biomedical datasets. We show that our learned event representations can be effectively employed in tasks such as graph classification, clustering, and visualization, also facilitating downstream semantic textual similarity. Empirical results demonstrate that DDEGK significantly outperforms other state-of-the-art methods.
Collapse
Affiliation(s)
- Giacomo Frisoni
- Department of Computer Science and Engineering (DISI), University of Bologna, 40126 Bologna, Italy; (G.F.); (G.M.)
| | - Gianluca Moro
- Department of Computer Science and Engineering (DISI), University of Bologna, 40126 Bologna, Italy; (G.F.); (G.M.)
| | | | - Antonella Carbonaro
- Department of Computer Science and Engineering (DISI), University of Bologna, 40126 Bologna, Italy; (G.F.); (G.M.)
| |
Collapse
|
7
|
Holtzapple E, Telmer CA, Miskov-Zivanov N. FLUTE: Fast and reliable knowledge retrieval from biomedical literature. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2021; 2020:5880277. [PMID: 32761077 PMCID: PMC7408180 DOI: 10.1093/database/baaa056] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/25/2019] [Revised: 05/21/2020] [Accepted: 07/02/2020] [Indexed: 12/14/2022]
Abstract
State-of-the-art machine reading methods extract, in hours, hundreds of thousands of events from the biomedical literature. However, many of the extracted biomolecular interactions are incorrect or not relevant for computational modeling of a system of interest. Therefore, rapid, automated methods are required to filter and select accurate and useful information. The FiLter for Understanding True Events (FLUTE) tool uses public protein interaction databases to filter interactions that have been extracted by machines from databases such as PubMed and score them for accuracy. Confidence in the interactions allows for rapid and accurate model assembly. As our results show, FLUTE can reliably determine the confidence in the biomolecular interactions extracted by fast machine readers and at the same time provide a speedup in interaction filtering by three orders of magnitude. Database URL: https://bitbucket.org/biodesignlab/flute.
Collapse
Affiliation(s)
- Emilee Holtzapple
- Department of Computational and Systems Biology, University of Pittsburgh, 3501 Fifth Ave, Pittsburgh, Pennsylvania 15213, USA
| | - Cheryl A Telmer
- Molecular Biosensor and Imagining Center, Carnegie Mellon University, 4400 Fifth Ave, Pittsburgh, Pennsylvania 15213, USA
| | - Natasa Miskov-Zivanov
- Department of Computational and Systems Biology, University of Pittsburgh, 3501 Fifth Ave, Pittsburgh, Pennsylvania 15213, USA.,Department of Electrical and Computer Engineering, University of Pittsburgh, 3700 O'Hara St, Pittsburgh, Pennsylvania 15261, USA.,Department of Bioengineering, University of Pittsburgh, 300 Technology Dr, Pittsburgh 15213, USA
| |
Collapse
|
8
|
Usmani S, Shamsi JA. News sensitive stock market prediction: literature review and suggestions. PeerJ Comput Sci 2021; 7:e490. [PMID: 34013029 PMCID: PMC8114814 DOI: 10.7717/peerj-cs.490] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2020] [Accepted: 03/23/2021] [Indexed: 06/12/2023]
Abstract
Stock market prediction is a challenging task as it requires deep insights for extraction of news events, analysis of historic data, and impact of news events on stock price trends. The challenge is further exacerbated due to the high volatility of stock price trends. However, a detailed overview that discusses the overall context of stock prediction is elusive in literature. To address this research gap, this paper presents a detailed survey. All key terms and phases of generic stock prediction methodology along with challenges, are described. A detailed literature review that covers data preprocessing techniques, feature extraction techniques, prediction techniques, and future directions is presented for news sensitive stock prediction. This work investigates the significance of using structured text features rather than unstructured and shallow text features. It also discusses the use of opinion extraction techniques. In addition, it emphasizes the use of domain knowledge with both approaches of textual feature extraction. Furthermore, it highlights the significance of deep neural network based prediction techniques to capture the hidden relationship between textual and numerical data. This survey is significant and novel as it elaborates a comprehensive framework for stock market prediction and highlights the strengths and weaknesses of existing approaches. It presents a wide range of open issues and research directions that are beneficial for the research community.
Collapse
Affiliation(s)
- Shazia Usmani
- Systems Research Laboratory, FAST-National University of Computer and Emerging Sciences, Karachi, Pakistan
| | - Jawwad A. Shamsi
- Systems Research Laboratory, FAST-National University of Computer and Emerging Sciences, Karachi, Pakistan
| |
Collapse
|
9
|
Exploiting dependency information to improve biomedical event detection via gated polar attention mechanism. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.09.020] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
|
10
|
Kanerva J, Ginter F, Pyysalo S. Dependency parsing of biomedical text with BERT. BMC Bioinformatics 2020; 21:580. [PMID: 33372589 PMCID: PMC7771067 DOI: 10.1186/s12859-020-03905-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2020] [Accepted: 11/24/2020] [Indexed: 11/23/2022] Open
Abstract
Background: Syntactic analysis, or parsing, is a key task in natural language processing and a required component for many text mining approaches. In recent years, Universal Dependencies (UD) has emerged as the leading formalism for dependency parsing. While a number of recent tasks centering on UD have substantially advanced the state of the art in multilingual parsing, there has been only little study of parsing texts from specialized domains such as biomedicine. Methods: We explore the application of state-of-the-art neural dependency parsing methods to biomedical text using the recently introduced CRAFT-SA shared task dataset. The CRAFT-SA task broadly follows the UD representation and recent UD task conventions, allowing us to fine-tune the UD-compatible Turku Neural Parser and UDify neural parsers to the task. We further evaluate the effect of transfer learning using a broad selection of BERT models, including several models pre-trained specifically for biomedical text processing. Results: We find that recently introduced neural parsing technology is capable of generating highly accurate analyses of biomedical text, substantially improving on the best performance reported in the original CRAFT-SA shared task. We also find that initialization using a deep transfer learning model pre-trained on in-domain texts is key to maximizing the performance of the parsing methods.
Collapse
Affiliation(s)
- Jenna Kanerva
- TurkuNLP Group, University of Turku, Turku, Finland.
| | - Filip Ginter
- TurkuNLP Group, University of Turku, Turku, Finland
| | | |
Collapse
|
11
|
Joint event extraction along shortest dependency paths using graph convolutional networks. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106492] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
12
|
Yan S, Wong KC. Context awareness and embedding for biomedical event extraction. Bioinformatics 2020; 36:637-643. [PMID: 31392318 DOI: 10.1093/bioinformatics/btz607] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2018] [Revised: 07/26/2019] [Accepted: 08/06/2019] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Biomedical event extraction is fundamental for information extraction in molecular biology and biomedical research. The detected events form the central basis for comprehensive biomedical knowledge fusion, facilitating the digestion of massive information influx from the literature. Limited by the event context, the existing event detection models are mostly applicable for a single task. A general and scalable computational model is desiderated for biomedical knowledge management. RESULTS We consider and propose a bottom-up detection framework to identify the events from recognized arguments. To capture the relations between the arguments, we trained a bidirectional long short-term memory network to model their context embedding. Leveraging the compositional attributes, we further derived the candidate samples for training event classifiers. We built our models on the datasets from BioNLP Shared Task for evaluations. Our method achieved the average F-scores of 0.81 and 0.92 on BioNLPST-BGI and BioNLPST-BB datasets, respectively. Comparing with seven state-of-the-art methods, our method nearly doubled the existing F-score performance (0.92 versus 0.56) on the BioNLPST-BB dataset. Case studies were conducted to reveal the underlying reasons. AVAILABILITY AND IMPLEMENTATION https://github.com/cskyan/evntextrc. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Shankai Yan
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR 999077
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR 999077
| |
Collapse
|
13
|
Kongburan W, Padungweang P, Krathu W, Chan JH. Enhancing metabolic event extraction performance with multitask learning concept. J Biomed Inform 2019; 93:103156. [PMID: 30902595 DOI: 10.1016/j.jbi.2019.103156] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2018] [Revised: 03/12/2019] [Accepted: 03/19/2019] [Indexed: 11/29/2022]
Abstract
To extract and generate a valid metabolic pathway from research articles, biologists need substantial amounts of time to digest unstructured text. Text mining currently plays a central role in this research area, because it provides the ability to automatically discover useful information in a reasonable time. A text mining model can be built using a training data or a corpus in supervised manner. Unfortunately, a corpus of the domain of interest may not be always available or insufficient in practice, because a corpus construction is a labor-intensive task and needs specialist annotation. In this paper, we developed an event extraction system, a text-mining task, to extract metabolic interactions from research literature and then reconstruct metabolic pathways. The proposed system consists of the pipeline of four supervised-learning steps: named entity recognition, trigger detection, edge detection, and event reconstruction. We also introduced a multitask-learning algorithm, a transfer-learning paradigm, that can leverage additional resources of an existing source domain to facilitate a classification of the metabolic event extraction in the target domain. To demonstrate a proof of concept, edge detection, a core step in our event extraction system, was used as a case study in multitask-learning classification. The experimental results showed that the proposed event extraction system provided competitive performance against those of state-of-the-art related system. In particular, the proposed multitask-learning can improve the performance of edge detection, therefore the overall performance of the event extraction system was also improved accordingly.
Collapse
Affiliation(s)
- Wutthipong Kongburan
- Data Science and Engineering Laboratory, School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand.
| | - Praisan Padungweang
- Data Science and Engineering Laboratory, School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand.
| | - Worarat Krathu
- Data Science and Engineering Laboratory, School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand.
| | - Jonathan H Chan
- Data Science and Engineering Laboratory, School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand.
| |
Collapse
|
14
|
Jiang N, Rong W, Nie Y, Shen Y, Xiong Z. Biological Event Trigger Identification with Noise Contrastive Estimation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1549-1559. [PMID: 30296207 DOI: 10.1109/tcbb.2017.2710048] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Biological Event Extraction is an important task towards the goal of extracting biomedical knowledge from the scientific publications by capturing biomedical entities and their complex relations from the texts. As a crucial step in event extraction, event trigger identification, assigning words with suitable trigger category, has recently attracted substantial attention. As triggers are scattered in large corpus, traditional linguistic parsers are hard to generate syntactic features from them. Thereby, trigger sparsity problem restricts the model's learning process and becomes one of the main hinder in trigger identification. In this paper, we employ Noise Contrastive Estimation with Multi-Layer Perceptron model for solving triggers' sparsity problem. Meanwhile, in the light of recent advance in word distributed representation, word-embedding feature generated by language model is utilized for semantic and syntactic information extraction. Finally, experimental study on commonly used MLEE dataset against baseline methods has demonstrated its promising result.
Collapse
|
15
|
Raja K, Natarajan J. Mining protein phosphorylation information from biomedical literature using NLP parsing and Support Vector Machines. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2018; 160:57-64. [PMID: 29728247 DOI: 10.1016/j.cmpb.2018.03.022] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/11/2016] [Revised: 02/23/2018] [Accepted: 03/22/2018] [Indexed: 06/08/2023]
Abstract
BACKGROUND Extraction of protein phosphorylation information from biomedical literature has gained much attention because of the importance in numerous biological processes. OBJECTIVE In this study, we propose a text mining methodology which consists of two phases, NLP parsing and SVM classification to extract phosphorylation information from literature. METHODS First, using NLP parsing we divide the data into three base-forms depending on the biomedical entities related to phosphorylation and further classify into ten sub-forms based on their distribution with phosphorylation keyword. Next, we extract the phosphorylation entity singles/pairs/triplets and apply SVM to classify the extracted singles/pairs/triplets using a set of features applicable to each sub-form. RESULTS The performance of our methodology was evaluated on three corpora namely PLC, iProLink and hPP corpus. We obtained promising results of >85% F-score on ten sub-forms of training datasets on cross validation test. Our system achieved overall F-score of 93.0% on iProLink and 96.3% on hPP corpus test datasets. Furthermore, our proposed system achieved best performance on cross corpus evaluation and outperformed the existing system with recall of 90.1%. CONCLUSIONS The performance analysis of our unique system on three corpora reveals that it extracts protein phosphorylation information efficiently in both non-organism specific general datasets such as PLC and iProLink, and human specific dataset such as hPP corpus.
Collapse
Affiliation(s)
- Kalpana Raja
- Data Mining and Text Mining Laboratory, Department of Bioinformatics, School of Life Sciences, Bharathiar University, Coimbatore 641046, India.
| | - Jeyakumar Natarajan
- Data Mining and Text Mining Laboratory, Department of Bioinformatics, School of Life Sciences, Bharathiar University, Coimbatore 641046, India.
| |
Collapse
|
16
|
Song M, Kim M, Kang K, Kim YH, Jeon S. Application of Public Knowledge Discovery Tool (PKDE4J) to Represent Biomedical Scientific Knowledge. Front Res Metr Anal 2018. [DOI: 10.3389/frma.2018.00007] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open
|
17
|
Zerva C, Batista-Navarro R, Day P, Ananiadou S. Using uncertainty to link and rank evidence from biomedical literature for model curation. Bioinformatics 2017; 33:3784-3792. [PMID: 29036627 PMCID: PMC5860317 DOI: 10.1093/bioinformatics/btx466] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2017] [Revised: 06/27/2017] [Accepted: 07/21/2017] [Indexed: 11/20/2022] Open
Abstract
MOTIVATION In recent years, there has been great progress in the field of automated curation of biomedical networks and models, aided by text mining methods that provide evidence from literature. Such methods must not only extract snippets of text that relate to model interactions, but also be able to contextualize the evidence and provide additional confidence scores for the interaction in question. Although various approaches calculating confidence scores have focused primarily on the quality of the extracted information, there has been little work on exploring the textual uncertainty conveyed by the author. Despite textual uncertainty being acknowledged in biomedical text mining as an attribute of text mined interactions (events), it is significantly understudied as a means of providing a confidence measure for interactions in pathways or other biomedical models. In this work, we focus on improving identification of textual uncertainty for events and explore how it can be used as an additional measure of confidence for biomedical models. RESULTS We present a novel method for extracting uncertainty from the literature using a hybrid approach that combines rule induction and machine learning. Variations of this hybrid approach are then discussed, alongside their advantages and disadvantages. We use subjective logic theory to combine multiple uncertainty values extracted from different sources for the same interaction. Our approach achieves F-scores of 0.76 and 0.88 based on the BioNLP-ST and Genia-MK corpora, respectively, making considerable improvements over previously published work. Moreover, we evaluate our proposed system on pathways related to two different areas, namely leukemia and melanoma cancer research. AVAILABILITY AND IMPLEMENTATION The leukemia pathway model used is available in Pathway Studio while the Ras model is available via PathwayCommons. Online demonstration of the uncertainty extraction system is available for research purposes at http://argo.nactem.ac.uk/test. The related code is available on https://github.com/c-zrv/uncertainty_components.git. Details on the above are available in the Supplementary Material. CONTACT sophia.ananiadou@manchester.ac.uk. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Chrysoula Zerva
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
| | - Riza Batista-Navarro
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
| | - Philip Day
- Manchester Institute of Biotechnology, The University of Manchester, Manchester, UK
| | - Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
| |
Collapse
|
18
|
Metabolic Pathway Mining. Methods Mol Biol 2016. [PMID: 27896740 DOI: 10.1007/978-1-4939-6613-4_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register]
Abstract
Understanding metabolic pathways is one of the most important fields in bioscience in the post-genomic era, but curating metabolic pathways requires considerable man-power. As such there is a lack of reliable, experimentally verified metabolic pathways in databases and databases are forced to predict all but the most immediately useful pathways.Text-mining has the potential to solve this problem, but while sophisticated text-mining methods have been developed to assist the curation of many types of biomedical networks, such as protein-protein interaction networks, the mining of metabolic pathways from the literature has been largely neglected by the text-mining community. In this chapter we describe a pipeline for the extraction of metabolic pathways built on freely available open-source components and a heuristic metabolic reaction extraction algorithm.
Collapse
|
19
|
Wu C, Schwartz JM, Brabant G, Peng SL, Nenadic G. Constructing a molecular interaction network for thyroid cancer via large-scale text mining of gene and pathway events. BMC SYSTEMS BIOLOGY 2015; 9 Suppl 6:S5. [PMID: 26679379 PMCID: PMC4674859 DOI: 10.1186/1752-0509-9-s6-s5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/04/2022]
Abstract
Background Biomedical studies need assistance from automated tools and easily accessible data to address the problem of the rapidly accumulating literature. Text-mining tools and curated databases have been developed to address such needs and they can be applied to improve the understanding of molecular pathogenesis of complex diseases like thyroid cancer. Results We have developed a system, PWTEES, which extracts pathway interactions from the literature utilizing an existing event extraction tool (TEES) and pathway named entity recognition (PathNER). We then applied the system on a thyroid cancer corpus and systematically extracted molecular interactions involving either genes or pathways. With the extracted information, we constructed a molecular interaction network taking genes and pathways as nodes. Using curated pathway information and network topological analyses, we highlight key genes and pathways involved in thyroid carcinogenesis. Conclusions Mining events involving genes and pathways from the literature and integrating curated pathway knowledge can help improve the understanding of molecular interactions of complex diseases. The system developed for this study can be applied in studies other than thyroid cancer. The source code is freely available online at https://github.com/chengkun-wu/PWTEES.
Collapse
|
20
|
An Overview of Biomolecular Event Extraction from Scientific Documents. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2015; 2015:571381. [PMID: 26587051 PMCID: PMC4637451 DOI: 10.1155/2015/571381] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/13/2015] [Revised: 08/10/2015] [Accepted: 08/18/2015] [Indexed: 01/09/2023]
Abstract
This paper presents a review of state-of-the-art approaches to automatic extraction of biomolecular events from scientific texts. Events involving biomolecules such as genes, transcription factors, or enzymes, for example, have a central role in biological processes and functions and provide valuable information for describing physiological and pathogenesis mechanisms. Event extraction from biomedical literature has a broad range of applications, including support for information retrieval, knowledge summarization, and information extraction and discovery. However, automatic event extraction is a challenging task due to the ambiguity and diversity of natural language and higher-level linguistic phenomena, such as speculations and negations, which occur in biological texts and can lead to misunderstanding or incorrect interpretation. Many strategies have been proposed in the last decade, originating from different research areas such as natural language processing, machine learning, and statistics. This review summarizes the most representative approaches in biomolecular event extraction and presents an analysis of the current state of the art and of commonly used methods, features, and tools. Finally, current research trends and future perspectives are also discussed.
Collapse
|
21
|
Pyysalo S, Ohta T, Rak R, Rowley A, Chun HW, Jung SJ, Choi SP, Tsujii J, Ananiadou S. Overview of the Cancer Genetics and Pathway Curation tasks of BioNLP Shared Task 2013. BMC Bioinformatics 2015; 16 Suppl 10:S2. [PMID: 26202570 PMCID: PMC4511510 DOI: 10.1186/1471-2105-16-s10-s2] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND Since their introduction in 2009, the BioNLP Shared Task events have been instrumental in advancing the development of methods and resources for the automatic extraction of information from the biomedical literature. In this paper, we present the Cancer Genetics (CG) and Pathway Curation (PC) tasks, two event extraction tasks introduced in the BioNLP Shared Task 2013. The CG task focuses on cancer, emphasizing the extraction of physiological and pathological processes at various levels of biological organization, and the PC task targets reactions relevant to the development of biomolecular pathway models, defining its extraction targets on the basis of established pathway representations and ontologies. RESULTS Six groups participated in the CG task and two groups in the PC task, together applying a wide range of extraction approaches including both established state-of-the-art systems and newly introduced extraction methods. The best-performing systems achieved F-scores of 55% on the CG task and 53% on the PC task, demonstrating a level of performance comparable to the best results achieved in similar previously proposed tasks. CONCLUSIONS The results indicate that existing event extraction technology can generalize to meet the novel challenges represented by the CG and PC task settings, suggesting that extraction methods are capable of supporting the construction of knowledge bases on the molecular mechanisms of cancer and the curation of biomolecular pathway models. The CG and PC tasks continue as open challenges for all interested parties, with data, tools and resources available from the shared task homepage.
Collapse
Affiliation(s)
- Sampo Pyysalo
- Department of Information technology, University of Turku, Turku, Finland
| | | | - Rafal Rak
- National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester, UK
| | - Andrew Rowley
- National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester, UK
| | - Hong-Woo Chun
- Software Research Center, Korea Institute of Science and Technology Information (KISTI), Daejeon, South Korea
| | - Sung-Jae Jung
- Software Research Center, Korea Institute of Science and Technology Information (KISTI), Daejeon, South Korea
- Department of Applied Information Science, University of Science and Technology (UST), Daejeon, South Korea
| | - Sung-Pil Choi
- Department of Library and Information Science, Kyonggi University, Suwon, South Korea
| | | | - Sophia Ananiadou
- National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester, UK
| |
Collapse
|
22
|
Ananiadou S, Thompson P, Nawaz R, McNaught J, Kell DB. Event-based text mining for biology and functional genomics. Brief Funct Genomics 2015; 14:213-30. [PMID: 24907365 PMCID: PMC4499874 DOI: 10.1093/bfgp/elu015] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023] Open
Abstract
The assessment of genome function requires a mapping between genome-derived entities and biochemical reactions, and the biomedical literature represents a rich source of information about reactions between biological components. However, the increasingly rapid growth in the volume of literature provides both a challenge and an opportunity for researchers to isolate information about reactions of interest in a timely and efficient manner. In response, recent text mining research in the biology domain has been largely focused on the identification and extraction of 'events', i.e. categorised, structured representations of relationships between biochemical entities, from the literature. Functional genomics analyses necessarily encompass events as so defined. Automatic event extraction systems facilitate the development of sophisticated semantic search applications, allowing researchers to formulate structured queries over extracted events, so as to specify the exact types of reactions to be retrieved. This article provides an overview of recent research into event extraction. We cover annotated corpora on which systems are trained, systems that achieve state-of-the-art performance and details of the community shared tasks that have been instrumental in increasing the quality, coverage and scalability of recent systems. Finally, several concrete applications of event extraction are covered, together with emerging directions of research.
Collapse
|
23
|
Munkhdalai T, Namsrai OE, Ryu K. Self-training in significance space of support vectors for imbalanced biomedical event data. BMC Bioinformatics 2015; 16 Suppl 7:S6. [PMID: 25952719 PMCID: PMC4423724 DOI: 10.1186/1471-2105-16-s7-s6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background Pairwise relationships extracted from biomedical literature are insufficient in formulating biomolecular interactions. Extraction of complex relations (namely, biomedical events) has become the main focus of the text-mining community. However, there are two critical issues that are seldom dealt with by existing systems. First, an annotated corpus for training a prediction model is highly imbalanced. Second, supervised models trained on only a single annotated corpus can limit system performance. Fortunately, there is a large pool of unlabeled data containing much of the domain background that one can exploit. Results In this study, we develop a new semi-supervised learning method to address the issues outlined above. The proposed algorithm efficiently exploits the unlabeled data to leverage system performance. We furthermore extend our algorithm to a two-phase learning framework. The first phase balances the training data for initial model induction. The second phase incorporates domain knowledge into the event extraction model. The effectiveness of our method is evaluated on the Genia event extraction corpus and a PubMed document pool. Our method can identify a small subset of the majority class, which is sufficient for building a well-generalized prediction model. It outperforms the traditional self-training algorithm in terms of f-measure. Our model, based on the training data and the unlabeled data pool, achieves comparable performance to the state-of-the-art systems that are trained on a larger annotated set consisting of training and evaluation data.
Collapse
|
24
|
Munkhdalai T, Li M, Batsuren K, Park HA, Choi NH, Ryu KH. Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations. J Cheminform 2015; 7:S9. [PMID: 25810780 PMCID: PMC4331699 DOI: 10.1186/1758-2946-7-s1-s9] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
Background Chemical and biomedical Named Entity Recognition (NER) is an essential prerequisite task before effective text mining can begin for biochemical-text data. Exploiting unlabeled text data to leverage system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature. We present a semi-supervised learning method that efficiently exploits unlabeled data in order to incorporate domain knowledge into a named entity recognition model and to leverage system performance. The proposed method includes Natural Language Processing (NLP) tasks for text preprocessing, learning word representation features from a large amount of text data for feature extraction, and conditional random fields for token classification. Other than the free text in the domain, the proposed method does not rely on any lexicon nor any dictionary in order to keep the system applicable to other NER tasks in bio-text data. Results We extended BANNER, a biomedical NER system, with the proposed method. This yields an integrated system that can be applied to chemical and drug NER or biomedical NER. We call our branch of the BANNER system BANNER-CHEMDNER, which is scalable over millions of documents, processing about 530 documents per minute, is configurable via XML, and can be plugged into other systems by using the BANNER Unstructured Information Management Architecture (UIMA) interface. BANNER-CHEMDNER achieved an 85.68% and an 86.47% F-measure on the testing sets of CHEMDNER Chemical Entity Mention (CEM) and Chemical Document Indexing (CDI) subtasks, respectively, and achieved an 87.04% F-measure on the official testing set of the BioCreative II gene mention task, showing remarkable performance in both chemical and biomedical NER. BANNER-CHEMDNER system is available at: https://bitbucket.org/tsendeemts/banner-chemdner.
Collapse
Affiliation(s)
- Tsendsuren Munkhdalai
- Database/Bioinformatics Laboratory, School of Electrical & Computer Engineering, Chungbuk National University, Cheongju, South Korea
| | - Meijing Li
- Database/Bioinformatics Laboratory, School of Electrical & Computer Engineering, Chungbuk National University, Cheongju, South Korea
| | - Khuyagbaatar Batsuren
- Database/Bioinformatics Laboratory, School of Electrical & Computer Engineering, Chungbuk National University, Cheongju, South Korea
| | - Hyeon Ah Park
- Database/Bioinformatics Laboratory, School of Electrical & Computer Engineering, Chungbuk National University, Cheongju, South Korea
| | - Nak Hyeon Choi
- Database/Bioinformatics Laboratory, School of Electrical & Computer Engineering, Chungbuk National University, Cheongju, South Korea
| | - Keun Ho Ryu
- Database/Bioinformatics Laboratory, School of Electrical & Computer Engineering, Chungbuk National University, Cheongju, South Korea
| |
Collapse
|
25
|
Nie Y, Rong W, Zhang Y, Ouyang Y, Xiong Z. Embedding assisted prediction architecture for event trigger identification. J Bioinform Comput Biol 2015; 13:1541001. [PMID: 25669328 DOI: 10.1142/s0219720015410012] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Molecular events normally have significant meanings since they describe important biological interactions or alternations such as binding of a protein. As a crucial step of biological event extraction, event trigger identification has attracted much attention and many methods have been proposed. Traditionally those methods can be categorised into rule-based approach and machine learning approach and machine learning-based approaches have demonstrated its potential and outperformed rule-based approaches in many situations. However, machine learning-based approaches still face several challenges among which a notable one is how to model semantic and syntactic information of different words and incorporate it into the prediction model. There exist many ways to model semantic and syntactic information, among which word embedding is an effective one. Therefore, in order to address this challenge, in this study, a word embedding assisted neural network prediction model is proposed to conduct event trigger identification. The experimental study on commonly used dataset has shown its potential. It is believed that this study could offer researchers insights into semantic-aware solutions for event trigger identification.
Collapse
Affiliation(s)
- Yifan Nie
- Sino-French Engineer School, Beihang University, Beijing 100191, P. R. China
| | | | | | | | | |
Collapse
|
26
|
Wu C, Schwartz JM, Brabant G, Nenadic G. Molecular profiling of thyroid cancer subtypes using large-scale text mining. BMC Med Genomics 2014; 7 Suppl 3:S3. [PMID: 25521965 PMCID: PMC4290788 DOI: 10.1186/1755-8794-7-s3-s3] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
Background Thyroid cancer is the most common endocrine tumor with a steady increase in incidence. It is classified into multiple histopathological subtypes with potentially distinct molecular mechanisms. Identifying the most relevant genes and biological pathways reported in the thyroid cancer literature is vital for understanding of the disease and developing targeted therapeutics. Results We developed a large-scale text mining system to generate a molecular profiling of thyroid cancer subtypes. The system first uses a subtype classification method for the thyroid cancer literature, which employs a scoring scheme to assign different subtypes to articles. We evaluated the classification method on a gold standard derived from the PubMed Supplementary Concept annotations, achieving a micro-average F1-score of 85.9% for primary subtypes. We then used the subtype classification results to extract genes and pathways associated with different thyroid cancer subtypes and successfully unveiled important genes and pathways, including some instances that are missing from current manually annotated databases or most recent review articles. Conclusions Identification of key genes and pathways plays a central role in understanding the molecular biology of thyroid cancer. An integration of subtype context can allow prioritized screening for diagnostic biomarkers and novel molecular targeted therapeutics. Source code used for this study is made freely available online at https://github.com/chengkun-wu/GenesThyCan.
Collapse
|
27
|
RAVIKUMAR K, WAGHOLIKAR KAVISHWARB, LIU HONGFANG. Towards pathway curation through literature mining--a case study using PharmGKB. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2014:352-363. [PMID: 24297561 PMCID: PMC3902837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]
Abstract
The creation of biological pathway knowledge bases is largely driven by manual effort to curate based on evidences from the scientific literature. It is highly challenging for the curators to keep up with the literature. Text mining applications have been developed in the last decade to assist human curators to speed up the curation pace where majority of them aim to identify the most relevant papers for curation with little attempt to directly extract the pathway information from text. In this paper, we describe a rule-based literature mining system to extract pathway information from text. We evaluated the system using curated pharmacokinetic (PK) and pharmacodynamic (PD) pathways in PharmGKB. The system achieved an F-measure of 63.11% and 34.99% for entity extraction and event extraction respectively against all PubMed abstracts cited in PharmGKB. It may be possible to improve the system performance by incorporating using statistical machine learning approaches. This study also helped us gain insights into the barriers towards automated event extraction from text for pathway curation.
Collapse
|
28
|
Haverinen K, Nyblom J, Viljanen T, Laippala V, Kohonen S, Missilä A, Ojala S, Salakoski T, Ginter F. Building the essential resources for Finnish: the Turku Dependency Treebank. LANG RESOUR EVAL 2013. [DOI: 10.1007/s10579-013-9244-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|
29
|
Pyysalo S, Ohta T, Miwa M, Cho HC, Tsujii J, Ananiadou S. Event extraction across multiple levels of biological organization. ACTA ACUST UNITED AC 2013; 28:i575-i581. [PMID: 22962484 PMCID: PMC3436834 DOI: 10.1093/bioinformatics/bts407] [Citation(s) in RCA: 61] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023]
Abstract
Motivation: Event extraction using expressive structured representations has been a significant focus of recent efforts in biomedical information extraction. However, event extraction resources and methods have so far focused almost exclusively on molecular-level entities and processes, limiting their applicability. Results: We extend the event extraction approach to biomedical information extraction to encompass all levels of biological organization from the molecular to the whole organism. We present the ontological foundations, target types and guidelines for entity and event annotation and introduce the new multi-level event extraction (MLEE) corpus, manually annotated using a structured representation for event extraction. We further adapt and evaluate named entity and event extraction methods for the new task, demonstrating that both can be achieved with performance broadly comparable with that for established molecular entity and event extraction tasks. Availability: The resources and methods introduced in this study are available from http://nactem.ac.uk/MLEE/. Contact:pyysalos@cs.man.ac.uk Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Sampo Pyysalo
- National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester, UK.
| | | | | | | | | | | |
Collapse
|
30
|
Biomolecular event trigger detection using neighborhood hash features. J Theor Biol 2013; 318:22-8. [DOI: 10.1016/j.jtbi.2012.10.030] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2012] [Revised: 10/23/2012] [Accepted: 10/24/2012] [Indexed: 11/27/2022]
|
31
|
Raja K, Subramani S, Natarajan J. PPInterFinder--a mining tool for extracting causal relations on human proteins from literature. Database (Oxford) 2013; 2013:bas052. [PMID: 23325628 PMCID: PMC3548331 DOI: 10.1093/database/bas052] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2012] [Revised: 11/16/2012] [Accepted: 11/20/2012] [Indexed: 12/20/2022]
Abstract
One of the most common and challenging problem in biomedical text mining is to mine protein-protein interactions (PPIs) from MEDLINE abstracts and full-text research articles because PPIs play a major role in understanding the various biological processes and the impact of proteins in diseases. We implemented, PPInterFinder--a web-based text mining tool to extract human PPIs from biomedical literature. PPInterFinder uses relation keyword co-occurrences with protein names to extract information on PPIs from MEDLINE abstracts and consists of three phases. First, it identifies the relation keyword using a parser with Tregex and a relation keyword dictionary. Next, it automatically identifies the candidate PPI pairs with a set of rules related to PPI recognition. Finally, it extracts the relations by matching the sentence with a set of 11 specific patterns based on the syntactic nature of PPI pair. We find that PPInterFinder is capable of predicting PPIs with the accuracy of 66.05% on AIMED corpus and outperforms most of the existing systems. DATABASE URL: http://www.biomining-bu.in/ppinterfinder/
Collapse
Affiliation(s)
| | | | - Jeyakumar Natarajan
- Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, 641046, Tamil Nadu, India
| |
Collapse
|
32
|
Abstract
New approaches to biomedical text mining crucially depend on the existence of comprehensive annotated corpora. Such corpora, commonly called gold standards, are important for learning patterns or models during the training phase, for evaluating and comparing the performance of algorithms and also for better understanding the information sought for by means of examples. Gold standards depend on human understanding and manual annotation of natural language text. This process is very time-consuming and expensive because it requires high intellectual effort from domain experts. Accordingly, the lack of gold standards is considered as one of the main bottlenecks for developing novel text mining methods. This situation led the development of tools that support humans in annotating texts. Such tools should be intuitive to use, should support a range of different input formats, should include visualization of annotated texts and should generate an easy-to-parse output format. Today, a range of tools which implement some of these functionalities are available. In this survey, we present a comprehensive survey of tools for supporting annotation of biomedical texts. Altogether, we considered almost 30 tools, 13 of which were selected for an in-depth comparison. The comparison was performed using predefined criteria and was accompanied by hands-on experiences whenever possible. Our survey shows that current tools can support many of the tasks in biomedical text annotation in a satisfying manner, but also that no tool can be considered as a true comprehensive solution.
Collapse
Affiliation(s)
- Mariana Neves
- Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, 12489 Berlin, Germany.
| | | |
Collapse
|
33
|
Pyysalo S, Ohta T, Rak R, Sullivan D, Mao C, Wang C, Sobral B, Tsujii J, Ananiadou S. Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011. BMC Bioinformatics 2012; 13 Suppl 11:S2. [PMID: 22759456 PMCID: PMC3384257 DOI: 10.1186/1471-2105-13-s11-s2] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open
Abstract
We present the preparation, resources, results and analysis of three tasks of the BioNLP Shared Task 2011: the main tasks on Infectious Diseases (ID) and Epigenetics and Post-translational Modifications (EPI), and the supporting task on Entity Relations (REL). The two main tasks represent extensions of the event extraction model introduced in the BioNLP Shared Task 2009 (ST'09) to two new areas of biomedical scientific literature, each motivated by the needs of specific biocuration tasks. The ID task concerns the molecular mechanisms of infection, virulence and resistance, focusing in particular on the functions of a class of signaling systems that are ubiquitous in bacteria. The EPI task is dedicated to the extraction of statements regarding chemical modifications of DNA and proteins, with particular emphasis on changes relating to the epigenetic control of gene expression. By contrast to these two application-oriented main tasks, the REL task seeks to support extraction in general by separating challenges relating to part-of relations into a subproblem that can be addressed by independent systems. Seven groups participated in each of the two main tasks and four groups in the supporting task. The participating systems indicated advances in the capability of event extraction methods and demonstrated generalization in many aspects: from abstracts to full texts, from previously considered subdomains to new ones, and from the ST'09 extraction targets to other entities and events. The highest performance achieved in the supporting task REL, 58% F-score, is broadly comparable with levels reported for other relation extraction tasks. For the ID task, the highest-performing system achieved 56% F-score, comparable to the state-of-the-art performance at the established ST'09 task. In the EPI task, the best result was 53% F-score for the full set of extraction targets and 69% F-score for a reduced set of core extraction targets, approaching a level of performance sufficient for user-facing applications. In this study, we extend on previously reported results and perform further analyses of the outputs of the participating systems. We place specific emphasis on aspects of system performance relating to real-world applicability, considering alternate evaluation metrics and performing additional manual analysis of system outputs. We further demonstrate that the strengths of extraction systems can be combined to improve on the performance achieved by any system in isolation. The manually annotated corpora, supporting resources, and evaluation tools for all tasks are available from http://www.bionlp-st.org and the tasks continue as open challenges for all interested parties.
Collapse
Affiliation(s)
- Sampo Pyysalo
- School of Computer Science, University of Manchester, Manchester, UK
- National Centre for Text Mining, University of Manchester, Manchester, UK
| | - Tomoko Ohta
- Department of Computer Science, University of Tokyo, Tokyo, Japan
| | - Rafal Rak
- School of Computer Science, University of Manchester, Manchester, UK
- National Centre for Text Mining, University of Manchester, Manchester, UK
| | - Dan Sullivan
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
| | - Chunhong Mao
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
| | - Chunxia Wang
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
| | - Bruno Sobral
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
| | | | - Sophia Ananiadou
- School of Computer Science, University of Manchester, Manchester, UK
- National Centre for Text Mining, University of Manchester, Manchester, UK
| |
Collapse
|
34
|
Kim JD, Nguyen N, Wang Y, Tsujii J, Takagi T, Yonezawa A. The Genia Event and Protein Coreference tasks of the BioNLP Shared Task 2011. BMC Bioinformatics 2012; 13 Suppl 11:S1. [PMID: 22759455 PMCID: PMC3384256 DOI: 10.1186/1471-2105-13-s11-s1] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND The Genia task, when it was introduced in 2009, was the first community-wide effort to address a fine-grained, structural information extraction from biomedical literature. Arranged for the second time as one of the main tasks of BioNLP Shared Task 2011, it aimed to measure the progress of the community since 2009, and to evaluate generalization of the technology to full text papers. The Protein Coreference task was arranged as one of the supporting tasks, motivated from one of the lessons of the 2009 task that the abundance of coreference structures in natural language text hinders further improvement with the Genia task. RESULTS The Genia task received final submissions from 15 teams. The results show that the community has made a significant progress, marking 74% of the best F-score in extracting bio-molecular events of simple structure, e.g., gene expressions, and 45% ~ 48% in extracting those of complex structure, e.g., regulations. The Protein Coreference task received 6 final submissions. The results show that the coreference resolution performance in biomedical domain is lagging behind that in newswire domain, cf. 50% vs. 66% in MUC score. Particularly, in terms of protein coreference resolution the best system achieved 34% in F-score. CONCLUSIONS Detailed analysis performed on the results improves our insight into the problem and suggests the directions for further improvements.
Collapse
Affiliation(s)
- Jin-Dong Kim
- Database Center for Life Science, Research Organization of Information and Science, 2-11-16 Yayoi, Bunkyo-ku, Tokyo, Japan
| | - Ngan Nguyen
- Department of Information science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
| | - Yue Wang
- Database Center for Life Science, Research Organization of Information and Science, 2-11-16 Yayoi, Bunkyo-ku, Tokyo, Japan
| | - Jun'ichi Tsujii
- Microsoft Research Asia, 5 Dan Ling Street, Haidian District, Beijing, China
| | - Toshihisa Takagi
- Department of Computational Biology, University of Tokyo, 5-1-5 Kashiwa-no-ha, Kashiwa, Chiba, Japan
| | - Akinori Yonezawa
- Database Center for Life Science, Research Organization of Information and Science, 2-11-16 Yayoi, Bunkyo-ku, Tokyo, Japan
| |
Collapse
|
35
|
Gerner M, Sarafraz F, Bergman CM, Nenadic G. BioContext: an integrated text mining system for large-scale extraction and contextualization of biomolecular events. Bioinformatics 2012; 28:2154-61. [PMID: 22711795 PMCID: PMC3413385 DOI: 10.1093/bioinformatics/bts332] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Motivation: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extract information about biological events are focused on either limited-scale studies and/or abstracts, with data extracted lacking context and rarely available to support further research. Results: Here we present BioContext, an integrated TM system which extracts, extends and integrates results from a number of tools performing entity recognition, biomolecular event extraction and contextualization. Application of our system to 10.9 million MEDLINE abstracts and 234 000 open-access full-text articles from PubMed Central yielded over 36 million mentions representing 11.4 million distinct events. Event participants included over 290 000 distinct genes/proteins that are mentioned more than 80 million times and linked where possible to Entrez Gene identifiers. Over a third of events contain contextual information such as the anatomical location of the event occurrence or whether the event is reported as negated or speculative. Availability: The BioContext pipeline is available for download (under the BSD license) at http://www.biocontext.org, along with the extracted data which is also available for online browsing. Contact:martin.gerner@gmail.com Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Martin Gerner
- Faculty of Life Sciences, University of Manchester, Manchester M13 9PT, UK.
| | | | | | | |
Collapse
|
36
|
Van Landeghem S, Hakala K, Rönnqvist S, Salakoski T, Van de Peer Y, Ginter F. Exploring Biomolecular Literature with EVEX: Connecting Genes through Events, Homology, and Indirect Associations. Adv Bioinformatics 2012; 2012:582765. [PMID: 22719757 PMCID: PMC3375141 DOI: 10.1155/2012/582765] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2011] [Revised: 03/16/2012] [Accepted: 03/28/2012] [Indexed: 01/20/2023] Open
Abstract
Technological advancements in the field of genetics have led not only to an abundance of experimental data, but also caused an exponential increase of the number of published biomolecular studies. Text mining is widely accepted as a promising technique to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation, regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such as coregulators.
Collapse
Affiliation(s)
- Sofie Van Landeghem
- Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium
| | - Kai Hakala
- Department of Information Technology, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
| | - Samuel Rönnqvist
- Department of Information Technology, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
| | - Tapio Salakoski
- Department of Information Technology, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
- Turku BioNLP Group, Turku Centre for Computer Science (TUCS), Joukahaisenkatu 3-5, 20520 Turku, Finland
| | - Yves Van de Peer
- Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium
| | - Filip Ginter
- Department of Information Technology, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
| |
Collapse
|
37
|
Miwa M, Thompson P, McNaught J, Kell DB, Ananiadou S. Extracting semantically enriched events from biomedical literature. BMC Bioinformatics 2012; 13:108. [PMID: 22621266 PMCID: PMC3464657 DOI: 10.1186/1471-2105-13-108] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2011] [Accepted: 05/23/2012] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND Research into event-based text mining from the biomedical literature has been growing in popularity to facilitate the development of advanced biomedical text mining systems. Such technology permits advanced search, which goes beyond document or sentence-based retrieval. However, existing event-based systems typically ignore additional information within the textual context of events that can determine, amongst other things, whether an event represents a fact, hypothesis, experimental result or analysis of results, whether it describes new or previously reported knowledge, and whether it is speculated or negated. We refer to such contextual information as meta-knowledge. The automatic recognition of such information can permit the training of systems allowing finer-grained searching of events according to the meta-knowledge that is associated with them. RESULTS Based on a corpus of 1,000 MEDLINE abstracts, fully manually annotated with both events and associated meta-knowledge, we have constructed a machine learning-based system that automatically assigns meta-knowledge information to events. This system has been integrated into EventMine, a state-of-the-art event extraction system, in order to create a more advanced system (EventMine-MK) that not only extracts events from text automatically, but also assigns five different types of meta-knowledge to these events. The meta-knowledge assignment module of EventMine-MK performs with macro-averaged F-scores in the range of 57-87% on the BioNLP'09 Shared Task corpus. EventMine-MK has been evaluated on the BioNLP'09 Shared Task subtask of detecting negated and speculated events. Our results show that EventMine-MK can outperform other state-of-the-art systems that participated in this task. CONCLUSIONS We have constructed the first practical system that extracts both events and associated, detailed meta-knowledge information from biomedical literature. The automatically assigned meta-knowledge information can be used to refine search systems, in order to provide an extra search layer beyond entities and assertions, dealing with phenomena such as rhetorical intent, speculations, contradictions and negations. This finer grained search functionality can assist in several important tasks, e.g., database curation (by locating new experimental knowledge) and pathway enrichment (by providing information for inference). To allow easy integration into text mining systems, EventMine-MK is provided as a UIMA component that can be used in the interoperable text mining infrastructure, U-Compare.
Collapse
Affiliation(s)
- Makoto Miwa
- The National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK
- School of Computer Science and the Manchester Interdisciplinary Biocentre, University of Manchester, Manchester, M1 7DN, UK
| | - Paul Thompson
- The National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK
- School of Computer Science and the Manchester Interdisciplinary Biocentre, University of Manchester, Manchester, M1 7DN, UK
| | - John McNaught
- The National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK
- School of Computer Science and the Manchester Interdisciplinary Biocentre, University of Manchester, Manchester, M1 7DN, UK
| | - Douglas B Kell
- School of Chemistry and the Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK
| | - Sophia Ananiadou
- The National Centre for Text Mining, Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK
- School of Computer Science and the Manchester Interdisciplinary Biocentre, University of Manchester, Manchester, M1 7DN, UK
| |
Collapse
|
38
|
Miwa M, Thompson P, Ananiadou S. Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics 2012; 28:1759-65. [PMID: 22539668 PMCID: PMC3381963 DOI: 10.1093/bioinformatics/bts237] [Citation(s) in RCA: 70] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
Motivation: In recent years, several biomedical event extraction (EE) systems have been developed. However, the nature of the annotated training corpora, as well as the training process itself, can limit the performance levels of the trained EE systems. In particular, most event-annotated corpora do not deal adequately with coreference. This impacts on the trained systems' ability to recognize biomedical entities, thus affecting their performance in extracting events accurately. Additionally, the fact that most EE systems are trained on a single annotated corpus further restricts their coverage. Results: We have enhanced our existing EE system, EventMine, in two ways. First, we developed a new coreference resolution (CR) system and integrated it with EventMine. The standalone performance of our CR system in resolving anaphoric references to proteins is considerably higher than the best ranked system in the COREF subtask of the BioNLP'11 Shared Task. Secondly, the improved EventMine incorporates domain adaptation (DA) methods, which extend EE coverage by allowing several different annotated corpora to be used during training. Combined with a novel set of methods to increase the generality and efficiency of EventMine, the integration of both CR and DA have resulted in significant improvements in EE, ranging between 0.5% and 3.4% F-Score. The enhanced EventMine outperforms the highest ranked systems from the BioNLP'09 shared task, and from the GENIA and Infectious Diseases subtasks of the BioNLP'11 shared task. Availability: The improved version of EventMine, incorporating the CR system and DA methods, is available at: http://www.nactem.ac.uk/EventMine/. Contact:makoto.miwa@manchester.ac.uk
Collapse
Affiliation(s)
- Makoto Miwa
- The National Centre for Text Mining (NaCTeM), UK.
| | | | | |
Collapse
|
39
|
Jamieson DG, Gerner M, Sarafraz F, Nenadic G, Robertson DL. Towards semi-automated curation: using text mining to recreate the HIV-1, human protein interaction database. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2012; 2012:bas023. [PMID: 22529179 PMCID: PMC3332570 DOI: 10.1093/database/bas023] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/17/2023]
Abstract
Manual curation has long been used for extracting key information found within the primary literature for input into biological databases. The human immunodeficiency virus type 1 (HIV-1), human protein interaction database (HHPID), for example, contains 2589 manually extracted interactions, linked to 14,312 mentions in 3090 articles. The advancement of text-mining (TM) techniques has offered a possibility to rapidly retrieve such data from large volumes of text to a high degree of accuracy. Here, we present a recreation of the HHPID using the current state of the art in TM. To retrieve interactions, we performed gene/protein named entity recognition (NER) and applied two molecular event extraction tools on all abstracts and titles cited in the HHPID. Our best NER scores for precision, recall and F-score were 87.5%, 90.0% and 88.6%, respectively, while event extraction achieved 76.4%, 84.2% and 80.1%, respectively. We demonstrate that over 50% of the HHPID interactions can be recreated from abstracts and titles. Furthermore, from 49 available open-access full-text articles, we extracted a total of 237 unique HIV-1-human interactions, as opposed to 187 interactions recorded in the HHPID from the same articles. On average, we extracted 23 times more mentions of interactions and events from a full-text article than from an abstract and title, with a 6-fold increase in the number of unique interactions. We further demonstrated that more frequently occurring interactions extracted by TM are more likely to be true positives. Overall, the results demonstrate that TM was able to recover a large proportion of interactions, many of which were found within the HHPID, making TM a useful assistant in the manual curation process. Finally, we also retrieved other types of interactions in the context of HIV-1 that are not currently present in the HHPID, thus, expanding the scope of this data set. All data is available at http://gnode1.mib.man.ac.uk/HIV1-text-mining.
Collapse
Affiliation(s)
- Daniel G Jamieson
- Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester, UK
| | | | | | | | | |
Collapse
|
40
|
Context-specific protein network miner--an online system for exploring context-specific protein interaction networks from the literature. PLoS One 2012; 7:e34480. [PMID: 22493694 PMCID: PMC3321019 DOI: 10.1371/journal.pone.0034480] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2011] [Accepted: 03/05/2012] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Protein interaction networks (PINs) specific within a particular context contain crucial information regarding many cellular biological processes. For example, PINs may include information on the type and directionality of interaction (e.g. phosphorylation), location of interaction (i.e. tissues, cells), and related diseases. Currently, very few tools are capable of deriving context-specific PINs for conducting exploratory analysis. RESULTS We developed a literature-based online system, Context-specific Protein Network Miner (CPNM), which derives context-specific PINs in real-time from the PubMed database based on a set of user-input keywords and enhanced PubMed query system. CPNM reports enriched information on protein interactions (with type and directionality), their network topology with summary statistics (e.g. most densely connected proteins in the network; most densely connected protein-pairs; and proteins connected by most inbound/outbound links) that can be explored via a user-friendly interface. Some of the novel features of the CPNM system include PIN generation, ontology-based PubMed query enhancement, real-time, user-queried, up-to-date PubMed document processing, and prediction of PIN directionality. CONCLUSIONS CPNM provides a tool for biologists to explore PINs. It is freely accessible at http://www.biotextminer.com/CPNM/.
Collapse
|
41
|
Barbosa-Silva A, Fontaine JF, Donnard ER, Stussi F, Ortega JM, Andrade-Navarro MA. PESCADOR, a web-based tool to assist text-mining of biointeractions extracted from PubMed queries. BMC Bioinformatics 2011; 12:435. [PMID: 22070195 PMCID: PMC3228910 DOI: 10.1186/1471-2105-12-435] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2011] [Accepted: 11/09/2011] [Indexed: 12/03/2022] Open
Abstract
Background Biological function is greatly dependent on the interactions of proteins with other proteins and genes. Abstracts from the biomedical literature stored in the NCBI's PubMed database can be used for the derivation of interactions between genes and proteins by identifying the co-occurrences of their terms. Often, the amount of interactions obtained through such an approach is large and may mix processes occurring in different contexts. Current tools do not allow studying these data with a focus on concepts of relevance to a user, for example, interactions related to a disease or to a biological mechanism such as protein aggregation. Results To help the concept-oriented exploration of such data we developed PESCADOR, a web tool that extracts a network of interactions from a set of PubMed abstracts given by a user, and allows filtering the interaction network according to user-defined concepts. We illustrate its use in exploring protein aggregation in neurodegenerative disease and in the expansion of pathways associated to colon cancer. Conclusions PESCADOR is a platform independent web resource available at: http://cbdm.mdc-berlin.de/tools/pescador/
Collapse
Affiliation(s)
- Adriano Barbosa-Silva
- Computational Biology and Data Mining Group, Max Delbrück Center for Molecular Medicine, Robert-Rössle-Strasse, 10, D-13125, Berlin, Germany.
| | | | | | | | | | | |
Collapse
|
42
|
Tsuruoka Y, Miwa M, Hamamoto K, Tsujii J, Ananiadou S. Discovering and visualizing indirect associations between biomedical concepts. Bioinformatics 2011; 27:i111-9. [PMID: 21685059 PMCID: PMC3117364 DOI: 10.1093/bioinformatics/btr214] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022] Open
Abstract
Motivation: Discovering useful associations between biomedical concepts has been one of the main goals in biomedical text-mining, and understanding their biomedical contexts is crucial in the discovery process. Hence, we need a text-mining system that helps users explore various types of (possibly hidden) associations in an easy and comprehensible manner. Results: This article describes FACTA+, a real-time text-mining system for finding and visualizing indirect associations between biomedical concepts from MEDLINE abstracts. The system can be used as a text search engine like PubMed with additional features to help users discover and visualize indirect associations between important biomedical concepts such as genes, diseases and chemical compounds. FACTA+ inherits all functionality from its predecessor, FACTA, and extends it by incorporating three new features: (i) detecting biomolecular events in text using a machine learning model, (ii) discovering hidden associations using co-occurrence statistics between concepts, and (iii) visualizing associations to improve the interpretability of the output. To the best of our knowledge, FACTA+ is the first real-time web application that offers the functionality of finding concepts involving biomolecular events and visualizing indirect associations of concepts with both their categories and importance. Availability: FACTA+ is available as a web application at http://refine1-nactem.mc.man.ac.uk/facta/, and its visualizer is available at http://refine1-nactem.mc.man.ac.uk/facta-visualizer/. Contact:tsuruoka@jaist.ac.jp
Collapse
Affiliation(s)
- Yoshimasa Tsuruoka
- School of Information Science, Japan Advanced Institute of Science and Technology (JAIST), Nomi, Japan.
| | | | | | | | | |
Collapse
|
43
|
Thompson P, McNaught J, Montemagni S, Calzolari N, del Gratta R, Lee V, Marchi S, Monachini M, Pezik P, Quochi V, Rupp CJ, Sasaki Y, Venturi G, Rebholz-Schuhmann D, Ananiadou S. The BioLexicon: a large-scale terminological resource for biomedical text mining. BMC Bioinformatics 2011; 12:397. [PMID: 21992002 PMCID: PMC3228855 DOI: 10.1186/1471-2105-12-397] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2011] [Accepted: 10/12/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events. RESULTS This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized) together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts. In order to foster interoperability, the BioLexicon is modelled using the Lexical Markup Framework, an ISO standard. CONCLUSIONS The BioLexicon contains over 2.2 M lexical entries and over 1.8 M terminological variants, as well as over 3.3 M semantic relations, including over 2 M synonymy relations. Its exploitation can benefit both application developers and users. We demonstrate some such benefits by describing integration of the resource into a number of different tools, and evaluating improvements in performance that this can bring.
Collapse
Affiliation(s)
- Paul Thompson
- School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL, UK.
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
44
|
Abstract
BACKGROUND Event extraction following the GENIA Event corpus and BioNLP shared task models has been a considerable focus of recent work in biomedical information extraction. This work includes efforts applying event extraction methods to the entire PubMed literature database, far beyond the narrow subdomains of biomedicine for which annotated resources for extraction method development are available. RESULTS In the present study, our aim is to estimate the coverage of all statements of gene/protein associations in PubMed that existing resources for event extraction can provide. We base our analysis on a recently released corpus automatically annotated for gene/protein entities and syntactic analyses covering the entire PubMed, and use named entity co-occurrence, shortest dependency paths and an unlexicalized classifier to identify likely statements of gene/protein associations. A set of high-frequency/high-likelihood association statements are then manually analyzed with reference to the GENIA ontology. CONCLUSIONS We present a first estimate of the overall coverage of gene/protein associations provided by existing resources for event extraction. Our results suggest that for event-type associations this coverage may be over 90%. We also identify several biologically significant associations of genes and proteins that are not addressed by these resources, suggesting directions for further extension of extraction coverage.
Collapse
Affiliation(s)
- Sampo Pyysalo
- Department of Computer Science, University of Tokyo, Tokyo, Japan
| | - Tomoko Ohta
- Department of Computer Science, University of Tokyo, Tokyo, Japan
| | - Jun’ichi Tsujii
- Department of Computer Science, University of Tokyo, Tokyo, Japan
- School of Computer Science, University of Manchester, Manchester, UK
- National Centre for Text Mining, University of Manchester, Manchester, UK
| |
Collapse
|
45
|
Abstract
Background We consider the task of automatically extracting DNA methylation events from the biomedical domain literature. DNA methylation is a key mechanism of epigenetic control of gene expression and implicated in many cancers, but there has been little study of automatic information extraction for DNA methylation. Results We present an annotation scheme for DNA methylation following the representation of the BioNLP shared task on event extraction, select a set of 200 abstracts including a representative sample of all PubMed citations relevant to DNA methylation, and introduce manual annotation for this corpus marking nearly 3000 gene/protein mentions and 1500 DNA methylation and demethylation events. We retrain a state-of-the-art event extraction system on the corpus and find that automatic extraction of DNA methylation events, the methylated genes, and their methylation sites can be performed at 78% precision and 76% recall. Conclusions Our results demonstrate that reliable extraction methods for DNA methylation events can be created through corpus annotation and straightforward retraining of a general event extraction system. The introduced resources are freely available for use in research from the GENIA project homepage http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA.
Collapse
|
46
|
Abstract
BACKGROUND Identifying protein-protein interactions (PPIs) from literature is an important step in mining the function of individual proteins as well as their biological network. Since it is known that PPIs have distinctive patterns in text, machine learning approaches have been successfully applied to mine these patterns. However, the complex nature of PPI description makes the extraction process difficult. RESULTS Our approach utilizes both word and syntactic features to effectively capture PPI patterns from biomedical literature. The proposed method automatically identifies gene names by a Priority Model, then extracts grammar relations using a dependency parser. A large margin classifier with Huber loss function learns from the extracted features, and unknown articles are predicted using this data-driven model. For the BioCreative III ACT evaluation, our official runs were ranked in top positions by obtaining maximum 89.15% accuracy, 61.42% F1 score, 0.55306 MCC score, and 67.98% AUC iP/R score. CONCLUSIONS Even though problems still remain, utilizing syntactic information for article-level filtering helps improve PPI ranking performance. The proposed system is a revision of previously developed algorithms in our group for the ACT evaluation. Our approach is valuable in showing how to use grammatical relations for PPI article filtering, in particular, with a limited training corpus. While current performance is far from satisfactory as an annotation tool, it is already useful for a PPI article search engine since users are mainly focused on highly-ranked results.
Collapse
Affiliation(s)
- Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| |
Collapse
|
47
|
Bell L, Chowdhary R, Liu JS, Niu X, Zhang J. Integrated bio-entity network: a system for biological knowledge discovery. PLoS One 2011; 6:e21474. [PMID: 21738677 PMCID: PMC3124513 DOI: 10.1371/journal.pone.0021474] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2011] [Accepted: 06/01/2011] [Indexed: 01/26/2023] Open
Abstract
A significant part of our biological knowledge is centered on relationships between biological entities (bio-entities) such as proteins, genes, small molecules, pathways, gene ontology (GO) terms and diseases. Accumulated at an increasing speed, the information on bio-entity relationships is archived in different forms at scattered places. Most of such information is buried in scientific literature as unstructured text. Organizing heterogeneous information in a structured form not only facilitates study of biological systems using integrative approaches, but also allows discovery of new knowledge in an automatic and systematic way. In this study, we performed a large scale integration of bio-entity relationship information from both databases containing manually annotated, structured information and automatic information extraction of unstructured text in scientific literature. The relationship information we integrated in this study includes protein–protein interactions, protein/gene regulations, protein–small molecule interactions, protein–GO relationships, protein–pathway relationships, and pathway–disease relationships. The relationship information is organized in a graph data structure, named integrated bio-entity network (IBN), where the vertices are the bio-entities and edges represent their relationships. Under this framework, graph theoretic algorithms can be designed to perform various knowledge discovery tasks. We designed breadth-first search with pruning (BFSP) and most probable path (MPP) algorithms to automatically generate hypotheses—the indirect relationships with high probabilities in the network. We show that IBN can be used to generate plausible hypotheses, which not only help to better understand the complex interactions in biological systems, but also provide guidance for experimental designs.
Collapse
Affiliation(s)
- Lindsey Bell
- Department of Statistics, Florida State University, Tallahassee, Florida, United States of America
| | | | | | | | | |
Collapse
|
48
|
Islamaj Doğan R, Névéol A, Lu Z. A context-blocks model for identifying clinical relationships in patient records. BMC Bioinformatics 2011; 12 Suppl 3:S3. [PMID: 21658290 PMCID: PMC3111589 DOI: 10.1186/1471-2105-12-s3-s3] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022] Open
Abstract
Background Patient records contain valuable information regarding explanation of diagnosis, progression of disease, prescription and/or effectiveness of treatment, and more. Automatic recognition of clinically important concepts and the identification of relationships between those concepts in patient records are preliminary steps for many important applications in medical informatics, ranging from quality of care to hypothesis generation. Methods In this work we describe an approach that facilitates the automatic recognition of eight relationships defined between medical problems, treatments and tests. Unlike the traditional bag-of-words representation, in this work, we represent a relationship with a scheme of five distinct context-blocks determined by the position of concepts in the text. As a preliminary step to relationship recognition, and in order to provide an end-to-end system, we also addressed the automatic extraction of medical problems, treatments and tests. Our approach combined the outcome of a statistical model for concept recognition and simple natural language processing features in a conditional random fields model. A set of 826 patient records from the 4th i2b2 challenge was used for training and evaluating the system. Results Results show that our concept recognition system achieved an F-measure of 0.870 for exact span concept detection. Moreover the context-block representation of relationships was more successful (F-Measure = 0.775) at identifying relationships than bag-of-words (F-Measure = 0.402). Most importantly, the performance of the end-to-end system of relationship extraction using automatically extracted concepts (F-Measure = 0.704) was comparable to that obtained using manually annotated concepts (F-Measure = 0.711), and their difference was not statistically significant. Conclusions We extracted important clinical relationships from text in an automated manner, starting with concept recognition, and ending with relationship identification. The advantage of the context-blocks representation scheme was the correct management of word position information, which may be critical in identifying certain relationships. Our results may serve as benchmark for comparison to other systems developed on i2b2 challenge data. Finally, our system may serve as a preliminary step for other discovery tasks in medical informatics.
Collapse
Affiliation(s)
- Rezarta Islamaj Doğan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA.
| | | | | |
Collapse
|
49
|
Doğan RI, Névéol A, Lu Z. A textual representation scheme for identifying clinical relationships in patient records. PROCEEDINGS OF THE ... INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS. INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS 2010; 2010:995-998. [PMID: 21552455 DOI: 10.1109/icmla.2010.164] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
The identification of relationships between clinical concepts in patient records is a preliminary step for many important applications in medical informatics, ranging from quality of care to hypothesis generation. In this work we describe an approach that facilitates the automatic recognition of relationships defined between two different concepts in text. Unlike the traditional bag-of-words representation, in this work, a relationship is represented with a scheme of five distinct context-blocks based on the position of concepts in the text. This scheme was applied to eight different relationships, between medical problems, treatments and tests, on a set of 349 patient records from the 4th i2b2 challenge. Results show that the context-block representation was very successful (F-Measure = 0.775) compared to the bag-of-words model (F-Measure = 0.402). The advantage of this representation scheme was the correct management of word position information, which may be critical in identifying certain relationships.
Collapse
|
50
|
Bui QC, Katrenko S, Sloot PMA. A hybrid approach to extract protein-protein interactions. Bioinformatics 2010; 27:259-65. [PMID: 21062765 DOI: 10.1093/bioinformatics/btq620] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Protein-protein interactions (PPIs) play an important role in understanding biological processes. Although recent research in text mining has achieved a significant progress in automatic PPI extraction from literature, performance of existing systems still needs to be improved. RESULTS In this study, we propose a novel algorithm for extracting PPIs from literature which consists of two phases. First, we automatically categorize the data into subsets based on its semantic properties and extract candidate PPI pairs from these subsets. Second, we apply support vector machines (SVMs) to classify candidate PPI pairs using features specific for each subset. We obtain promising results on five benchmark datasets: AIMed, BioInfer, HPRD50, IEPA and LLL with F-scores ranging from 60% to 84%, which are comparable with the state-of-the-art PPI extraction systems. Furthermore, our system achieves the best performance on cross-corpora evaluation and comparative performance in terms of computational efficiency. AVAILABILITY The source code and scripts used in this article are available for academic use at http://staff.science.uva.nl/~bui/PPIs.zip CONTACT bqchinh@gmail.com.
Collapse
Affiliation(s)
- Quoc-Chinh Bui
- Computational Science, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands.
| | | | | |
Collapse
|