1
Sosa DN, Altman RB. Contexts and contradictions: a roadmap for computational drug repurposing with knowledge inference. Brief Bioinform 2022; 23:bbac268. [PMID: 35817308 PMCID: PMC9294417 DOI: 10.1093/bib/bbac268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2022] [Revised: 05/25/2022] [Accepted: 06/07/2022] [Indexed: 11/30/2022] Open
Abstract
The cost of drug development continues to rise and may be prohibitive in cases of unmet clinical need, particularly for rare diseases. Artificial intelligence-based methods are promising in their potential to discover new treatment options. The task of drug repurposing hypothesis generation is well-posed as a link prediction problem in a knowledge graph (KG) of interacting drugs, proteins, genes and disease phenotypes. KGs derived from biomedical literature are semantically rich and up-to-date representations of scientific knowledge. Inference methods on scientific KGs can be confounded by unspecified contexts and contradictions. Extracting context enables incorporation of relevant pharmacokinetic and pharmacodynamic detail, such as tissue specificity of interactions. Contradictions in biomedical KGs may arise when contexts are omitted or due to contradicting research claims. In this review, we describe challenges to creating literature-scale representations of pharmacological knowledge and survey current approaches toward incorporating context and resolving contradictions.
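As a rough illustration of the link prediction framing in this abstract, the sketch below scores candidate drug-disease links in a toy KG by counting two-hop paths that run from a drug through a gene to a disease. The triples, the relation names `targets` and `associated_with`, and the path-counting heuristic are all illustrative assumptions; they are not taken from the review or from any real KG.

```python
from collections import defaultdict

# Hypothetical toy triples (head, relation, tail); not from any real KG.
TRIPLES = [
    ("drug_A", "targets", "gene_1"),
    ("drug_A", "targets", "gene_2"),
    ("drug_B", "targets", "gene_3"),
    ("gene_1", "associated_with", "disease_X"),
    ("gene_2", "associated_with", "disease_X"),
    ("gene_2", "associated_with", "disease_Y"),
    ("gene_3", "associated_with", "disease_Y"),
]


def score_drug_disease_links(triples):
    """Count two-hop drug -> gene -> disease paths for each (drug, disease) pair."""
    genes_of_drug = defaultdict(set)
    diseases_of_gene = defaultdict(set)
    for head, relation, tail in triples:
        if relation == "targets":
            genes_of_drug[head].add(tail)
        elif relation == "associated_with":
            diseases_of_gene[head].add(tail)

    scores = defaultdict(int)
    for drug, genes in genes_of_drug.items():
        for gene in genes:
            for disease in diseases_of_gene[gene]:
                scores[(drug, disease)] += 1
    return dict(scores)


scores = score_drug_disease_links(TRIPLES)
# Rank candidate links by descending path count, breaking ties by name.
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A heuristic like this ignores context (for example tissue specificity) and cannot tell supporting edges from contradicting ones, which is the kind of limitation the abstract points to.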
Affiliation(s)
- Daniel N Sosa
  - Department of Biomedical Data Science, Stanford University, 443 Via Ortega, 94305, California, USA
- Russ B Altman
  - Department of Biological Engineering; Department of Genetics; Department of Biomedical Data Science, Stanford University, 443 Via Ortega, 94305, California, USA
2
Causal Biological Network Model for Inflammasome Signaling Applied for Interpreting Transcriptomic Changes in Various Inflammatory States. Int J Inflam 2022; 2022:4071472. [PMID: 35126992 PMCID: PMC8813300 DOI: 10.1155/2022/4071472] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2021] [Accepted: 12/27/2021] [Indexed: 11/17/2022] Open
Abstract
Virtually any stressor that alters the cellular homeostatic state may result in an inflammatory response. As a critical component of innate immunity, inflammasomes play a prominent role in the inflammatory response. The information on inflammasome biology is rapidly growing, thus creating the need for structuring it into a model that can help visualize and enhance the understanding of underlying biological processes. Causal biological network (CBN) models provide predictive power for novel disease mechanisms and treatment outcomes. We assembled the available literature information on inflammasome activation into the CBN model and scored it with publicly available transcriptomic datasets that address viral infection of the lungs, osteo- and rheumatoid arthritis, psoriasis, and aging. The scoring inferred pathway activation leading to NLRP3 inflammasome activation in these diverse conditions, demonstrating that the CBN model provides a platform for interpreting transcriptomic data in the context of inflammasome activation.
3
Shao Y, Li H, Gu J, Qian L, Zhou G. Extraction of causal relations based on SBEL and BERT model. Database (Oxford) 2021; 2021:6133143. [PMID: 33570092 PMCID: PMC7904051 DOI: 10.1093/database/baab005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/30/2020] [Revised: 01/19/2021] [Accepted: 01/26/2021] [Indexed: 11/15/2022]
Abstract
Extraction of causal relations between biomedical entities in the form of Biological Expression Language (BEL) poses a new challenge to the community of biomedical text mining due to the complexity of BEL statements. We propose a simplified form of BEL statements [Simplified Biological Expression Language (SBEL)] to facilitate BEL extraction and employ BERT (Bidirectional Encoder Representation from Transformers) to improve the performance of causal relation extraction (RE). On the one hand, BEL statement extraction is transformed into the extraction of an intermediate form, the SBEL statement, which is then further decomposed into two subtasks: entity RE and entity function detection. On the other hand, we use a powerful pretrained BERT model to both extract entity relations and detect entity functions, aiming to improve the performance of the two subtasks. Entity relations and functions are then combined into SBEL statements and finally merged into BEL statements. Experimental results on the BioCreative-V Track 4 corpus demonstrate that our method achieves state-of-the-art performance in BEL statement extraction, with F1 scores of 54.8% in Stage 2 evaluation and 30.1% in Stage 1 evaluation. Database URL: https://github.com/grapeff/SBEL_datasets
Affiliation(s)
- Yifan Shao
  - School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
- Haoru Li
  - School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
- Jinghang Gu
  - Department of Chinese & Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China, 999077
- Longhua Qian
  - School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
- Guodong Zhou
  - School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
4
Lang PF, Chebaro Y, Zheng X, P Sekar JA, Shaikh B, Natale DA, Karr JR. BpForms and BcForms: a toolkit for concretely describing non-canonical polymers and complexes to facilitate global biochemical networks. Genome Biol 2020; 21:117. [PMID: 32423472 PMCID: PMC7236495 DOI: 10.1186/s13059-020-02025-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 10/15/2019] [Accepted: 04/16/2020] [Indexed: 12/12/2022] Open
Abstract
Non-canonical residues, caps, crosslinks, and nicks are important to many functions of DNAs, RNAs, proteins, and complexes. However, we do not fully understand how networks of such non-canonical macromolecules generate behavior. One barrier is our limited formats for describing macromolecules. To overcome this barrier, we develop BpForms and BcForms, a toolkit for representing the primary structure of macromolecules as combinations of residues, caps, crosslinks, and nicks. The toolkit can help omics researchers perform quality control and exchange information about macromolecules, help systems biologists assemble global models of cells that encompass processes such as post-translational modification, and help bioengineers design cells.
Affiliation(s)
- Paul F Lang
  - Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA
  - Department of Biochemistry, University of Oxford, South Parks Road, Oxford, OX1 3QU, UK
- Yassmine Chebaro
  - Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA
  - Institut de Génétique et de Biologie Moléculaire et Cellulaire, Institut National de la Santé et de la Recherche Médicale, Centre National de la Recherche Scientifique, Université de Strasbourg, Illkirch, 67404, France
- Xiaoyue Zheng
  - Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA
- John A P Sekar
  - Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA
- Bilal Shaikh
  - Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA
- Darren A Natale
  - Protein Information Resource, Georgetown University Medical Center, Washington, DC, 20007, USA
- Jonathan R Karr
  - Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, 10029, NY, USA
5
Madan S, Szostak J, Komandur Elayavilli R, Tsai RTH, Ali M, Qian L, Rastegar-Mojarad M, Hoeng J, Fluck J. The extraction of complex relationships and their conversion to biological expression language (BEL) overview of the BioCreative VI (2017) BEL track. Database (Oxford) 2020; 2019:5585579. [PMID: 31603193 PMCID: PMC6787548 DOI: 10.1093/database/baz084] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 11/27/2018] [Revised: 05/22/2019] [Accepted: 05/31/2019] [Indexed: 01/12/2023]
Abstract
Knowledge of the molecular interactions of biological and chemical entities and their involvement in biological processes or clinical phenotypes is important for data interpretation. Unfortunately, this knowledge is mostly embedded in the literature in such a way that it is unavailable for automated data analysis procedures. Biological expression language (BEL) is a syntax representation allowing for the structured representation of a broad range of biological relationships. It is used in various situations to extract such knowledge and transform it into BEL networks. To support the tedious and time-intensive extraction work of curators with automated methods, we developed the BEL track within the framework of BioCreative Challenges. Within the BEL track, we provide training data and an evaluation environment to encourage the text mining community to tackle the automatic extraction of complex BEL relationships. In 2017 BioCreative VI, the 2015 BEL track was repeated with new test data. Although only minor improvements in text snippet retrieval for given statements were achieved during this second BEL task iteration, a significant increase of BEL statement extraction performance from provided sentences could be seen. The best performing system reached a 32% F-score for the extraction of complete BEL statements and with the given named entities this increased to 49%. This time, besides rule-based systems, new methods involving hierarchical sequence labeling and neural networks were applied for BEL statement extraction.
Affiliation(s)
- Sumit Madan
  - Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, 53754 Sankt Augustin, Germany
- Justyna Szostak
  - Philip Morris International R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000 Neuchatel, Switzerland
- Richard Tzong-Han Tsai
  - Department of Computer Science and Information Engineering, National Central University, Taiwan, R.O.C., Taiwan 320
- Mehdi Ali
  - Friedrich Wilhelm University of Bonn, 53012 Bonn, Germany
- Longhua Qian
  - NLP Lab, School of Computer Science and Technology, Soochow University, Suzhou, 215006 Suzhou, China
- Majid Rastegar-Mojarad
  - Department of Health Sciences Research, Mayo Clinic, 200 First St. SW, Rochester, MN 55905, USA
- Julia Hoeng
  - Philip Morris International R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000 Neuchatel, Switzerland
- Juliane Fluck
  - Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, 53754 Sankt Augustin, Germany
6
Liu S, Shao Y, Qian L, Zhou G. Hierarchical sequence labeling for extracting BEL statements from biomedical literature. BMC Med Inform Decis Mak 2019; 19:63. [PMID: 30961584 PMCID: PMC6454591 DOI: 10.1186/s12911-019-0758-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2022] Open
Abstract
Background: Extracting relations between bio-entities from biomedical literature is often a challenging task and also an essential step towards biomedical knowledge expansion. The BioCreative community has organized a shared task to evaluate the robustness of the causal relationship extraction algorithms in Biological Expression Language (BEL) from biomedical literature. Method: We first map the sentence-level BEL statements in the BC-V training corpus to the corresponding text segments, thus generating hierarchically tagged training instances. A hierarchical sequence labeling model was afterwards induced from these training instances and applied to the test sentences in order to construct the BEL statements. Results: The experimental results on extracting BEL statements from BioCreative V Track 4 test corpus show that our method achieves promising performance with an overall F-measure of 31.6%. Furthermore, it has the potential to be enhanced by adopting more advanced machine learning approaches. Conclusion: We propose a framework for hierarchical relation extraction using hierarchical sequence labeling on the instance-level training corpus derived from the original sentence-level corpus via word alignment. Its main advantage is that we can make full use of the original training corpus to induce the sequence labelers and then apply them to the test corpus.
Affiliation(s)
- Suwen Liu
  - School of Computer Science and Technology, Soochow University, Suzhou, China
- Yifan Shao
  - School of Computer Science and Technology, Soochow University, Suzhou, China
- Longhua Qian
  - School of Computer Science and Technology, Soochow University, Suzhou, China
- Guodong Zhou
  - School of Computer Science and Technology, Soochow University, Suzhou, China
7
Saqi M, Lysenko A, Guo YK, Tsunoda T, Auffray C. Navigating the disease landscape: knowledge representations for contextualizing molecular signatures. Brief Bioinform 2019; 20:609-623. [PMID: 29684165 PMCID: PMC6556902 DOI: 10.1093/bib/bby025] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/09/2017] [Revised: 02/05/2018] [Indexed: 12/14/2022] Open
Abstract
Large amounts of data emerging from experiments in molecular medicine are leading to the identification of molecular signatures associated with disease subtypes. The contextualization of these patterns is important for obtaining mechanistic insight into the aberrant processes associated with a disease, and this typically involves the integration of multiple heterogeneous types of data. In this review, we discuss knowledge representations that can be useful to explore the biological context of molecular signatures, in particular three main approaches, namely, pathway mapping approaches, molecular network centric approaches and approaches that represent biological statements as knowledge graphs. We discuss the utility of each of these paradigms, illustrate how they can be leveraged with selected practical examples and identify ongoing challenges for this field of research.
Affiliation(s)
- Mansoor Saqi
  - Data Science Institute, Imperial College London, UK
- Artem Lysenko
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Yi-Ke Guo
  - Data Science Institute, Imperial College London, UK
- Tatsuhiko Tsunoda
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
  - CREST, JST, Tokyo, Japan
  - Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
- Charles Auffray
  - European Institute for Systems Biology and Medicine, Lyon, France
8
Liu S, Cheng W, Qian L, Zhou G. Combining relation extraction with function detection for BEL statement extraction. Database (Oxford) 2019; 2019:5277249. [PMID: 30624649 PMCID: PMC6323300 DOI: 10.1093/database/bay133] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/01/2018] [Accepted: 11/26/2018] [Indexed: 11/29/2022]
Abstract
The BioCreative-V community proposed a challenging task: automatic extraction of causal relation networks in Biological Expression Language (BEL) from the biomedical literature. Previous studies on this task largely used models induced from other related tasks and then transformed intermediate structures to BEL statements, which left the given training corpus unexplored. To make full use of the BEL training corpus, in this work, we propose a deep learning-based approach to extract BEL statements. Specifically, we decompose the problem into two subtasks: entity relation extraction and entity function detection. First, two attention-based bidirectional long short-term memory network models are used to extract entity relations and entity functions, respectively. Then entity relations and their functions are combined into a BEL statement. In order to boost the overall performance, a strategy of threshold filtering is applied to improve the precision of identified entity functions. We evaluate our approach on the BioCreative-V Track 4 corpus with and without gold entities. The experimental results show that our method achieves state-of-the-art performance with an overall F1-measure of 46.9% in stage 2 and 21.3% in stage 1.
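The decomposition described in this abstract (relation extraction, function detection, threshold filtering, then combination) can be sketched roughly as follows. The function names, the confidence values, the 0.8 threshold and the example entities are hypothetical and chosen only for illustration; the two classifiers themselves are replaced by hard-coded outputs, so this is not the authors' implementation.

```python
def apply_function_threshold(function_predictions, threshold=0.8):
    """Keep a predicted entity function only if its confidence reaches the threshold.

    `function_predictions` maps an entity term to a (function_label, confidence) pair.
    """
    return {
        entity: label
        for entity, (label, confidence) in function_predictions.items()
        if confidence >= threshold
    }


def wrap_with_function(entity, functions):
    """Wrap an entity term in its function, e.g. kin(p(HGNC:AKT1)), if one was kept."""
    label = functions.get(entity)
    return f"{label}({entity})" if label else entity


def build_bel_statement(relation, functions):
    """Combine a (subject, predicate, object) relation with entity functions."""
    subject, predicate, obj = relation
    return f"{wrap_with_function(subject, functions)} {predicate} {wrap_with_function(obj, functions)}"


# Hypothetical outputs of the relation and function classifiers.
relation = ("p(HGNC:AKT1)", "decreases", "p(HGNC:GSK3B)")
function_predictions = {
    "p(HGNC:AKT1)": ("kin", 0.93),   # confident prediction, kept
    "p(HGNC:GSK3B)": ("deg", 0.41),  # low confidence, filtered out
}

kept_functions = apply_function_threshold(function_predictions, threshold=0.8)
statement = build_bel_statement(relation, kept_functions)
```

Raising the threshold drops more function predictions, which trades recall on functions for precision, in line with the motivation stated in the abstract.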
Affiliation(s)
- Suwen Liu
  - School of Computer Science and Technology, Soochow University, Suzhou, China
- Wei Cheng
  - School of Computer Science and Technology, Soochow University, Suzhou, China
- Longhua Qian
  - School of Computer Science and Technology, Soochow University, Suzhou, China
- Guodong Zhou
  - School of Computer Science and Technology, Soochow University, Suzhou, China
9
Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Wei CH, Comeau DC, Antunes R, Matos S, Chen Q, Elangovan A, Panyam NC, Verspoor K, Liu H, Wang Y, Liu Z, Altinel B, Hüsünbeyi ZM, Özgür A, Fergadis A, Wang CK, Dai HJ, Tran T, Kavuluru R, Luo L, Steppi A, Zhang J, Qu J, Lu Z. Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine. Database (Oxford) 2019; 2019:5303240. [PMID: 30689846 PMCID: PMC6348314 DOI: 10.1093/database/bay147] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 03/02/2018] [Accepted: 12/19/2018] [Indexed: 12/16/2022]
Abstract
The Precision Medicine Initiative is a multicenter effort aiming at formulating personalized treatments leveraging on individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of BioCreative VI Precision Medicine Track was to extract this particular type of information and was organized in two tasks: (i) document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations and (ii) relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods. 
Given the level of participation and the individual team results we find the precision medicine track to be successful in engaging the text-mining research community. In the meantime, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPI affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.
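For readers less familiar with the metrics quoted in this abstract, the sketch below shows how precision, recall and F-score are conventionally computed for a set of predicted interacting protein pairs against gold annotations. The gene identifiers are made up, and the track's actual evaluation (for example its handling of homologous genes) is not reproduced here.

```python
def normalize_pairs(pairs):
    """Treat each interaction as unordered so (a, b) and (b, a) compare equal."""
    return {tuple(sorted(pair)) for pair in pairs}


def precision_recall_f1(predicted_pairs, gold_pairs):
    """Return (precision, recall, F1) for predicted pairs against gold pairs."""
    predicted = normalize_pairs(predicted_pairs)
    gold = normalize_pairs(gold_pairs)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gene identifiers, for illustration only.
gold = [("101", "202"), ("303", "404"), ("505", "606"), ("707", "808")]
predicted = [("202", "101"), ("303", "404"), ("909", "111")]

precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Note that the "best" precision, recall and F-score reported in the abstract can come from different submitted systems, so they need not satisfy the F-score formula jointly.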
Affiliation(s)
- Rezarta Islamaj Dogan
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Sun Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Donald C Comeau
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Rui Antunes
  - Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Sérgio Matos
  - Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Qingyu Chen
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Aparna Elangovan
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Nagesh C Panyam
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Hongfang Liu
  - Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
- Yanshan Wang
  - Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
- Zhuang Liu
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Berna Altinel
  - Department of Computer Engineering, Marmara University, Istanbul, Turkey
- Aris Fergadis
  - School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, Athens, Greece
- Chen-Kai Wang
  - Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan
- Hong-Jie Dai
  - Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
- Tung Tran
  - Department of Computer Science, University of Kentucky, Lexington, KY, USA
- Ramakanth Kavuluru
  - Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
- Ling Luo
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Albert Steppi
  - Department of Statistics, Florida State University, Florida, USA
- Jinfeng Zhang
  - Department of Statistics, Florida State University, Florida, USA
- Jinchan Qu
  - Department of Statistics, Florida State University, Florida, USA
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
10
Ravikumar KE, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database (Oxford) 2017; 2017:3053439. [PMID: 28365720 PMCID: PMC5467463 DOI: 10.1093/database/baw156] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 01/14/2016] [Accepted: 11/07/2016] [Indexed: 12/22/2022]
Abstract
Extracting meaningful relationships with semantic significance from biomedical literature is often a challenging task. The BioCreative V Track 4 challenge organized, for the first time, a comprehensive shared task to test the robustness of text-mining algorithms in extracting semantically meaningful assertions from evidence statements in biomedical text. In this work, we tested the ability of a rule-based semantic parser to extract Biological Expression Language (BEL) statements from evidence sentences culled from biomedical literature as part of the BioCreative V Track 4 challenge. The system achieved an overall best F-measure of 21.29% in extracting the complete BEL statement. For relation extraction, the system achieved an F-measure of 65.13% on the test data set. Our system achieved the best performance in five of the six criteria that were adopted for evaluation by the task organizers. An inability to derive semantic inferences and limitations in the rule sets that map textual extractions to BEL functions were among the reasons for low performance in extracting the complete BEL statement. After the shared task, we also evaluated the impact of different NER components on the ability to extract BEL statements on the test data sets, besides making a single change in the rule sets that translate relation extractions into a BEL statement. This yielded a marked improvement of over 20% in BELMiner's overall ability to extract BEL statements on the test set. The system is available as a REST-API at http://54.146.11.205:8484/BELXtractor/finder/ Database URL: http://54.146.11.205:8484/BELXtractor/finder/
Affiliation(s)
- K E Ravikumar
  - Department of Health Sciences Research, Mayo Clinic, USA
- Majid Rastegar-Mojarad
  - Department of Health Sciences Research, Mayo Clinic, USA
  - Department of Health Informatics and Administration, University of Wisconsin-Milwaukee, Milwaukee, WI, USA
- Hongfang Liu
  - Department of Health Sciences Research, Mayo Clinic, USA
11
Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Chang CS, Oughtred R, Rust J, Wilbur WJ, Comeau DC, Dolinski K, Tyers M. The BioC-BioGRID corpus: full text articles annotated for curation of protein-protein and genetic interactions. Database (Oxford) 2017; 2017:baw147. [PMID: 28077563 PMCID: PMC5225395 DOI: 10.1093/database/baw147] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 07/30/2016] [Revised: 10/14/2016] [Accepted: 10/18/2016] [Indexed: 11/13/2022]
Abstract
A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein–protein and genetic interactions (PPIs and GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. 
The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report. Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html
Collapse
Affiliation(s)
- Rezarta Islamaj Dogan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD20894, USA
| | - Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD20894, USA
| | - Andrew Chatr-Aryamontri
- Institute for Research in Immunology and Cancer, Université de Montréal, Canada Montréal, QC H3C 3J7
| | - Christie S Chang
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Rose Oughtred
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Jennifer Rust
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Kara Dolinski
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Mike Tyers
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada; Mount Sinai Hospital, The Lunenfeld-Tanenbaum Research Institute, Canada
| |
|
12
|
Pérez-Pérez M, Pérez-Rodríguez G, Fdez-Riverola F, Lourenço A. Collaborative relation annotation and quality analysis in Markyt environment. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017; 2017:4693828. [PMID: 29220479 PMCID: PMC5737204 DOI: 10.1093/database/bax090] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/18/2017] [Accepted: 11/09/2017] [Indexed: 11/30/2022]
Abstract
Text mining is showing potential to help in biomedical knowledge integration and discovery at various levels. However, results depend largely on the specifics of the knowledge problem and, in particular, on the ability to produce high-quality benchmarking corpora that may support the training and evaluation of automatic prediction systems. Annotation tools enabling the flexible and customizable production of such corpora are thus pivotal. The open-source Markyt annotation environment brings together the latest web technologies to offer a wide range of annotation capabilities in a domain-agnostic way. It enables the management of multi-user and multi-round annotation projects, including inter-annotator agreement and consensus assessments. Markyt also supports the description of entity and relation annotation guidelines on a project basis, and accommodates partial word tagging and overlapping annotations. This paper describes the current release of Markyt, namely new annotation perspectives, which enable the annotation of relations among entities, and enhanced analysis capabilities. Several demos, inspired by public biomedical corpora, are presented as a means to better illustrate such functionalities. Markyt aims to bring together annotation capabilities of broad interest to those producing annotated corpora. Markyt demonstration projects describe 20 different annotation tasks of varied document sources (e.g. abstracts, tweets or drug labels) and languages (e.g. English, Spanish or Chinese). Continuous development is based on feedback from practical applications as well as community reports on short- and medium-term mining challenges. Markyt is freely available for non-commercial use at http://markyt.org. Database URL: http://markyt.org
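As an illustration of the inter-annotator agreement assessment mentioned in this abstract, Cohen's kappa for two annotators can be computed as sketched below. The abstract does not say which agreement measure Markyt uses, so the choice of kappa, the function name `cohen_kappa` and the label values are assumptions made for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Hypothetical helper for illustration; not part of Markyt.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels for four candidate relations.
annotator_1 = ["relation", "relation", "none", "none"]
annotator_2 = ["relation", "none", "none", "none"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is often preferred over a simple percentage when label distributions are skewed.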
Affiliation(s)
- Martín Pérez-Pérez
- ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain; CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
| | - Gael Pérez-Rodríguez
- ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain; CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
| | - Florentino Fdez-Riverola
- ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain; CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
| | - Anália Lourenço
- ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain; CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain; CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
| |
|
13
|
Madan S, Hodapp S, Senger P, Ansari S, Szostak J, Hoeng J, Peitsch M, Fluck J. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database (Oxford) 2016; 2016:baw136. [PMID: 27694210 PMCID: PMC5045868 DOI: 10.1093/database/baw136] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 12/11/2015] [Revised: 08/26/2016] [Accepted: 08/30/2016] [Indexed: 11/14/2022]
Abstract
Network-based approaches have become extremely important in systems biology to achieve a better understanding of biological mechanisms. For network representation, the Biological Expression Language (BEL) is well designed to collate findings from the scientific literature into biological network models. To facilitate encoding and biocuration of such findings in BEL, a BEL Information Extraction Workflow (BELIEF) was developed. BELIEF provides a web-based curation interface, the BELIEF Dashboard, that incorporates text mining techniques to support the biocurator in the generation of BEL networks. The underlying UIMA-based text mining pipeline (BELIEF Pipeline) uses several named entity recognition processes and relationship extraction methods to detect concepts and BEL relationships in literature. The BELIEF Dashboard allows easy curation of the automatically generated BEL statements and their context annotations. Resulting BEL statements and their context annotations can be syntactically and semantically verified to ensure consistency in the BEL network. In summary, the workflow supports experts in different stages of systems biology network building. Based on the BioCreative V BEL track evaluation, we show that the BELIEF Pipeline automatically extracts relationships with an F-score of 36.4% and fully correct statements can be obtained with an F-score of 30.8%. Participation in the BioCreative V Interactive task (IAT) track with BELIEF revealed a systems usability scale (SUS) of 67. Considering the complexity of the task for new users (learning BEL, working with a completely new interface and performing complex curation), a score so close to the overall SUS average highlights the usability of BELIEF. Database URL: BELIEF is available at http://www.scaiview.com/belief/.
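For readers unfamiliar with the system usability scale, the sketch below applies the standard SUS scoring rule to one respondent's ten questionnaire answers, each on a 1 to 5 scale. The function name `sus_score` and the sample answers are hypothetical; this is not part of BELIEF or of the IAT track tooling.

```python
def sus_score(responses):
    """Standard SUS score (0-100) for one respondent's ten answers (1-5 each)."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if not 1 <= answer <= 5:
            raise ValueError("each response must be between 1 and 5")
        if item_number % 2 == 1:
            # Odd-numbered items are positively worded: contribution is answer - 1.
            total += answer - 1
        else:
            # Even-numbered items are negatively worded: contribution is 5 - answer.
            total += 5 - answer
    # The summed contributions (0-40) are scaled to a 0-100 range.
    return total * 2.5


# Made-up answers: a neutral respondent and a maximally satisfied one.
neutral = sus_score([3] * 10)
best = sus_score([5, 1] * 5)
```

A study-level SUS value such as the one reported in the abstract is typically the mean of these per-respondent scores.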
Affiliation(s)
- Sumit Madan
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
| | - Sven Hodapp
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
| | - Philipp Senger
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
| | - Sam Ansari
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
| | - Justyna Szostak
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
| | - Julia Hoeng
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
| | - Manuel Peitsch
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
| | - Juliane Fluck
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
| |
|
14
|
Rinaldi F, Ellendorff TR, Madan S, Clematide S, van der Lek A, Mevissen T, Fluck J. BioCreative V track 4: a shared task for the extraction of causal network information using the Biological Expression Language. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw067. [PMID: 27402677 PMCID: PMC4940434 DOI: 10.1093/database/baw067] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 12/24/2015] [Accepted: 04/11/2016] [Indexed: 12/27/2022]
Abstract
Automatic extraction of biological network information is one of the most desired and most complex tasks in biological and medical text mining. Track 4 at BioCreative V attempts to approach this complexity using fragments of large-scale manually curated biological networks, represented in Biological Expression Language (BEL), as training and test data. BEL is an advanced knowledge representation format which has been designed to be both human readable and machine processable. The specific goal of track 4 was to evaluate text mining systems capable of automatically constructing BEL statements from given evidence text, and of retrieving evidence text for given BEL statements. Given the complexity of the task, we designed an evaluation methodology which gives credit to partially correct statements. We identified various levels of information expressed by BEL statements, such as entities, functions, relations, and introduced an evaluation framework which rewards systems capable of delivering useful BEL fragments at each of these levels. The aim of this evaluation method is to help identify the characteristics of the systems which, if combined, would be most useful for achieving the overall goal of automatically constructing causal biological networks from text.
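The idea of crediting partially correct statements at several levels can be sketched as follows. This is a simplified illustration, not the track's official scorer: the regular-expression decomposition, the short relation list, the function names and the example statements are all assumptions, and the example strings are made up rather than curated facts.

```python
import re

# Small illustrative subset of BEL relationship keywords (assumption).
RELATIONS = {"increases", "decreases", "directlyIncreases", "directlyDecreases"}


def decompose(statement):
    """Split a simple BEL-like statement into sets of entities, functions and relations."""
    return {
        # Namespace-qualified identifiers such as HGNC:AKT1.
        "entity": set(re.findall(r"[A-Za-z0-9]+:[A-Za-z0-9_]+", statement)),
        # Function names are the letters directly before an opening parenthesis.
        "function": set(re.findall(r"([A-Za-z]+)\(", statement)),
        # Relations are whitespace-separated tokens found in RELATIONS.
        "relation": {token for token in statement.split() if token in RELATIONS},
    }


def f_score(predicted, gold):
    """Harmonic mean of precision and recall over two sets."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def score_by_level(predicted_statement, gold_statement):
    """Return one F-score per level so partially correct output still earns credit."""
    predicted, gold = decompose(predicted_statement), decompose(gold_statement)
    return {level: f_score(predicted[level], gold[level]) for level in gold}


# Made-up example: entities and functions are right, the relation is wrong.
scores = score_by_level(
    "p(HGNC:AKT1) decreases p(HGNC:GSK3B)",
    "p(HGNC:AKT1) increases p(HGNC:GSK3B)",
)
```

In this toy case a system would score fully at the entity and function levels and zero at the relation level, which is the kind of breakdown that shows where a system's strengths lie.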
Affiliation(s)
- Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland
| | | | - Sumit Madan
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
| | - Simon Clematide
- Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland
| | - Adrian van der Lek
- Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland
| | - Theo Mevissen
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
| | - Juliane Fluck
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
| |
|
15
|
Choi M, Liu H, Baumgartner W, Zobel J, Verspoor K. Coreference resolution improves extraction of Biological Expression Language statements from texts. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw076. [PMID: 27374122 PMCID: PMC4930833 DOI: 10.1093/database/baw076] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 12/04/2015] [Accepted: 04/21/2016] [Indexed: 01/07/2023]
Abstract
We describe a system that automatically extracts biological events from biomedical journal articles, and translates those events into Biological Expression Language (BEL) statements. The system incorporates existing text mining components for coreference resolution, biological event extraction and a previously formally untested strategy for BEL statement generation. Although addressing the BEL track (Track 4) at BioCreative V (2015), we also investigate how incorporating coreference resolution might impact event extraction in the biomedical domain. We report that our system achieved the best full BEL statement-level F-scores: 20.2 in stage 1 and 35.2 in stage 2, the latter using the provided gold standard entities. We also report that our results evaluated on the training dataset show a benefit from integrating coreference resolution with event extraction.
Affiliation(s)
- Miji Choi
- Department of Computing and Information Systems, the University of Melbourne; National ICT Australia (NICTA) Victoria Research Laboratory, Parkville, Victoria, Australia
| | | | | | - Justin Zobel
- Department of Computing and Information Systems, the University of Melbourne
| | - Karin Verspoor
- Department of Computing and Information Systems, the University of Melbourne
| |
|
16
|
Lai PT, Lo YY, Huang MS, Hsiao YC, Tsai RTH. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text. Database (Oxford) 2016; 2016:baw064. [PMID: 27173520 PMCID: PMC4865328 DOI: 10.1093/database/baw064] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 12/05/2015] [Revised: 04/08/2016] [Accepted: 04/11/2016] [Indexed: 02/04/2023]
Abstract
Biological expression language (BEL) is one of the most popular languages to represent the causal and correlative relationships among biological events. Automatically extracting and representing biomedical events using BEL can help biologists quickly survey and understand relevant literature. Recently, many researchers have shown interest in biomedical event extraction. However, the task is still a challenge for current systems because of the complexity of integrating different information extraction tasks such as named entity recognition (NER), named entity normalization (NEN) and relation extraction into a single system. In this study, we introduce our BelSmile system, which uses a semantic-role-labeling (SRL)-based approach to extract the NEs and events for BEL statements. BelSmile combines our previous NER, NEN and SRL systems. We evaluate BelSmile using the BioCreative V BEL task dataset. Our system achieved an F-score of 27.8%, ∼7% higher than the top BioCreative V system. The three main contributions of this study are (i) an effective pipeline approach to extract BEL statements, (ii) a syntactic-based labeler to extract subject-verb-object tuples and (iii) a web-based version of BelSmile that is publicly available at iisrserv.csie.ncu.edu.tw/belsmile.
Affiliation(s)
- Po-Ting Lai
- Department of Computer Science, National Tsing-Hua University, No. 101, Section 2, Kuang-Fu Road, Hsinchu, Taiwan 30013, Republic of China
| | - Yu-Yan Lo
- Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Road, Zhongli, Taoyuan, Taiwan 320, Republic of China
| | - Ming-Siang Huang
- Department of Clinical Laboratory Sciences and Medical Biotechnology, College of Medicine, National Taiwan University, No.1, Section 1, Renai Road, Taipei, Taiwan 10002, Republic of China
| | - Yu-Cheng Hsiao
- Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Road, Zhongli, Taoyuan, Taiwan 320, Republic of China
| | - Richard Tzong-Han Tsai
- Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Road, Zhongli, Taoyuan, Taiwan 320, Republic of China
| |
|
17
|
Rastegar-Mojarad M, Komandur Elayavilli R, Liu H. BELTracker: evidence sentence retrieval for BEL statements. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw079. [PMID: 27173525 PMCID: PMC4865361 DOI: 10.1093/database/baw079] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 12/04/2015] [Accepted: 04/22/2016] [Indexed: 01/09/2023]
Abstract
Biological expression language (BEL) is one of the main formal representation models of biological networks. The primary source of information for curating biological networks in BEL representation has been literature. It remains a challenge to identify relevant articles and the corresponding evidence statements for curating and validating BEL statements. In this paper, we describe BELTracker, a tool used to retrieve and rank evidence sentences from PubMed abstracts and full-text articles for a given BEL statement (per the 2015 task requirements of BioCreative V BEL Task). The system comprises three main components: (i) translation of a given BEL statement to an information retrieval (IR) query, (ii) retrieval of relevant PubMed citations and (iii) finding and ranking the evidence sentences in those citations. BELTracker uses a combination of multiple approaches based on traditional IR, machine learning, and heuristics to accomplish the task. The system identified and ranked at least one fully relevant evidence sentence in the top 10 retrieved sentences for 72 out of 97 BEL statements in the test set. BELTracker achieved a precision of 0.392, 0.532 and 0.615 when evaluated with three criteria, namely full, relaxed and context criteria, respectively, by the task organizers. Our team at Mayo Clinic was the only participant in this task. BELTracker is publicly available as a RESTful API. Database URL: http://www.openbionlp.org:8080/BelTracker/finder/Given_BEL_Statement
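The first and third components described above (turning a BEL statement into query terms, then ranking candidate sentences) can be sketched in a few lines. This is a toy term-overlap ranker under stated assumptions, not BELTracker's actual method, which combines traditional IR, machine learning and heuristics; the function names, the regular expression and the example sentences are hypothetical, and the retrieval step is replaced by an in-memory list.

```python
import re


def bel_to_query_terms(statement):
    """Extract lower-cased entity names (the text after 'NAMESPACE:') as query terms."""
    names = re.findall(r"[A-Za-z0-9]+:([A-Za-z0-9_]+)", statement)
    return [name.lower() for name in names]


def rank_sentences(statement, sentences):
    """Rank candidate sentences by how many query terms they mention."""
    terms = bel_to_query_terms(statement)
    scored = []
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z0-9_]+", sentence.lower()))
        score = sum(1 for term in terms if term in tokens)
        scored.append((score, sentence))
    # Highest score first; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop sentences that mention none of the query terms.
    return [sentence for score, sentence in scored if score > 0]


# Made-up statement and candidate sentences for illustration.
candidates = [
    "GSK3B is a kinase.",
    "Unrelated sentence about yeast.",
    "AKT1 phosphorylates GSK3B in vitro.",
]
ranked = rank_sentences("p(HGNC:AKT1) increases p(HGNC:GSK3B)", candidates)
```

A realistic system would also need synonym expansion for entity names and some representation of the relation itself, since term overlap alone cannot tell whether a sentence supports the stated direction of effect.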
Affiliation(s)
- Majid Rastegar-Mojarad
- Department of Health Sciences Research, Mayo Clinic, USA; University of Wisconsin-Milwaukee, Milwaukee, WI, USA
| | | | - Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, USA
| |
|