1
|
Liu D, Zhang Y, Yang M, Yuan J, Qu W. Extracting Mutant-Affected Protein-Protein Interactions via Gaussian-Enhanced Representation and Contrastive Learning. J Comput Biol 2023; 30:972-984. [PMID: 37682321 DOI: 10.1089/cmb.2023.0080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/09/2023] Open
Abstract
Genetic mutations can affect protein-protein interactions (PPIs), and such effects are reported in the biomedical literature. Automated extraction of PPIs affected by gene mutations from biomedical literature can aid in evaluating the clinical importance of gene variations, which is crucial for the advancement of precision medicine. In this study, a new model called the Gaussian-enhanced representation model (GRM) is introduced for PPI extraction. The model uses a Gaussian probability distribution to produce a target entity representation on top of the pretrained BioBERT model. The GRM assigns more weight to target protein entities and their adjacent entities, addressing the problem of lengthy input text and scattered target entities in the PPI extraction task. Additionally, the model introduces a supervised contrastive learning approach to enhance its effectiveness and robustness. Experiments on the BioCreative VI data set demonstrate that the proposed GRM achieves state-of-the-art performance.
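The idea of giving more weight to target entities and their neighbours can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the formula, so the distance-to-nearest-entity Gaussian, the `sigma` parameter, and the function names below are all assumptions made for illustration, and plain Python lists stand in for BioBERT token vectors.

```python
import math


def gaussian_entity_weights(seq_len, entity_positions, sigma=3.0):
    """Return normalized per-token weights (hypothetical helper).

    Each token is weighted by a Gaussian of its distance to the nearest
    target entity token, so entities and their neighbours dominate.
    """
    weights = []
    for i in range(seq_len):
        nearest = min(abs(i - p) for p in entity_positions)
        weights.append(math.exp(-(nearest ** 2) / (2.0 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]


def weighted_representation(token_vectors, entity_positions, sigma=3.0):
    """Pool token vectors into one vector using the Gaussian weights."""
    weights = gaussian_entity_weights(len(token_vectors), entity_positions, sigma)
    dim = len(token_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors))
        for d in range(dim)
    ]


# Toy usage: 10 tokens, target entities at token positions 2 and 7.
weights = gaussian_entity_weights(10, [2, 7], sigma=1.5)
pooled = weighted_representation([[1.0, 0.0]] * 10, [2, 7], sigma=1.5)
```

A smaller `sigma` concentrates the weight on the entity tokens themselves, while a larger one lets more distant context contribute.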
Affiliation(s)
- Da Liu
  - School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning, China
- Yijia Zhang
  - School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning, China
- Ming Yang
  - School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning, China
- Jianyuan Yuan
  - School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning, China
- Wen Qu
  - School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning, China
|
2
|
Qu J, Steppi A, Zhong D, Hao J, Wang J, Lung PY, Zhao T, He Z, Zhang J. Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach. BMC Genomics 2020; 21:773. [PMID: 33167858 PMCID: PMC7654050 DOI: 10.1186/s12864-020-07185-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/05/2020] [Accepted: 10/26/2020] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Information on protein-protein interactions affected by mutations is very useful for understanding the biological effect of mutations and for developing treatments targeting the interactions. In this study, we developed a natural language processing (NLP) based machine learning approach for extracting such information from the literature. Our aim is to identify journal abstracts or paragraphs in full-text articles that contain at least one occurrence of a protein-protein interaction (PPI) affected by a mutation. RESULTS Our system makes use of the latest NLP methods with a large number of engineered features, including some based on pre-trained word embeddings. Our final model achieved satisfactory performance in the Document Triage Task of the BioCreative VI Precision Medicine Track, with the highest recall and a comparable F1-score. CONCLUSIONS The performance of our method indicates that it is well suited for being combined with manual annotation. Our machine learning framework and engineered features will also be helpful for other researchers to further improve this and other related biological text mining tasks using either traditional machine learning or deep learning based methods.
Affiliation(s)
- Jinchan Qu
  - Department of Statistics, Florida State University, Tallahassee, FL, 32306, USA
- Albert Steppi
  - Laboratory of Systems Pharmacology at Harvard Medical School, Boston, MA, 02115, USA
- Dongrui Zhong
  - Department of Statistics, Florida State University, Tallahassee, FL, 32306, USA
- Jie Hao
  - Department of Statistics, Florida State University, Tallahassee, FL, 32306, USA
- Jian Wang
  - CloudMedx, Palo Alto, CA, 94301, USA
- Pei-Yau Lung
  - Verisk - Insurance Solutions, Middletown, CT, 06457, USA
- Tingting Zhao
  - Department of Geography, Florida State University, Tallahassee, FL, 32306, USA
- Zhe He
  - College of Communication and Information, Florida State University, Tallahassee, FL, 32306, USA
- Jinfeng Zhang
  - Department of Statistics, Florida State University, Tallahassee, FL, 32306, USA
|
3
|
Islamaj R, Kwon D, Kim S, Lu Z. TeamTat: a collaborative text annotation tool. Nucleic Acids Res 2020; 48:W5-W11. [PMID: 32383756 DOI: 10.1093/nar/gkaa333] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 03/03/2020] [Revised: 04/16/2020] [Accepted: 04/22/2020] [Indexed: 12/20/2022] Open
Abstract
Manually annotated data is key to developing text-mining and information-extraction algorithms. However, human annotation requires considerable time, effort and expertise. Given the rapid growth of biomedical literature, it is paramount to build tools that facilitate speed and maintain expert quality. While existing text annotation tools may provide user-friendly interfaces to domain experts, limited support is available for figure display, project management, and multi-user team annotation. In response, we developed TeamTat (https://www.teamtat.org), a web-based annotation tool (local setup available), equipped to manage team annotation projects engagingly and efficiently. TeamTat is a novel tool for managing multi-user, multi-label document annotation, reflecting the entire production life cycle. Project managers can specify annotation schema for entities and relations and select annotator(s) and distribute documents anonymously to prevent bias. Document input format can be plain text, PDF or BioC (uploaded locally or automatically retrieved from PubMed/PMC), and output format is BioC with inline annotations. TeamTat displays figures from the full text for the annotator's convenience. Multiple users can work on the same document independently in their workspaces, and the team manager can track task completion. TeamTat provides corpus quality assessment via inter-annotator agreement statistics, and a user-friendly interface convenient for annotation review and inter-annotator disagreement resolution to improve corpus quality.
Affiliation(s)
- Rezarta Islamaj
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Dongseop Kwon
  - School of Software Convergence, Myongji University, Seoul 03674, South Korea
- Sun Kim
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
|
4
|
Stojanov R, Popovski G, Jofce N, Trajanov D, Seljak BK, Eftimov T. FoodViz: Visualization of Food Entities Linked Across Different Standards. Machine Learning, Optimization, and Data Science 2020. [DOI: 10.1007/978-3-030-64580-9_4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 01/27/2023]
|
5
|
Pérez-Pérez M, Pérez-Rodríguez G, Blanco-Míguez A, Fdez-Riverola F, Valencia A, Krallinger M, Lourenço A. Next generation community assessment of biomedical entity recognition web servers: metrics, performance, interoperability aspects of BeCalm. J Cheminform 2019; 11:42. [PMID: 31236786 PMCID: PMC6591930 DOI: 10.1186/s13321-019-0363-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/09/2019] [Accepted: 06/09/2019] [Indexed: 11/23/2022] Open
Abstract
Background Shared tasks and community challenges represent key instruments to promote research, collaboration and determine the state of the art of biomedical and chemical text mining technologies. Traditionally, such tasks relied on the comparison of automatically generated results against a so-called Gold Standard dataset of manually labelled textual data, regardless of efficiency and robustness of the underlying implementations. Due to the rapid growth of unstructured data collections, including patent databases and particularly the scientific literature, there is a pressing need to generate, assess and expose robust big data text mining solutions to semantically enrich documents in real time. To address this pressing need, a novel track called “Technical interoperability and performance of annotation servers” was launched under the umbrella of the BioCreative text mining evaluation effort. The aim of this track was to enable the continuous assessment of technical aspects of text annotation web servers, specifically of online biomedical named entity recognition systems of interest for medicinal chemistry applications. Results A total of 15 out of 26 registered teams successfully implemented online annotation servers. They returned predictions during a two-month period in predefined formats and were evaluated through the BeCalm evaluation platform, specifically developed for this track. The track encompassed three levels of evaluation, i.e. data format considerations, technical metrics and functional specifications. Participating annotation servers were implemented in seven different programming languages and covered 12 general entity types. The continuous evaluation of server responses accounted for testing periods of low activity and moderate to high activity, encompassing overall 4,092,502 requests from three different document provider settings. The median response time was below 3.74 s, with a median of 10 annotations/document. 
Most of the servers showed great reliability and stability, being able to process over 100,000 requests in a 5-day period. Conclusions The presented track was a novel experimental task that systematically evaluated the technical performance aspects of online entity recognition systems. It raised the interest of a significant number of participants. Future editions of the competition will address the ability to process documents in bulk as well as to annotate full-text documents.
Affiliation(s)
- Martin Pérez-Pérez
  - Department of Computer Science, ESEI, University of Vigo, Campus As Lagoas, 32004, Ourense, Spain
  - The Biomedical Research Centre (CINBIO), Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
  - SING Research Group, Galicia Sur Health Research Institute (ISS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
- Gael Pérez-Rodríguez
  - Department of Computer Science, ESEI, University of Vigo, Campus As Lagoas, 32004, Ourense, Spain
  - The Biomedical Research Centre (CINBIO), Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
  - SING Research Group, Galicia Sur Health Research Institute (ISS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
- Aitor Blanco-Míguez
  - Department of Computer Science, ESEI, University of Vigo, Campus As Lagoas, 32004, Ourense, Spain
  - The Biomedical Research Centre (CINBIO), Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
  - SING Research Group, Galicia Sur Health Research Institute (ISS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
  - Department of Microbiology and Biochemistry of Dairy Products, Instituto de Productos Lácteos de Asturias (IPLA), Consejo Superior de Investigaciones Científicas (CSIC), Paseo Río Linares S/N 33300, Villaviciosa, Asturias, Spain
- Florentino Fdez-Riverola
  - Department of Computer Science, ESEI, University of Vigo, Campus As Lagoas, 32004, Ourense, Spain
  - The Biomedical Research Centre (CINBIO), Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
  - SING Research Group, Galicia Sur Health Research Institute (ISS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
- Alfonso Valencia
  - Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona 29-31, 08034, Barcelona, Spain
  - Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/Baldiri Reixac 10, 08028, Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, 08010, Barcelona, Spain
  - Spanish Bioinformatics Institute INB-ISCIII ES-ELIXIR, 28029, Madrid, Spain
- Martin Krallinger
  - Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona 29-31, 08034, Barcelona, Spain
  - Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/Baldiri Reixac 10, 08028, Barcelona, Spain
  - Biological Text Mining Unit, Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, 28029, Madrid, Spain
- Anália Lourenço
  - Department of Computer Science, ESEI, University of Vigo, Campus As Lagoas, 32004, Ourense, Spain
  - The Biomedical Research Centre (CINBIO), Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
  - SING Research Group, Galicia Sur Health Research Institute (ISS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
  - Centre of Biological Engineering (CEB), University of Minho, Campus de Gualtar, 4710-057, Braga, Portugal
|
6
|
Li M, He Q, Ma J, He F, Zhu Y, Chang C, Chen T. PPICurator: A Tool for Extracting Comprehensive Protein-Protein Interaction Information. Proteomics 2019; 19:e1800291. [PMID: 30521143 DOI: 10.1002/pmic.201800291] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 07/17/2018] [Revised: 11/12/2018] [Indexed: 11/07/2022]
Abstract
Protein-protein interaction (PPI) extraction through biological literature curation is widely employed for proteome analysis. There is a strong need for a tool that can assist researchers in extracting comprehensive PPI information through literature curation, which is critical in protein research, for example, in the construction of protein interaction networks, the identification of protein signaling pathways, and the discovery of meaningful protein interactions. However, most current tools can only extract PPI relations. None of them is capable of extracting other important PPI information, such as interaction directions, effects, and functional annotations. To address these issues, this paper proposes PPICurator, a novel tool for extracting comprehensive PPI information with a variety of logic and syntax features based on a new support vector machine classifier. PPICurator provides a friendly web-based user interface. It is a platform that automates the extraction of comprehensive PPI information from the literature, including PPI relations as well as their confidence scores, interaction directions, effects, and functional annotations. Thus, PPICurator is more comprehensive than state-of-the-art tools. Moreover, it outperforms state-of-the-art tools in the accuracy of PPI relation extraction, measured by F-score and recall, on widely used open datasets. PPICurator is available at https://ppicurator.hupo.org.cn.
Affiliation(s)
- Mansheng Li
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
- Qiang He
  - School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, Victoria, 3122, Australia
- Jie Ma
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
- Fuchu He
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
- Yunping Zhu
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
- Cheng Chang
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
- Tao Chen
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
|
7
|
Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Wei CH, Comeau DC, Antunes R, Matos S, Chen Q, Elangovan A, Panyam NC, Verspoor K, Liu H, Wang Y, Liu Z, Altinel B, Hüsünbeyi ZM, Özgür A, Fergadis A, Wang CK, Dai HJ, Tran T, Kavuluru R, Luo L, Steppi A, Zhang J, Qu J, Lu Z. Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine. Database (Oxford) 2019; 2019:5303240. [PMID: 30689846 PMCID: PMC6348314 DOI: 10.1093/database/bay147] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 03/02/2018] [Accepted: 12/19/2018] [Indexed: 12/16/2022]
Abstract
The Precision Medicine Initiative is a multicenter effort aiming to formulate personalized treatments by leveraging individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of the BioCreative VI Precision Medicine Track was to extract this particular type of information; the track was organized in two tasks: (i) a document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations, and (ii) a relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods.
Given the level of participation and the individual team results we find the precision medicine track to be successful in engaging the text-mining research community. In the meantime, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPI affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.
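The precision, recall and F-score figures in this abstract can be related with a short sketch. It assumes the reported F-score is the balanced F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name and the `beta` parameter are illustrative.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 gives the balanced F1 (harmonic mean)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Combining the best triage precision and the best triage recall
# gives a value well above the best reported F-score of 69.06%.
combined = f_score(0.6289, 0.980)
```

Under the F1 assumption, the combined value is roughly 0.77, so the best precision, best recall and best F-score for the triage task cannot all belong to a single submission; each "best" figure is a separate maximum over the submitted runs.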
Affiliation(s)
- Rezarta Islamaj Dogan
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Sun Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Donald C Comeau
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Rui Antunes
  - Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Sérgio Matos
  - Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Qingyu Chen
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Aparna Elangovan
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Nagesh C Panyam
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Hongfang Liu
  - Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
- Yanshan Wang
  - Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
- Zhuang Liu
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Berna Altinel
  - Department of Computer Engineering, Marmara University, Istanbul, Turkey
- Aris Fergadis
  - School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, Athens, Greece
- Chen-Kai Wang
  - Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan
- Hong-Jie Dai
  - Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
- Tung Tran
  - Department of Computer Science, University of Kentucky, Lexington, KY, USA
- Ramakanth Kavuluru
  - Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
- Ling Luo
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Albert Steppi
  - Department of Statistics, Florida State University, Florida, USA
- Jinfeng Zhang
  - Department of Statistics, Florida State University, Florida, USA
- Jinchan Qu
  - Department of Statistics, Florida State University, Florida, USA
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
|
8
|
Dai HJ, Singh O. SPRENO: a BioC module for identifying organism terms in figure captions. Database (Oxford) 2018; 2018:5032611. [PMID: 29873706 PMCID: PMC6007219 DOI: 10.1093/database/bay048] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 01/22/2018] [Accepted: 04/23/2018] [Indexed: 11/30/2022]
Abstract
Recent advances in biological research reveal that the majority of experiments strive for comprehensive exploration of the biological system rather than targeting specific biological entities. The qualitative and quantitative findings of these investigations are often available exclusively in the form of figures in published papers. There is no denying that such findings have been instrumental in the in-depth understanding of biological processes and pathways. However, such data are not readily interpretable by machines, because the descriptions in figure captions convey rich information in an ambiguous manner. The abbreviated term ‘SIN’ exemplifies this issue, as it may stand for Sindbis virus or the sex-lethal interactor gene (Drosophila melanogaster). To overcome this ambiguity, entities should be identified by linking them to the respective entries in notable biological databases. Among all entity types, the task of identifying species plays a pivotal role in disambiguating related entities in the text. In this study, we present our species identification tool SPRENO (Species Recognition and Normalization), which recognizes organism terms mentioned in figure captions and links them to the NCBI taxonomy database by exploiting contextual information from both the figure caption and the corresponding full text. To determine the ID of ambiguous organism mentions, two disambiguation methods have been developed. One is based on the majority rule, selecting the ID that has been successfully linked to previously mentioned organism terms. The other is a convolutional neural network (CNN) model trained by learning both the context and the distance information of the target organism mention. As a system based on the majority rule, SPRENO was one of the top-ranked systems in the BioCreative VI BioID track and achieved micro F-scores of 0.776 (entity recognition) and 0.755 (entity normalization) on the official test set, respectively.
Additionally, the SPRENO-CNN exhibited better precisions with lower recalls and F-scores (0.720/0.711 for entity recognition/normalization). SPRENO is freely available at https://bigodatamining.github.io/software/201801/. Database URL: https://bigodatamining.github.io/software/201801/
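The majority-rule disambiguation described in this abstract can be sketched as follows. This is only an illustration of the general idea, not SPRENO's code: the function name, the tie-breaking behaviour, and the placeholder identifiers are assumptions, and real NCBI taxonomy IDs would replace the labels used here.

```python
from collections import Counter


def resolve_by_majority(candidate_ids, previously_linked_ids):
    """Pick the candidate ID seen most often among earlier linked mentions.

    Returns None when no candidate has been linked before. Ties go to the
    candidate that was encountered first in the document.
    """
    counts = Counter(
        linked for linked in previously_linked_ids if linked in candidate_ids
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical example: an ambiguous mention with two candidate organisms,
# resolved using IDs already linked earlier in the same document.
candidates = {"virus-id", "fly-id"}
earlier_links = ["fly-id", "human-id", "fly-id", "virus-id"]
chosen = resolve_by_majority(candidates, earlier_links)
```

A learned model such as the CNN variant mentioned above could replace this rule when context and distance to the mention carry more signal than raw frequency.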
Affiliation(s)
- Hong-Jie Dai
  - Department of Computer Science and Information Engineering, National Taitung University, 369, Sec. 2, University Rd., Taitung, Taiwan, R.O.C.
  - Interdisciplinary Program of Green and Information Technology, National Taitung University, 369, Sec. 2, University Rd., Taitung, Taiwan, R.O.C.
- Onkar Singh
  - Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei, Taiwan, R.O.C.
  - Institute of Biomedical Informatics, National Yang-Ming University, No. 155, Sec. 2, Linong Street, Taipei, 112 Taiwan, R.O.C.
|
9
|
Thompson P, Daikou S, Ueno K, Batista-Navarro R, Tsujii J, Ananiadou S. Annotation and detection of drug effects in text for pharmacovigilance. J Cheminform 2018; 10:37. [PMID: 30105604 PMCID: PMC6089860 DOI: 10.1186/s13321-018-0290-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 01/12/2018] [Accepted: 07/20/2018] [Indexed: 02/02/2023] Open
Abstract
Pharmacovigilance (PV) databases record the benefits and risks of different drugs, as a means to ensure their safe and effective use. Creating and maintaining such resources can be complex, since a particular medication may have divergent effects in different individuals, due to specific patient characteristics and/or interactions with other drugs being administered. Textual information from various sources can provide important evidence to curators of PV databases about the usage and effects of drug targets in different medical subjects. However, the efficient identification of relevant evidence can be challenging, due to the increasing volume of textual data. Text mining (TM) techniques can support curators by automatically detecting complex information, such as interactions between drugs, diseases and adverse effects. This semantic information supports the quick identification of documents containing information of interest (e.g., the different types of patients in which a given adverse drug reaction has been observed to occur). TM tools are typically adapted to different domains by applying machine learning methods to corpora that are manually labelled by domain experts using annotation guidelines to ensure consistency. We present a semantically annotated corpus of 597 MEDLINE abstracts, PHAEDRA, encoding rich information on drug effects and their interactions, whose quality is assured through the use of detailed annotation guidelines and the demonstration of high levels of inter-annotator agreement (e.g., 92.6% F-Score for identifying named entities and 78.4% F-Score for identifying complex events, when relaxed matching criteria are applied). To our knowledge, the corpus is unique in the domain of PV, according to the level of detail of its annotations. To illustrate the utility of the corpus, we have trained TM tools based on its rich labels to recognise drug effects in text automatically. 
The corpus and annotation guidelines are available at: http://www.nactem.ac.uk/PHAEDRA/.
Affiliation(s)
- Paul Thompson
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
- Sophia Daikou
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
- Kenju Ueno
  - Artificial Intelligence Research Center, National Research and Development Agency (AIST), Tokyo Waterfront 2-3-2 Aomi, Koto-ku, Tokyo, 135-0064 Japan
- Riza Batista-Navarro
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
- Jun’ichi Tsujii
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
  - Artificial Intelligence Research Center, National Research and Development Agency (AIST), Tokyo Waterfront 2-3-2 Aomi, Koto-ku, Tokyo, 135-0064 Japan
- Sophia Ananiadou
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
|
10
|
Müller HM, Van Auken KM, Li Y, Sternberg PW. Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature. BMC Bioinformatics 2018; 19:94. [PMID: 29523070 PMCID: PMC5845379 DOI: 10.1186/s12859-018-2103-8] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 07/13/2017] [Accepted: 03/01/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The biomedical literature continues to grow at a rapid pace, making the challenge of knowledge retrieval and extraction ever greater. Tools that provide a means to search and mine the full text of literature thus represent an important way by which the efficiency of these processes can be improved. RESULTS We describe the next generation of the Textpresso information retrieval system, Textpresso Central (TPC). TPC builds on the strengths of the original system by expanding the full text corpus to include the PubMed Central Open Access Subset (PMC OA), as well as the WormBase C. elegans bibliography. In addition, TPC allows users to create a customized corpus by uploading and processing documents of their choosing. TPC is UIMA compliant, to facilitate compatibility with external processing modules, and takes advantage of Lucene indexing and search technology for efficient handling of millions of full text documents. Like Textpresso, TPC searches can be performed using keywords and/or categories (semantically related groups of terms), but to provide better context for interpreting and validating queries, search results may now be viewed as highlighted passages in the context of full text. To facilitate biocuration efforts, TPC also allows users to select text spans from the full text and annotate them, create customized curation forms for any data type, and send resulting annotations to external curation databases. As an example of such a curation form, we describe integration of TPC with the Noctua curation tool developed by the Gene Ontology (GO) Consortium. CONCLUSION Textpresso Central is an online literature search and curation platform that enables biocurators and biomedical researchers to search and mine the full text of literature by integrating keyword and category searches with viewing search results in the context of the full text. 
It also allows users to create customized curation interfaces, use those interfaces to make annotations linked to supporting evidence statements, and then send those annotations to any database in the world. Textpresso Central URL: http://www.textpresso.org/tpc.
Affiliation(s)
- H.-M. Müller
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
| | - K. M. Van Auken
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
| | - Y. Li
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
| | - P. W. Sternberg
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
|
11
|
Ouyang P, Lin B, Du J, Pan H, Yu H, He R, Huang Z. Global gene expression analysis of knockdown Triosephosphate isomerase (TPI) gene in human gastric cancer cell line MGC-803. Gene 2018; 647:61-72. [DOI: 10.1016/j.gene.2018.01.014] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 10/18/2017] [Revised: 12/09/2017] [Accepted: 01/03/2018] [Indexed: 02/07/2023]
|
12
|
Luo L, Yang Z, Lin H, Wang J. Document triage for identifying protein-protein interactions affected by mutations: a neural network ensemble approach. Database (Oxford) 2018; 2018:5103353. [PMID: 30295718 PMCID: PMC6147215 DOI: 10.1093/database/bay097] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 02/02/2018] [Revised: 08/19/2018] [Accepted: 08/21/2018] [Indexed: 01/09/2023]
Abstract
The precision medicine (PM) initiative promises to identify individualized treatment depending on a patient's genetic profile and their related responses. In order to help health professionals and researchers in the PM endeavor, BioCreative VI organized a PM Track to mine protein-protein interactions (PPI) affected by genetic mutations from the biomedical literature. In this paper, we present a neural network ensemble approach to identify relevant articles describing PPI affected by mutations. In this approach, several neural network models are used for document triage, and the ensemble performs better than any individual model. In the official runs, our best submission achieves an F-score of 69.04% in the BioCreative VI PM document triage task. After post-challenge analysis, to address the problem of the limited size of the training set, a PPI pre-trained module is incorporated into our approach to further improve the performance. Finally, our best ensemble method achieves an F-score of 71.04% on the test set.
Affiliation(s)
- Ling Luo
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Zhihao Yang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Hongfei Lin
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Jian Wang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
|
13
|
Eftimov T, Koroušić Seljak B, Korošec P. A rule-based named-entity recognition method for knowledge extraction of evidence-based dietary recommendations. PLoS One 2017. [PMID: 28644863 PMCID: PMC5482438 DOI: 10.1371/journal.pone.0179488] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Indexed: 02/06/2023] Open
Abstract
Evidence-based dietary information represented as unstructured text is crucial information that needs to be accessible in order to help dietitians keep up with the new knowledge that arrives daily in newly published scientific reports. Different named-entity recognition (NER) methods have been introduced previously to extract useful information from the biomedical literature. They focus on, for example, extracting gene mentions, protein mentions, relationships between genes and proteins, chemical concepts, and relationships between drugs and diseases. In this paper, we present a novel NER method, called drNER, for knowledge extraction of evidence-based dietary information. To the best of our knowledge, this is the first attempt at extracting dietary concepts. DrNER is a rule-based NER method that consists of two phases. The first involves the detection and determination of entity mentions, and the second involves the selection and extraction of the entities. We evaluate the method by using text corpora from heterogeneous sources, including text from several scientifically validated web sites and text from scientific publications. Evaluation of the method showed that drNER gives good results and can be used for knowledge extraction of evidence-based dietary recommendations.
Affiliation(s)
- Tome Eftimov
- Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
- Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
| | | | - Peter Korošec
- Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
- Faculty of Mathematics, Natural Science and Information Technologies, Koper, Slovenia
|
14
|
Aydın F, Hüsünbeyi ZM, Özgür A. Automatic query generation using word embeddings for retrieving passages describing experimental methods. Database (Oxford) 2017; 2017:baw166. [PMID: 28077568 PMCID: PMC5225401 DOI: 10.1093/database/baw166] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 02/26/2016] [Revised: 12/01/2016] [Accepted: 12/01/2016] [Indexed: 01/01/2023]
Abstract
Information regarding the physical interactions among proteins is crucial, since protein–protein interactions (PPIs) are central for many biological processes. The experimental techniques used to verify PPIs are vital for characterizing and assessing the reliability of the identified PPIs. Much of the information about PPIs and the experimental methods is available only in the text of the scientific publications that report them. In this study, we approach the problem of identifying passages with experimental methods for physical interactions between proteins as an information retrieval search task. The baseline system is based on query matching, where the queries are generated by utilizing the names (including synonyms) of the experimental methods in the Proteomics Standard Initiative–Molecular Interactions (PSI-MI) ontology. We propose two methods, where the baseline queries are expanded by including additional relevant terms. The first method is a supervised approach, where the most salient terms for each experimental method are obtained by using the term frequency–relevance frequency (tf.rf) metric over 13 articles from our manually annotated data set of 30 full text articles, which is made publicly available. The second method is an unsupervised approach, where the queries for each experimental method are expanded by using the word embeddings of the names of the experimental methods in the PSI-MI ontology. The word embeddings are obtained by utilizing a large unlabeled full text corpus. The proposed methods are evaluated on the test set consisting of 17 articles. Both methods obtain higher recall scores compared with the baseline, with a loss in precision. Besides higher recall, the word-embedding-based approach achieves a higher F-measure than the baseline and the tf.rf-based methods. 
We also show that incorporating gene name and interaction keyword identification leads to improved precision and F-measure scores for all three evaluated methods. The tf.rf based approach was developed as part of our participation in the Collaborative Biocurator Assistant Task of the BioCreative V challenge assessment, whereas the word embeddings based approach is a novel contribution of this article. Database URL: https://github.com/ferhtaydn/biocemid/
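As an illustration of the supervised expansion step described in this abstract, the sketch below scores candidate terms with tf.rf and appends the top-ranked ones to a baseline query. It assumes the commonly used formulation rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive and negative passages containing the term; the abstract itself does not spell out the formula. The function names, the tie-breaking rule, and the choice to count tf over the positive passages are hypothetical and are not taken from the authors' code.

```python
import math
from collections import Counter


def tf_rf_scores(positive_docs, negative_docs):
    """Score each term seen in the positive documents with tf * rf.

    Assumed formulation: rf = log2(2 + a / max(1, c)), where a and c are the
    numbers of positive and negative documents that contain the term.
    tf is counted over all positive documents (an illustrative choice).
    """
    tf = Counter(token for doc in positive_docs for token in doc)
    pos_df = Counter(token for doc in positive_docs for token in set(doc))
    neg_df = Counter(token for doc in negative_docs for token in set(doc))
    return {
        term: count * math.log2(2 + pos_df[term] / max(1, neg_df[term]))
        for term, count in tf.items()
    }


def expand_query(base_terms, positive_docs, negative_docs, k=2):
    """Append the k highest-scoring new terms to the baseline query terms."""
    scores = tf_rf_scores(positive_docs, negative_docs)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    extra = [term for term in ranked if term not in base_terms][:k]
    return list(base_terms) + extra
```

Here each document is a list of tokens, positive documents are passages annotated with a given experimental method, and negative documents are all other passages. Terms that occur often in the positive passages but rarely elsewhere receive the highest weights.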
Affiliation(s)
- Ferhat Aydın
- Department of Computer Engineering, Boğaziçi University, TR-34342 Bebek, Istanbul, Turkey
| | - Zehra Melce Hüsünbeyi
- Department of Computer Engineering, Boğaziçi University, TR-34342 Bebek, Istanbul, Turkey
| | - Arzucan Özgür
- Department of Computer Engineering, Boğaziçi University, TR-34342 Bebek, Istanbul, Turkey
|
15
|
Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Chang CS, Oughtred R, Rust J, Wilbur WJ, Comeau DC, Dolinski K, Tyers M. The BioC-BioGRID corpus: full text articles annotated for curation of protein-protein and genetic interactions. Database (Oxford) 2017; 2017:baw147. [PMID: 28077563 PMCID: PMC5225395 DOI: 10.1093/database/baw147] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 07/30/2016] [Revised: 10/14/2016] [Accepted: 10/18/2016] [Indexed: 11/13/2022]
Abstract
A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein–protein and genetic interactions (PPIs and GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. 
The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report. Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html
Affiliation(s)
- Rezarta Islamaj Dogan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD20894, USA
| | - Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD20894, USA
| | - Andrew Chatr-Aryamontri
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada
| | - Christie S Chang
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Rose Oughtred
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Jennifer Rust
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD20894, USA
| | - Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD20894, USA
| | - Kara Dolinski
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Mike Tyers
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada
- Mount Sinai Hospital, The Lunenfeld-Tanenbaum Research Institute, Canada
|
16
|
Gutiérrez-Sacristán A, Bravo À, Portero-Tresserra M, Valverde O, Armario A, Blanco-Gandía M, Farré A, Fernández-Ibarrondo L, Fonseca F, Giraldo J, Leis A, Mané A, Mayer M, Montagud-Romero S, Nadal R, Ortiz J, Pavon FJ, Perez EJ, Rodríguez-Arias M, Serrano A, Torrens M, Warnault V, Sanz F, Furlong LI. Text mining and expert curation to develop a database on psychiatric diseases and their genes. Database (Oxford) 2017; 2017:3891487. [PMID: 29220439 PMCID: PMC5502359 DOI: 10.1093/database/bax043] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 10/28/2016] [Revised: 04/27/2017] [Accepted: 05/01/2017] [Indexed: 01/15/2023]
Abstract
Database URL http://www.psygenet.org. PsyGeNET corpus http://www.psygenet.org/ds/PsyGeNET/results/psygenetCorpus.tar.
Affiliation(s)
- Alba Gutiérrez-Sacristán
- Research Group on Integrative Biomedical Informatics (GRIB), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), C/Dr. Aiguader 88, Barcelona 08003, Spain
| | - Àlex Bravo
- Research Group on Integrative Biomedical Informatics (GRIB), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), C/Dr. Aiguader 88, Barcelona 08003, Spain
| | - Marta Portero-Tresserra
- Neurobiology of Behaviour Research Group (GReNeC), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Olga Valverde
- Neurobiology of Behaviour Research Group (GReNeC), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Antonio Armario
- Institut de Neurociències and Animal Physiology Unit, Universitat Autònoma de Barcelona (UAB), Barcelona, Spain
- Network Biomedical Research Center on Mental Health (CIBERSAM)
| | - M.C. Blanco-Gandía
- Department of Psychobiology, Facultad de Psicología, Universitat de València, València, Spain
| | - Adriana Farré
- Institute of Neuropsychiatry and Addiction, Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), Parc de Salut Mar, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
| | - Lierni Fernández-Ibarrondo
- Programa de Cáncer (IMIM), Investigación Traslacional en Neoplasias Colorrectales, C/Dr. Aiguader 88, Barcelona, Spain
| | - Francina Fonseca
- Institute of Neuropsychiatry and Addiction, Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), Parc de Salut Mar, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
| | - Jesús Giraldo
- Network Biomedical Research Center on Mental Health (CIBERSAM)
- Institut de Neurociències and Unitat de Bioestadística, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
| | - Angela Leis
- Research Group on Integrative Biomedical Informatics (GRIB), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), C/Dr. Aiguader 88, Barcelona 08003, Spain
| | - Anna Mané
- Network Biomedical Research Center on Mental Health (CIBERSAM)
- Institute of Neuropsychiatry and Addiction, Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), Parc de Salut Mar, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
| | - M.A. Mayer
- Research Group on Integrative Biomedical Informatics (GRIB), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), C/Dr. Aiguader 88, Barcelona 08003, Spain
| | - Sandra Montagud-Romero
- Department of Psychobiology, Facultad de Psicología, Universitat de València, València, Spain
| | - Roser Nadal
- Network Biomedical Research Center on Mental Health (CIBERSAM)
- Institut de Neurociències and Psychobiology Area, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
| | - Jordi Ortiz
- Network Biomedical Research Center on Mental Health (CIBERSAM)
- Neuroscience Institute and Department of Biochemistry and Molecular Biology, School of Medicine, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
| | - Francisco Javier Pavon
- Unidad de Gestión Clínica de Salud Mental, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospital Regional Universitario de Málaga, Málaga, Spain
| | - Ezequiel Jesús Perez
- Institute of Neuropsychiatry and Addiction, Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), Parc de Salut Mar, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
| | - Marta Rodríguez-Arias
- Department of Psychobiology, Facultad de Psicología, Universitat de València, València, Spain
| | - Antonia Serrano
- Unidad de Gestión Clínica de Salud Mental, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospital Regional Universitario de Málaga, Málaga, Spain
| | - Marta Torrens
- Institute of Neuropsychiatry and Addiction, Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), Parc de Salut Mar, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
| | - Vincent Warnault
- Neurobiology of Behaviour Research Group (GReNeC), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Ferran Sanz
- Research Group on Integrative Biomedical Informatics (GRIB), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), C/Dr. Aiguader 88, Barcelona 08003, Spain
| | - Laura I. Furlong
- Research Group on Integrative Biomedical Informatics (GRIB), Institut Hospital del Mar d'Investigacions Mèdiques (IMIM), DCEXS, Universitat Pompeu Fabra (UPF), C/Dr. Aiguader 88, Barcelona 08003, Spain
|
17
|
Chatr-Aryamontri A, Oughtred R, Boucher L, Rust J, Chang C, Kolas NK, O'Donnell L, Oster S, Theesfeld C, Sellam A, Stark C, Breitkreutz BJ, Dolinski K, Tyers M. The BioGRID interaction database: 2017 update. Nucleic Acids Res 2016; 45:D369-D379. [PMID: 27980099 PMCID: PMC5210573 DOI: 10.1093/nar/gkw1102] [Citation(s) in RCA: 682] [Impact Index Per Article: 85.3] [Received: 09/29/2016] [Revised: 10/25/2016] [Accepted: 10/27/2016] [Indexed: 01/05/2023] Open
Abstract
The Biological General Repository for Interaction Datasets (BioGRID: https://thebiogrid.org) is an open access database dedicated to the annotation and archival of protein, genetic and chemical interactions for all major model organism species and humans. As of September 2016 (build 3.4.140), the BioGRID contains 1 072 173 genetic and protein interactions, and 38 559 post-translational modifications, as manually annotated from 48 114 publications. This dataset represents interaction records for 66 model organisms and represents a 30% increase compared to the previous 2015 BioGRID update. BioGRID curates the biomedical literature for major model organism species, including humans, with a recent emphasis on central biological processes and specific human diseases. To facilitate network-based approaches to drug discovery, BioGRID now incorporates 27 501 chemical-protein interactions for human drug targets, as drawn from the DrugBank database. A new dynamic interaction network viewer allows the easy navigation and filtering of all genetic and protein interaction data, as well as for bioactive compounds and their established targets. BioGRID data are directly downloadable without restriction in a variety of standardized formats and are freely distributed through partner model organism databases and meta-databases.
Affiliation(s)
- Andrew Chatr-Aryamontri
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, Quebec H3T 1J4, Canada
| | - Rose Oughtred
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Lorrie Boucher
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
| | - Jennifer Rust
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Christie Chang
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Nadine K Kolas
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
| | - Lara O'Donnell
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
| | - Sara Oster
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
| | - Chandra Theesfeld
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Adnane Sellam
- Centre Hospitalier de l'Université Laval (CHUL), Québec, Québec G1V 4G2, Canada
| | - Chris Stark
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
| | - Bobby-Joe Breitkreutz
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
| | - Kara Dolinski
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Mike Tyers
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, Quebec H3T 1J4, Canada
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
|
18
|
Wang Q, S Abdul S, Almeida L, Ananiadou S, Balderas-Martínez YI, Batista-Navarro R, Campos D, Chilton L, Chou HJ, Contreras G, Cooper L, Dai HJ, Ferrell B, Fluck J, Gama-Castro S, George N, Gkoutos G, Irin AK, Jensen LJ, Jimenez S, Jue TR, Keseler I, Madan S, Matos S, McQuilton P, Milacic M, Mort M, Natarajan J, Pafilis E, Pereira E, Rao S, Rinaldi F, Rothfels K, Salgado D, Silva RM, Singh O, Stefancsik R, Su CH, Subramani S, Tadepally HD, Tsaprouni L, Vasilevsky N, Wang X, Chatr-Aryamontri A, Laulederkind SJF, Matis-Mitchell S, McEntyre J, Orchard S, Pundir S, Rodriguez-Esteban R, Van Auken K, Lu Z, Schaeffer M, Wu CH, Hirschman L, Arighi CN. Overview of the interactive task in BioCreative V. Database (Oxford) 2016; 2016:baw119. [PMID: 27589961 PMCID: PMC5009325 DOI: 10.1093/database/baw119] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 04/04/2016] [Accepted: 07/28/2016] [Indexed: 11/14/2022]
Abstract
Fully automated text mining (TM) systems promote efficient literature searching, retrieval, and review but are not sufficient to produce ready-to-consume curated documents. These systems are not meant to replace biocurators, but instead to assist them in one or more literature curation steps. To do so, the user interface is an important aspect that needs to be considered for tool adoption. The BioCreative Interactive task (IAT) is a track designed for exploring user-system interactions, promoting development of useful TM tools, and providing a communication channel between the biocuration and the TM communities. In BioCreative V, the IAT track followed a format similar to previous interactive tracks, where the utility and usability of TM tools, as well as the generation of use cases, have been the focal points. The proposed curation tasks are user-centric and formally evaluated by biocurators. In BioCreative V IAT, seven TM systems and 43 biocurators participated. Two levels of user participation were offered to broaden curator involvement and obtain more feedback on usability aspects. The full level participation involved training on the system, curation of a set of documents with and without TM assistance, tracking of time-on-task, and completion of a user survey. The partial level participation was designed to focus on usability aspects of the interface and not the performance per se. In this case, biocurators navigated the system by performing pre-designed tasks and then were asked whether they were able to achieve the task and the level of difficulty in completing the task. In this manuscript, we describe the development of the interactive task, from planning to execution and discuss major findings for the systems tested. Database URL:http://www.biocreative.org
Affiliation(s)
- Qinghua Wang
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19711, USA
| | - Shabbir S Abdul
- International Centre of Health Information Technology, Taipei Medical University, Taipei, Taiwan
| | - Lara Almeida
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Sophia Ananiadou
- National Centre for Text Mining, University of Manchester, Manchester, UK
| | | | | | | | - Lucy Chilton
- Northern Institute for Cancer Research, Newcastle University, Newcastle upon Tyne, UK
| | - Hui-Jou Chou
- Rutgers University-Camden, Camden, NJ 08102, USA
| | - Gabriela Contreras
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, 04510 Ciudad de México, México
| | - Laurel Cooper
- Department of Botany and Plant Pathology, Oregon State University Corvallis, OR 97331, USA
| | - Hong-Jie Dai
- Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan
| | - Barbra Ferrell
- College of Agriculture and Natural Resources, University of Delaware, Newark, DE 19711, USA
| | - Juliane Fluck
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, 53754 St. Augustin, Germany
| | - Socorro Gama-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, 04510 Ciudad de México, México
| | | | - Georgios Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham B15 2TT, UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham B15 2TT, UK
| | - Afroza K Irin
- Life Science Informatics, University of Bonn, Bonn, Germany
| | - Lars J Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Silvia Jimenez
- Blue Brain Project, École Polytechnique Fédérale de Lausanne (EPFL) Biotech Campus, Geneva, Switzerland
| | - Toni R Jue
- Prince of Wales Clinical School, University of New South Wales NSW, Sydney, New South Wales, Australia
| | | | - Sumit Madan
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, 53754 St. Augustin, Germany
| | - Sérgio Matos
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | | | - Marija Milacic
- Department of Informatics and Bio-Computing, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
| | - Matthew Mort
- HGMD, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, UK
| | - Jeyakumar Natarajan
- Department of Bioinformatics, Bharathiar University, Coimbatore, Tamil Nadu, India
| | - Evangelos Pafilis
- Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
| | - Emiliano Pereira
- Microbial Genomics and Bioinformatics Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
| | - Shruti Rao
- Innovation Center for Biomedical Informatics (ICBI), Georgetown University, Washington, DC 20007, USA
| | - Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland
| | - Karen Rothfels
- Department of Informatics and Bio-Computing, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
| | - David Salgado
- GMGF, Aix-Marseille Universite, 13385 Marseille, France
- Inserm, UMR_S 910, 13385 Marseille, France
| | - Raquel M Silva
- Department of Medical Sciences, iBiMED & IEETA, University of Aveiro, 3810-193 Aveiro, Portugal
| | - Onkar Singh
- Taipei Medical University Graduate Institute of Biomedical informatics, Taipei, Taiwan
| | | | - Chu-Hsien Su
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Suresh Subramani
- Department of Bioinformatics, Bharathiar University, Coimbatore, Tamil Nadu, India
| | | | - Loukia Tsaprouni
- Institute of Sport and Physical Activity Research (ISPAR), University of Bedfordshire, Bedford, UK
| | - Nicole Vasilevsky
- Ontology Development Group, Oregon Health & Science University, Portland, OR 97239, USA
| | - Xiaodong Wang
- WormBase Consortium, Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
| | | | | | | | | | - Sandra Orchard
- European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
| | - Sangya Pundir
- European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
| | | | - Kimberly Van Auken
- WormBase Consortium, Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Institutes of Health, Bethesda, MD 20894, USA
| | - Mary Schaeffer
- MaizeGDB USDA ARS and University of Missouri, Columbia, MO 65211, USA
| | - Cathy H Wu
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19711, USA
| | | | - Cecilia N Arighi
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19711, USA
|
19
|
Chang YC, Chu CH, Su YC, Chen CC, Hsu WL. PIPE: a protein-protein interaction passage extraction module for BioCreative challenge. Database (Oxford) 2016; 2016:baw101. [PMID: 27524807 PMCID: PMC4983456 DOI: 10.1093/database/baw101] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 12/02/2015] [Accepted: 06/02/2016] [Indexed: 11/13/2022]
Abstract
Identifying the interactions between proteins mentioned in biomedical literatures is one of the frequently discussed topics of text mining in the life science field. In this article, we propose PIPE, an interaction pattern generation module used in the Collaborative Biocurator Assistant Task at BioCreative V (http://www.biocreative.org/) to capture frequent protein-protein interaction (PPI) patterns within text. We also present an interaction pattern tree (IPT) kernel method that integrates the PPI patterns with convolution tree kernel (CTK) to extract PPIs. Methods were evaluated on LLL, IEPA, HPRD50, AIMed and BioInfer corpora using cross-validation, cross-learning and cross-corpus evaluation. Empirical evaluations demonstrate that our method is effective and outperforms several well-known PPI extraction methods. Database URL:
Affiliation(s)
- Yung-Chun Chang
- Institute of Information Science, Academia Sinica, Taipei, Taiwan; Department of Information Management, National Taiwan University, Taipei, Taiwan
| | - Chun-Han Chu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Yu-Chen Su
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
| | - Chien Chin Chen
- Department of Information Management, National Taiwan University, Taipei, Taiwan
| | - Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| |
|
20
|
Shin SY, Kim S, Wilbur WJ, Kwon D. BioC viewer: a web-based tool for displaying and merging annotations in BioC. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw106. [PMID: 27515823 PMCID: PMC4980568 DOI: 10.1093/database/baw106] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 02/05/2016] [Accepted: 06/23/2016] [Indexed: 12/20/2022]
Abstract
BioC is an XML-based format designed to provide interoperability for text mining tools and manual curation results. A challenge of BioC as a standard format is to align annotations from multiple systems. Ideally, this should not be a major problem if users follow guidelines given by BioC key files. Nevertheless, the misalignment between text and annotations happens quite often because different systems tend to use different software development environments, e.g. ASCII vs. Unicode. We first implemented the BioC Viewer to assist BioGRID curators as a part of the BioCreative V BioC track (Collaborative Biocurator Assistant Task). For the BioC track, the BioC Viewer helped curate protein-protein interaction and genetic interaction pairs appearing in full-text articles. Here, we describe the BioC Viewer itself as well as improvements made to the BioC Viewer since the BioCreative V Workshop to address the misalignment issue of BioC annotations. While uploading BioC files, a BioC merge process is offered when there are files from the same full-text article. If there is a mismatch between an annotated offset and text, the BioC Viewer adjusts the offset to correctly align with the text. The BioC Viewer has a user-friendly interface, where most operations can be performed within a few mouse clicks. The feedback from BioGRID curators has been positive for the web interface, particularly for its usability and learnability. Database URL: http://viewer.bioqrator.org
Affiliation(s)
- Soo-Yong Shin
- Department of Biomedical Informatics, Asan Medical Center, Seoul 05505, Korea
| | - Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Dongseop Kwon
- Department of Computer Engineering, Myongji University, Yongin, Gyeonggi-do 17058, Korea
| |
|
21
|
Peng Y, Arighi C, Wu CH, Vijay-Shanker K. BioC-compatible full-text passage detection for protein-protein interactions using extended dependency graph. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw072. [PMID: 27170286 PMCID: PMC4915133 DOI: 10.1093/database/baw072] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 12/04/2015] [Accepted: 04/12/2016] [Indexed: 12/04/2022]
Abstract
There has been a large growth in the number of biomedical publications that report experimental results. Many of these results concern detection of protein–protein interactions (PPI). In BioCreative V, we participated in the BioC task and developed a PPI system to detect text passages with PPIs in the full-text articles. By adopting the BioC format, the output of the system can be seamlessly added to the biocuration pipeline with little effort required for the system integration. A distinctive feature of our PPI system is that it utilizes an extended dependency graph, an intermediate level of representation that attempts to abstract away syntactic variations in text. As a result, we are able to use only a limited set of rules to extract PPI pairs in the sentences, and additional rules to detect additional passages for PPI pairs. For evaluation, we used the 95 articles that were provided for the BioC annotation task. We retrieved the unique PPIs from the BioGRID database for these articles and show that our system achieves a recall of 83.5%. In order to evaluate the detection of passages with PPIs, we further annotated the Abstract and Results sections of 20 documents from the dataset and show that an f-value of 80.5% was obtained. To evaluate the generalizability of the system, we also conducted experiments on AIMed, a well-known PPI corpus. We achieved an f-value of 76.1% for sentence detection and an f-value of 64.7% for unique PPI detection. Database URL: http://proteininformationresource.org/iprolink/corpora
Affiliation(s)
- Yifan Peng
- Computer & Information Sciences, University of Delaware
| | - Cecilia Arighi
- Computer & Information Sciences, University of Delaware and Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE 19716, USA
| | - Cathy H Wu
- Computer & Information Sciences, University of Delaware and Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE 19716, USA
| | | |
|