1
Zhang Z, Fang M, Wu R, Zong H, Huang H, Tong Y, Xie Y, Cheng S, Wei Z, Crabbe MJC, Zhang X, Wang Y. Large-Scale Biomedical Relation Extraction Across Diverse Relation Types: Model Development and Usability Study on COVID-19. J Med Internet Res 2023; 25:e48115. [PMID: 37632414 PMCID: PMC10551783 DOI: 10.2196/48115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Revised: 07/03/2023] [Accepted: 08/25/2023] [Indexed: 08/28/2023]
Abstract
BACKGROUND Biomedical relation extraction (RE) is of great importance for researchers to conduct systematic biomedical studies. It not only helps knowledge mining, such as knowledge graphs and novel knowledge discovery, but also promotes translational applications, such as clinical diagnosis, decision-making, and precision medicine. However, the relations between biomedical entities are complex and diverse, and comprehensive biomedical RE is not yet well established.
OBJECTIVE We aimed to investigate and improve large-scale RE with diverse relation types and conduct usability studies with application scenarios to optimize biomedical text mining.
METHODS Data sets containing 125 relation types with different entity semantic levels were constructed to evaluate the impact of entity semantic information on RE, and performance analysis was conducted on different model architectures and domain models. This study also proposed a continued pretraining strategy and integrated models with scripts into a tool. Furthermore, this study applied RE to the COVID-19 corpus with article topics and application scenarios of clinical interest to assess and demonstrate its biological interpretability and usability.
RESULTS The performance analysis revealed that RE achieves the best performance when the detailed semantic type is provided. For a single model, PubMedBERT with continued pretraining performed the best, with an F1-score of 0.8998. Usability studies on COVID-19 demonstrated the interpretability and usability of RE, and a relation graph database was constructed, which was used to reveal existing and novel drug paths with edge explanations. The models (including pretrained and fine-tuned models), integrated tool (Docker), and generated data (including the COVID-19 relation graph database and drug paths) have been made publicly available to the biomedical text mining community and clinical researchers.
CONCLUSIONS This study provided a comprehensive analysis of RE with diverse relation types. Optimized RE models and tools for diverse relation types were developed, which can be widely used in biomedical text mining. Our usability studies provided a proof-of-concept demonstration of how large-scale RE can be leveraged to facilitate novel research.
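For reference, the F1-score quoted in this abstract is conventionally the harmonic mean of precision and recall. The abstract does not say which averaging scheme (micro, macro, or weighted) was applied across the 125 relation types, so the block below is only the standard single-class definition, with TP, FP, and FN standing for true positives, false positives, and false negatives; it is not a statement of the authors' exact metric.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
\]
```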
Affiliation(s)
- Zeyu Zhang
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
  - Department of Clinical Laboratory Medicine Center, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, Shanghai, China
- Meng Fang
  - Department of Laboratory Medicine, Shanghai Eastern Hepatobiliary Surgery Hospital, Shanghai, China
- Rebecca Wu
  - University of California, Berkeley, Berkeley, CA, United States
- Hui Zong
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
  - Institutes for Systems Genetics, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
- Honglian Huang
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Yuantao Tong
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Yujia Xie
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Shiyang Cheng
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Ziyi Wei
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- M James C Crabbe
  - Wolfson College, Oxford University, Oxford, United Kingdom
  - Institute of Biomedical and Environmental Science & Technology, University of Bedfordshire, Luton, United Kingdom
  - School of Life Sciences, Shanxi University, Taiyuan, China
- Xiaoyan Zhang
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Ying Wang
  - Research Center for Translational Medicine, Shanghai East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China
  - Department of Clinical Laboratory Medicine Center, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, Shanghai, China
  - Department of Laboratory Medicine, Shanghai Eastern Hepatobiliary Surgery Hospital, Shanghai, China
2
Islamaj R, Kwon D, Kim S, Lu Z. TeamTat: a collaborative text annotation tool. Nucleic Acids Res 2020; 48:W5-W11. [PMID: 32383756 DOI: 10.1093/nar/gkaa333] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 03/03/2020] [Revised: 04/16/2020] [Accepted: 04/22/2020] [Indexed: 12/20/2022]
Abstract
Manually annotated data is key to developing text-mining and information-extraction algorithms. However, human annotation requires considerable time, effort and expertise. Given the rapid growth of biomedical literature, it is paramount to build tools that facilitate speed and maintain expert quality. While existing text annotation tools may provide user-friendly interfaces to domain experts, limited support is available for figure display, project management, and multi-user team annotation. In response, we developed TeamTat (https://www.teamtat.org), a web-based annotation tool (local setup available), equipped to manage team annotation projects engagingly and efficiently. TeamTat is a novel tool for managing multi-user, multi-label document annotation, reflecting the entire production life cycle. Project managers can specify annotation schema for entities and relations and select annotator(s) and distribute documents anonymously to prevent bias. Document input format can be plain text, PDF or BioC (uploaded locally or automatically retrieved from PubMed/PMC), and output format is BioC with inline annotations. TeamTat displays figures from the full text for the annotator's convenience. Multiple users can work on the same document independently in their workspaces, and the team manager can track task completion. TeamTat provides corpus quality assessment via inter-annotator agreement statistics, and a user-friendly interface convenient for annotation review and inter-annotator disagreement resolution to improve corpus quality.
Affiliation(s)
- Rezarta Islamaj
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Dongseop Kwon
  - School of Software Convergence, Myongji University, Seoul 03674, South Korea
- Sun Kim
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
3
Grainha TRR, Jorge PADS, Pérez-Pérez M, Pérez Rodríguez G, Pereira MOBO, Lourenço AMG. Exploring anti-quorum sensing and anti-virulence based strategies to fight Candida albicans infections: an in silico approach. FEMS Yeast Res 2019. [PMID: 29518242 DOI: 10.1093/femsyr/foy022] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 12/15/2022]
Abstract
The complex virulence attributes of Candida albicans are an attractive target to exploit in the development of new antifungals and anti-virulence strategies to combat C. albicans infections. Particularly, quorum sensing (QS) has been reported as critical for virulence regulation in C. albicans. This work presents two knowledge networks with up-to-date information about QS regulation and experimentally tested anti-QS and anti-virulence agents for C. albicans. A semi-automatic bioinformatics workflow that combines literature mining and expert curation was used to retrieve otherwise scattered information from the scientific literature. The network representation offers an innovative and continuously updatable means for the Candida research community to query QS and virulence data systematically and in a user-friendly way. Notably, the reconstructed networks show the complexity of QS regulation and the impact that some molecules have on the inhibition of virulence mechanisms responsible for infection establishment (e.g. hyphal development) and perseverance (e.g. biofilm formation). In the future, the compiled knowledge may be used to build decision-making models that help infer new knowledge of practical significance. The knowledge networks are publicly available at http://pcquorum.org/. This Web platform enables the exploration of fungal virulence cues as well as reported inhibitors in a user-friendly fashion.
Affiliation(s)
- Tânia Raquel Rodrigues Grainha
  - CEB-Centre of Biological Engineering, LIBRO-Laboratory of Research in Biofilms Rosário Oliveira, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
- Paula Alexandra da Silva Jorge
  - CEB-Centre of Biological Engineering, LIBRO-Laboratory of Research in Biofilms Rosário Oliveira, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
- Martín Pérez-Pérez
  - ESEI-Department of Computer Science, University of Vigo, Edificio Politecnico, s/n Campus As Lagoas, 32004 Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
- Gael Pérez Rodríguez
  - ESEI-Department of Computer Science, University of Vigo, Edificio Politecnico, s/n Campus As Lagoas, 32004 Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
- Maria Olívia Baptista Oliveira Pereira
  - CEB-Centre of Biological Engineering, LIBRO-Laboratory of Research in Biofilms Rosário Oliveira, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
- Anália Maria Garcia Lourenço
  - ESEI-Department of Computer Science, University of Vigo, Edificio Politecnico, s/n Campus As Lagoas, 32004 Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
  - Centre of Biological Engineering (CEB), University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
4
Labbé C, Grima N, Gautier T, Favier B, Byrne JA. Semi-automated fact-checking of nucleotide sequence reagents in biomedical research publications: The Seek & Blastn tool. PLoS One 2019; 14:e0213266. [PMID: 30822319 PMCID: PMC6396917 DOI: 10.1371/journal.pone.0213266] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 07/12/2018] [Accepted: 02/18/2019] [Indexed: 12/14/2022]
Abstract
Nucleotide sequence reagents are verifiable experimental reagents in biomedical publications, because their sequence identities can be independently verified and compared with associated text descriptors. We have previously reported that incorrectly identified nucleotide sequence reagents are characteristic of highly similar human gene knockdown studies, some of which have been retracted from the literature on account of possible research fraud. Because of the throughput limitations of manual verification of nucleotide sequences, we developed a semi-automated fact checking tool, Seek & Blastn, to verify the targeting or non-targeting status of published nucleotide sequence reagents. From previously described and unknown corpora of 48 and 155 publications, respectively, Seek & Blastn correctly extracted 304/342 (88.9%) and 1066/1522 (70.0%) nucleotide sequences and a predicted targeting/non-targeting status. Seek & Blastn correctly predicted the targeting/non-targeting status of 293/304 (96.4%) and 988/1066 (92.7%) of the correctly extracted nucleotide sequences. A total of 38/39 (97.4%) or 31/79 (39.2%) Seek & Blastn predictions of incorrect nucleotide sequence reagent use were correct in the two literature corpora. Combined Seek & Blastn and manual analyses identified a list of 91 misidentified nucleotide sequence reagents, which could be built upon through future studies. In summary, incorrect nucleotide sequence reagents represent an under-recognized source of error within the biomedical literature, and fact checking tools such as Seek & Blastn may help to identify papers and manuscripts affected by these errors.
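The ratios quoted in this abstract can be re-derived from the reported counts. The sketch below is only an arithmetic check, not part of Seek & Blastn: the labels "A" and "B" (first-listed and second-listed corpus), the helper `percent`, and the derived "end-to-end" rate are illustrative assumptions, and the end-to-end figure is computed here rather than reported in the abstract.

```python
# Counts copied from the abstract: (label, numerator, denominator, reported percentage).
# "A" and "B" are hypothetical labels for the first- and second-listed corpus.
REPORTED = [
    ("A: sequences correctly extracted", 304, 342, 88.9),
    ("B: sequences correctly extracted", 1066, 1522, 70.0),
    ("A: status correctly predicted", 293, 304, 96.4),
    ("B: status correctly predicted", 988, 1066, 92.7),
    ("A: correct 'incorrect reagent' predictions", 38, 39, 97.4),
    ("B: correct 'incorrect reagent' predictions", 31, 79, 39.2),
]


def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage rounded to one decimal."""
    return round(100.0 * numerator / denominator, 1)


# Recompute each percentage and compare it with the value stated in the abstract.
CHECKS = [(label, percent(n, d), stated) for label, n, d, stated in REPORTED]

# Derived (not stated in the abstract): share of all sequences that were both
# correctly extracted and given a correct targeting/non-targeting status.
END_TO_END = {
    "A": percent(293, 342),
    "B": percent(988, 1522),
}

if __name__ == "__main__":
    for label, computed, stated in CHECKS:
        flag = "ok" if computed == stated else "MISMATCH"
        print(f"{label}: computed {computed}%, stated {stated}% [{flag}]")
    for corpus, rate in END_TO_END.items():
        print(f"{corpus}: derived end-to-end rate {rate}%")
```

Running the script prints one line per reported ratio, which makes it easy to see that extraction, not status prediction, accounts for most of the gap between the two corpora.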
Affiliation(s)
- Cyril Labbé
  - Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France
- Natalie Grima
  - Molecular Oncology Laboratory, Children’s Cancer Research Unit, Kids Research, The Children’s Hospital at Westmead, Westmead, New South Wales, Australia
- Thierry Gautier
  - INSERM U1209/CNRS UMR 5309, Univ. Grenoble Alpes, Grenoble, France
- Bertrand Favier
  - Univ. Grenoble Alpes, Team GREPI, Etablissement Français du Sang, La Tronche, France
- Jennifer A. Byrne
  - Molecular Oncology Laboratory, Children’s Cancer Research Unit, Kids Research, The Children’s Hospital at Westmead, Westmead, New South Wales, Australia
  - Discipline of Child and Adolescent Health, Faculty of Medicine and Health, The University of Sydney, Westmead, New South Wales, Australia
5
Akhondi SA, Rey H, Schwörer M, Maier M, Toomey J, Nau H, Ilchmann G, Sheehan M, Irmer M, Bobach C, Doornenbal M, Gregory M, Kors JA. Automatic identification of relevant chemical compounds from patents. Database (Oxford) 2019; 2019:5301319. [PMID: 30698776 PMCID: PMC6351730 DOI: 10.1093/database/baz001] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 08/29/2018] [Revised: 12/16/2018] [Accepted: 12/28/2018] [Indexed: 12/29/2022]
Abstract
In commercial research and development projects, public disclosure of new chemical compounds often takes place in patents. Only a small proportion of these compounds are published in journals, usually a few years after the patent. Patent authorities make available the patents but do not provide systematic continuous chemical annotations. Content databases such as Elsevier's Reaxys provide such services mostly based on manual excerptions, which are time-consuming and costly. Automatic text-mining approaches help overcome some of the limitations of the manual process. Different text-mining approaches exist to extract chemical entities from patents. The majority of them have been developed using sub-sections of patent documents and focus on mentions of compounds. Less attention has been given to relevancy of a compound in a patent. Relevancy of a compound to a patent is based on the patent's context. A relevant compound plays a major role within a patent. Identification of relevant compounds reduces the size of the extracted data and improves the usefulness of patent resources (e.g. supports identifying the main compounds). Annotators of databases like Reaxys only annotate relevant compounds. In this study, we design an automated system that extracts chemical entities from patents and classifies their relevance. The gold-standard set contained 18 789 chemical entity annotations. Of these, 10% were relevant compounds, 88% were irrelevant and 2% were equivocal. Our compound recognition system was based on proprietary tools. The performance (F-score) of the system on compound recognition was 84% on the development set and 86% on the test set. The relevancy classification system had an F-score of 86% on the development set and 82% on the test set. Our system can extract chemical compounds from patents and classify their relevance with high performance. This enables the extension of the Reaxys database by means of automation.
Affiliation(s)
- Saber A Akhondi
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, CA, Netherlands
  - Elsevier B.V., Radarweg 29, Amsterdam NX, The Netherlands
- Hinnerk Rey
  - Elsevier Information Systems GmbH, Theodor-Heuss-Allee 108, Frankfurt, Germany
- Markus Schwörer
  - Elsevier Information Systems GmbH, Theodor-Heuss-Allee 108, Frankfurt, Germany
- Michael Maier
  - Elsevier Information Systems GmbH, Theodor-Heuss-Allee 108, Frankfurt, Germany
- John Toomey
  - Elsevier Limited, 125 London Wall, London, UK
- Heike Nau
  - Elsevier Information Systems GmbH, Theodor-Heuss-Allee 108, Frankfurt, Germany
- Gabriele Ilchmann
  - Elsevier Information Systems GmbH, Theodor-Heuss-Allee 108, Frankfurt, Germany
- Mark Sheehan
  - Elsevier Information Systems GmbH, Theodor-Heuss-Allee 108, Frankfurt, Germany
- Matthias Irmer
  - OntoChem IT Solutions GmbH, Blücherstraße 24, Halle (Saale), Germany
- Claudia Bobach
  - OntoChem IT Solutions GmbH, Blücherstraße 24, Halle (Saale), Germany
- Jan A Kors
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, CA, Netherlands
6
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
  - Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
- Obdulia Rabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Anália Lourenço
  - ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain
  - Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain
  - CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
- Julen Oyarzabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Alfonso Valencia
  - Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain
  - Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
7
Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Chang CS, Oughtred R, Rust J, Wilbur WJ, Comeau DC, Dolinski K, Tyers M. The BioC-BioGRID corpus: full text articles annotated for curation of protein-protein and genetic interactions. Database (Oxford) 2017; 2017:baw147. [PMID: 28077563 PMCID: PMC5225395 DOI: 10.1093/database/baw147] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 07/30/2016] [Revised: 10/14/2016] [Accepted: 10/18/2016] [Indexed: 11/13/2022]
Abstract
A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein–protein and genetic interactions (PPIs and GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. 
The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report. Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html
Affiliation(s)
- Rezarta Islamaj Dogan
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Sun Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Andrew Chatr-Aryamontri
  - Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada
- Christie S Chang
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
- Rose Oughtred
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
- Jennifer Rust
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
- W John Wilbur
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Donald C Comeau
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Kara Dolinski
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
- Mike Tyers
  - Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada
  - Mount Sinai Hospital, The Lunenfeld-Tanenbaum Research Institute, Canada
8
Pérez-Pérez M, Pérez-Rodríguez G, Fdez-Riverola F, Lourenço A. Collaborative relation annotation and quality analysis in Markyt environment. Database (Oxford) 2017; 2017:4693828. [PMID: 29220479 PMCID: PMC5737204 DOI: 10.1093/database/bax090] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/18/2017] [Accepted: 11/09/2017] [Indexed: 11/30/2022]
Abstract
Text mining is showing potential to help in biomedical knowledge integration and discovery at various levels. However, results depend largely on the specifics of the knowledge problem and, in particular, on the ability to produce high-quality benchmarking corpora that may support the training and evaluation of automatic prediction systems. Annotation tools enabling the flexible and customizable production of such corpora are thus pivotal. The open-source Markyt annotation environment brings together the latest web technologies to offer a wide range of annotation capabilities in a domain-agnostic way. It enables the management of multi-user and multi-round annotation projects, including inter-annotator agreement and consensus assessments. Also, Markyt supports the description of entity and relation annotation guidelines on a project basis, being flexible to partial word tagging and the occurrence of annotation overlaps. This paper describes the current release of Markyt, namely new annotation perspectives, which enable the annotation of relations among entities, and enhanced analysis capabilities. Several demos, inspired by public biomedical corpora, are presented as a means to better illustrate such functionalities. Markyt aims to bring together annotation capabilities of broad interest to those producing annotated corpora. Markyt demonstration projects describe 20 different annotation tasks of varied document sources (e.g. abstracts, tweets or drug labels) and languages (e.g. English, Spanish or Chinese). Continuous development is based on feedback from practical applications as well as community reports on short- and medium-term mining challenges. Markyt is freely available for non-commercial use at http://markyt.org. Database URL: http://markyt.org
Affiliation(s)
- Martín Pérez-Pérez
  - ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
- Gael Pérez-Rodríguez
  - ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
- Florentino Fdez-Riverola
  - ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
- Anália Lourenço
  - ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
  - CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal