1
Irrera O, Marchesin S, Silvello G. MetaTron: advancing biomedical annotation empowering relation annotation and collaboration. BMC Bioinformatics 2024; 25:112. [PMID: 38486137 PMCID: PMC10941452 DOI: 10.1186/s12859-024-05730-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 03/04/2024] [Indexed: 03/17/2024] Open
Abstract
BACKGROUND The constant growth of biomedical data is accompanied by the need for new methodologies to effectively and efficiently extract machine-readable knowledge for training and testing purposes. A crucial aspect in this regard is creating large, often manually or semi-manually annotated, corpora vital for developing effective and efficient methods for tasks like relation extraction, topic recognition, and entity linking. However, manual annotation is expensive and time-consuming, especially if not assisted by interactive, intuitive, and collaborative computer-aided tools. To support healthcare experts in the annotation process and foster annotated corpora creation, we present MetaTron. MetaTron is an open-source and free-to-use web-based annotation tool to annotate biomedical data interactively and collaboratively; it supports both mention-level and document-level annotations, also integrating automatic built-in predictions. Moreover, MetaTron enables relation annotation with the support of ontologies, functionalities often overlooked by off-the-shelf annotation tools. RESULTS We conducted a qualitative analysis to compare MetaTron with a set of manual annotation tools including TeamTat, INCEpTION, LightTag, MedTAG, and brat, on three sets of criteria: technical, data, and functional. A quantitative evaluation allowed us to assess MetaTron's performance in terms of time and number of clicks to annotate a set of documents. The results indicated that MetaTron fulfills almost all the selected criteria and achieves the best performance. CONCLUSIONS MetaTron stands out as one of the few annotation tools targeting the biomedical domain that supports the annotation of relations and is fully customizable with documents in several formats (PDF included), as well as abstracts retrieved from PubMed, Semantic Scholar, and OpenAIRE. To meet any user need, we released MetaTron both as an online instance and as a locally deployable Docker image.
Affiliation(s)
- Ornella Irrera
  - Department of Information Engineering, University of Padova, Padua, Italy
- Stefano Marchesin
  - Department of Information Engineering, University of Padova, Padua, Italy
- Gianmaria Silvello
  - Department of Information Engineering, University of Padova, Padua, Italy
2
Liu L, Perez-Concha O, Nguyen A, Bennett V, Blake V, Gallego B, Jorm L. Web-Based Application Based on Human-in-the-Loop Deep Learning for Deidentifying Free-Text Data in Electronic Medical Records: Development and Usability Study. Interact J Med Res 2023; 12:e46322. [PMID: 37624624 PMCID: PMC10492176 DOI: 10.2196/46322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Revised: 05/31/2023] [Accepted: 07/24/2023] [Indexed: 08/26/2023] Open
Abstract
BACKGROUND The narrative free-text data in electronic medical records (EMRs) contain valuable clinical information for analysis and research to inform better patient care. However, the release of free text for secondary use is hindered by concerns surrounding personally identifiable information (PII), as protecting individuals' privacy is paramount. Therefore, it is necessary to deidentify free text to remove PII. Manual deidentification is a time-consuming and labor-intensive process. Numerous automated deidentification approaches and systems have been attempted to overcome this challenge over the past decade. OBJECTIVE We sought to develop an accurate, web-based system deidentifying free text (DEFT), which can be readily and easily adopted in real-world settings for deidentification of free text in EMRs. The system has several key features including a simple and task-focused web user interface, customized PII types, use of a state-of-the-art deep learning model for tagging PII from free text, preannotation by an interactive learning loop, rapid manual annotation with autosave, support for project management and team collaboration, user access control, and central data storage. METHODS DEFT comprises frontend and backend modules and communicates with central data storage through a filesystem path access. The frontend web user interface provides end users with a user-friendly workspace for managing and annotating free text. The backend module processes the requests from the frontend and performs relevant persistence operations. DEFT manages the deidentification workflow as a project, which can contain one or more data sets. Customized PII types and user access control can also be configured. The deep learning model is based on a Bidirectional Long Short-Term Memory-Conditional Random Field (BiLSTM-CRF) with RoBERTa as the word embedding layer. 
The interactive learning loop is further integrated into DEFT to speed up the deidentification process and increase its performance over time. RESULTS DEFT has many advantages over existing deidentification systems in terms of its support for project management, user access control, data management, and an interactive learning process. Experimental results from DEFT on the 2014 i2b2 data set obtained the highest performance compared to 5 benchmark models in terms of microaverage strict entity-level recall and F1-scores of 0.9563 and 0.9627, respectively. In a real-world use case of deidentifying clinical notes, extracted from 1 referral hospital in Sydney, New South Wales, Australia, DEFT achieved a high microaverage strict entity-level F1-score of 0.9507 on a corpus of 600 annotated clinical notes. Moreover, the manual annotation process with preannotation demonstrated a 43% increase in work efficiency compared to the process without preannotation. CONCLUSIONS DEFT is designed for health domain researchers and data custodians to easily deidentify free text in EMRs. DEFT supports an interactive learning loop and end users with minimal technical knowledge can perform the deidentification work with only a shallow learning curve.
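The abstract above reports microaverage strict entity-level recall and F1-scores but does not show how such scores are computed. The following is a minimal sketch under the common definition, not the authors' evaluation script: an entity counts as correct only when document, span boundaries, and PII type all match exactly, and counts are pooled over all entities before computing the ratios. The `Entity` type, the function name, and the example spans are hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple


class Entity(NamedTuple):
    """A hypothetical annotated span: document id, character offsets, PII type."""
    doc_id: str
    start: int
    end: int
    label: str


def micro_strict_prf(gold: Iterable[Entity],
                     pred: Iterable[Entity]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 with strict (exact) matching.

    Strict: a prediction is a true positive only if doc_id, start, end, and
    label are all identical to a gold entity.
    Micro-average: true/false positives and false negatives are pooled over
    every entity type and document before the ratios are taken.
    """
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)   # exact matches
    fp = len(pred_set - gold_set)   # predicted but not in gold
    fn = len(gold_set - pred_set)   # in gold but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Illustrative data only: four gold entities, three predictions.
gold = [
    Entity("note1", 0, 10, "NAME"),
    Entity("note1", 25, 35, "DATE"),
    Entity("note2", 5, 12, "NAME"),
    Entity("note2", 40, 52, "LOCATION"),
]
pred = [
    Entity("note1", 0, 10, "NAME"),   # exact match
    Entity("note1", 25, 34, "DATE"),  # boundary off by one: wrong under strict matching
    Entity("note2", 5, 12, "NAME"),   # exact match
]

precision, recall, f1 = micro_strict_prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Because the off-by-one span counts as both a false positive and a false negative, strict matching penalizes boundary errors twice, which is why strict scores are usually lower than relaxed (overlap-based) ones on the same predictions.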
Affiliation(s)
- Leibo Liu
  - Centre for Big Data Research in Health, University of New South Wales, Sydney, Australia
- Oscar Perez-Concha
  - Centre for Big Data Research in Health, University of New South Wales, Sydney, Australia
- Anthony Nguyen
  - Australian e-Health Research Centre (AEHRC), Commonwealth Scientific and Industrial Research Organisation (CSIRO), Brisbane, Australia
- Vicki Bennett
  - Metadata, Information Management and Classifications Unit (MIMCU), Australian Institute of Health and Welfare, Canberra, Australia
- Victoria Blake
  - Eastern Heart Clinic, Prince of Wales Hospital, Randwick, Australia
- Blanca Gallego
  - Centre for Big Data Research in Health, University of New South Wales, Sydney, Australia
- Louisa Jorm
  - Centre for Big Data Research in Health, University of New South Wales, Sydney, Australia
3
Contreras Hernández S, Tzili Cruz MP, Espínola Sánchez JM, Pérez Tzili A. Deep Learning Model for COVID-19 Sentiment Analysis on Twitter. New Generation Computing 2023; 41:189-212. [PMID: 37229180 PMCID: PMC10010651 DOI: 10.1007/s00354-023-00209-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2022] [Accepted: 02/23/2023] [Indexed: 05/27/2023]
Abstract
The COVID-19 pandemic impacted the mood of the people, and this was evident on social networks. These common user publications are a source of information to measure the population's opinion on social phenomena. In particular, the Twitter network represents a resource of great value due to the amount of information, the geographical distribution of the publications and the openness to dispose of them. This work presents a study on the feelings of the population in Mexico during one of the waves that produced the most contagion and deaths in this country. A mixed, semi-supervised approach was used, with a lexical-based data labeling technique to later bring these data to a pre-trained model of Transformers completely in Spanish. Two Spanish-language models were trained by adding to the Transformers neural network the adjustment for the sentiment analysis task specifically on COVID-19. In addition, ten other multilanguage Transformer models including the Spanish language were trained with the same data set and parameters to compare their performance. In addition, other classifiers with the same data set were used for training and testing, such as Support Vector Machines, Naive Bayes, Logistic Regression, and Decision Trees. These performances were compared with the exclusive model in Spanish based on Transformers, which had higher precision. Finally, this model was used, developed exclusively based on the Spanish language, with new data, to measure the sentiment about COVID-19 of the Twitter community in Mexico.
Affiliation(s)
- Salvador Contreras Hernández
  - Department of Informatics, Universidad Politécnica del Valle de México, 54910 Tultitlán Estado de México, Mexico
- María Patricia Tzili Cruz
  - Department of Informatics, Universidad Politécnica del Valle de México, 54910 Tultitlán Estado de México, Mexico
- José Martín Espínola Sánchez
  - Department of Informatics, Universidad Politécnica del Valle de México, 54910 Tultitlán Estado de México, Mexico
4
Chitwood DG, Wang Q, Klaubert SR, Green K, Wu CH, Harcum SW, Saski CA. Microevolutionary dynamics of eccDNA in Chinese hamster ovary cells grown in fed-batch cultures under control and lactate-stressed conditions. Sci Rep 2023; 13:1200. [PMID: 36681715 PMCID: PMC9862248 DOI: 10.1038/s41598-023-27962-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2022] [Accepted: 01/10/2023] [Indexed: 01/22/2023] Open
Abstract
Chinese hamster ovary (CHO) cell lines are widely used to manufacture biopharmaceuticals. However, CHO cells are not an optimal expression host due to the intrinsic plasticity of the CHO genome. Genome plasticity can lead to chromosomal rearrangements, transgene exclusion, and phenotypic drift. A poorly understood genomic element of CHO cell line instability is extrachromosomal circular DNA (eccDNA) in gene expression and regulation. EccDNA can facilitate ultra-high gene expression and are found within many eukaryotes including humans, yeast, and plants. EccDNA confers genetic heterogeneity, providing selective advantages to individual cells in response to dynamic environments. In CHO cell cultures, maintaining genetic homogeneity is critical to ensuring consistent productivity and product quality. Understanding eccDNA structure, function, and microevolutionary dynamics under various culture conditions could reveal potential engineering targets for cell line optimization. In this study, eccDNA sequences were investigated at the beginning and end of two-week fed-batch cultures in an ambr®250 bioreactor under control and lactate-stressed conditions. This work characterized structure and function of eccDNA in a CHO-K1 clone. Gene annotation identified 1551 unique eccDNA genes including cancer driver genes and genes involved in protein production. Furthermore, RNA-seq data is integrated to identify transcriptionally active eccDNA genes.
Affiliation(s)
- Dylan G Chitwood
  - Department of Bioengineering, Clemson University, Clemson, SC, USA
- Qinghua Wang
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
- Stephanie R Klaubert
  - Department of Chemical and Biomolecular Engineering, Clemson University, Clemson, SC, USA
- Kiana Green
  - Department of Biological Sciences, University of South Carolina, Columbia, SC, USA
- Cathy H Wu
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
- Sarah W Harcum
  - Department of Bioengineering, Clemson University, Clemson, SC, USA
  - Department of Chemical and Biomolecular Engineering, Clemson University, Clemson, SC, USA
- Christopher A Saski
  - Department of Plant and Environmental Sciences, Clemson University, Clemson, SC, USA
5
Li X, Tang X, Lu W. Tracking biomedical articles along the translational continuum: a measure based on biomedical knowledge representation. Scientometrics 2023; 128:1295-1319. [PMID: 36570779 PMCID: PMC9758472 DOI: 10.1007/s11192-022-04607-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2022] [Accepted: 11/28/2022] [Indexed: 12/23/2022]
Abstract
Keeping track of translational research is essential to evaluating the performance of programs on translational medicine. Despite several indicators in previous studies, a consensus measure is still needed to represent the translational features of biomedical research at the article level. In this study, we first trained semantic representations of biomedical entities and documents (i.e., bio-entity2vec and bio-doc2vec) based on over 30 million PubMed articles. With these vectors, we then developed a new measure called Translational Progression (TP) for tracking biomedical articles along the translational continuum. We validated the effectiveness of TP from two perspectives (Clinical trial phase identification and ACH classification), which showed excellent consistency between TP and other indicators. Meanwhile, TP has several advantages. First, it can track the degree of translation of biomedical research dynamically and in real-time. Second, it is straightforward to interpret and operationalize. Third, it doesn't require labor-intensive MeSH labeling and it is suitable for big scholarly data as well as papers that are not indexed in PubMed. In addition, we examined the translational progressions of biomedical research from three dimensions (including overall distribution, time, and research topic), which revealed three significant findings. The proposed measure in this study could be used by policymakers to monitor biomedical research with high translational potential in real-time and make better decisions. It can also be adopted and improved for other domains, such as physics or computer science, to assess the application value of scientific discoveries.
Affiliation(s)
- Xin Li
  - School of Medicine and Health Management, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430030 Hubei, China
- Xuli Tang
  - School of Information Management, Central China Normal University, Wuhan, 430079 Hubei, China
- Wei Lu
  - School of Information Management, Wuhan University, Wuhan, 430072 Hubei, China
6
Dinh TT, Vo-Chanh TP, Nguyen C, Huynh VQ, Vo N, Nguyen HD. Extract antibody and antigen names from biomedical literature. BMC Bioinformatics 2022; 23:524. [PMID: 36474140 PMCID: PMC9727932 DOI: 10.1186/s12859-022-04993-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2022] [Accepted: 10/18/2022] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND The roles of antibody and antigen are indispensable in targeted diagnosis, therapy, and biomedical discovery. On top of that, massive numbers of new scientific articles about antibodies and/or antigens are published each year, which is a precious knowledge resource that has yet to be exploited to its full potential. We, therefore, aim to develop a biomedical natural language processing tool that can automatically identify antibody and antigen entities from articles. RESULTS We first annotated an antibody-antigen corpus including 3210 relevant PubMed abstracts using a semi-automatic approach. The Inter-Annotator Agreement score of 3 annotators ranges from 91.46 to 94.31%, indicating that the annotations are consistent and the corpus is reliable. We then used the corpus to develop and optimize BiLSTM-CRF-based and BioBERT-based models. The models achieved overall F1 scores of 62.49% and 81.44%, respectively, which showed potential for newly studied entities. The two models served as the foundation for developing a named entity recognition (NER) tool that automatically recognizes antibody and antigen names from biomedical literature. CONCLUSIONS Our antibody-antigen NER models enable users to automatically extract antibody and antigen names from scientific articles without manually scanning through vast amounts of data and information in the literature. The output of NER can be used to automatically populate antibody-antigen databases, support antibody validation, and help researchers find the most appropriate antibodies of interest. The packaged NER model is available at https://github.com/TrangDinh44/ABAG_BioBERT.git.
Affiliation(s)
- Thuy Trang Dinh
  - Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
  - Vietnam National University, Ho Chi Minh City, Vietnam
- Trang Phuong Vo-Chanh
  - Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
  - Vietnam National University, Ho Chi Minh City, Vietnam
- Chau Nguyen
  - Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
  - Vietnam National University, Ho Chi Minh City, Vietnam
- Viet Quoc Huynh
  - Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
  - Vietnam National University, Ho Chi Minh City, Vietnam
- Nam Vo
  - Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
  - Vietnam National University, Ho Chi Minh City, Vietnam
  - Laboratory of Molecular Biotechnology, University of Science, Ho Chi Minh City, Vietnam
- Hoang Duc Nguyen
  - Center for Bioscience and Biotechnology, University of Science, Ho Chi Minh City, Vietnam
  - Vietnam National University, Ho Chi Minh City, Vietnam
7
Allahgholi M, Rahmani H, Javdani D, Sadeghi-Adl Z, Bender A, Módos D, Weiss G. DDREL: From drug-drug relationships to drug repurposing. Intell Data Anal 2022. [DOI: 10.3233/ida-215745] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/11/2022]
Abstract
Analyzing the relationships among various drugs is an essential issue in the field of computational biology. Different kinds of informative knowledge, such as drug repurposing, can be extracted from drug-drug relationships. Scientific literature represents a rich source for the retrieval of knowledge about the relationships between biological concepts, mainly drug-drug, disease-disease, and drug-disease relationships. In this paper, we propose DDREL as a general-purpose method that applies deep learning on scientific literature to automatically extract the graph of syntactic and semantic relationships among drugs. DDREL remarkably outperforms the existing human drug network method and a random network with respect to the average similarities of drugs' anatomical therapeutic chemical (ATC) codes. DDREL is able to shed light on the existing deficiency of the ATC codes in various drug groups. From the DDREL graph, the history of drug discovery became visible. In addition, drugs that had a repurposing score of 1 (diflunisal, pargyline, fenofibrate, guanfacine, chlorzoxazone, doxazosin, oxymetholone, azathioprine, drotaverine, demecarium, omifensine, yohimbine) were already used in additional indications. The proposed DDREL method justifies the predictive power of textual data in PubMed abstracts. DDREL shows that such data can be used to (1) predict repurposing drugs with high accuracy, and (2) reveal existing deficiencies of the ATC codes in various drug groups.
Affiliation(s)
- Milad Allahgholi
  - School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
- Hossein Rahmani
  - School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
- Delaram Javdani
  - School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
- Zahra Sadeghi-Adl
  - School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
- Andreas Bender
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
- Dezsö Módos
  - Quadram Institute Bioscience, Norwich Research Park, Norwich, Norfolk, UK
  - Earlham Institute, Norwich Research Park, Norwich, Norfolk, UK
- Gerhard Weiss
  - Department of Data Science and Knowledge Engineering (DKE), Maastricht University, Maastricht, The Netherlands
8
Giachelle F, Irrera O, Silvello G. MedTAG: a portable and customizable annotation tool for biomedical documents. BMC Med Inform Decis Mak 2021; 21:352. [PMID: 34922517 PMCID: PMC8684237 DOI: 10.1186/s12911-021-01706-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/27/2021] [Accepted: 12/01/2021] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND Semantic annotators and Natural Language Processing (NLP) methods for Named Entity Recognition and Linking (NER+L) require plenty of training and test data, especially in the biomedical domain. Despite the abundance of unstructured biomedical data, the lack of richly annotated biomedical datasets hinders the further development of NER+L algorithms for any effective secondary use. In addition, manual annotation of biomedical documents performed by physicians and experts is a costly and time-consuming task. To support, organize and speed up the annotation process, we introduce MedTAG, a collaborative biomedical annotation tool that is open-source, platform-independent, and free to use/distribute. RESULTS We present the main features of MedTAG and how it has been employed in the histopathology domain by physicians and experts to annotate more than seven thousand clinical reports manually. We compare MedTAG with a set of well-established biomedical annotation tools, including BioQRator, ezTag, MyMiner, and tagtog, comparing their pros and cons with those of MedTAG. We highlight that MedTAG is one of the very few open-source tools provided with an open license and a straightforward installation procedure supporting cross-platform use. CONCLUSIONS MedTAG has been designed according to five requirements (i.e. available, distributable, installable, workable and schematic) defined in a recent extensive review of manual annotation tools. Moreover, MedTAG satisfies 20 of the 22 criteria specified in the same study.
Affiliation(s)
- Fabio Giachelle
  - Department of Information Engineering, University of Padua, Padua, Italy
- Ornella Irrera
  - Department of Information Engineering, University of Padua, Padua, Italy
- Gianmaria Silvello
  - Department of Information Engineering, University of Padua, Padua, Italy
9
Grissette H, Nfaoui EH. Affective Concept-Based Encoding of Patient Narratives via Sentic Computing and Neural Networks. Cognit Comput 2021; 14:274-299. [PMID: 34422122 PMCID: PMC8371039 DOI: 10.1007/s12559-021-09903-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2020] [Accepted: 06/23/2021] [Indexed: 11/30/2022]
Abstract
The automatic generation of features without human intervention is the most critical task for biomedical sentiment analysis. Regarding the high dynamicity of shared patient narrative data, the lack of formal medical language sentiment dictionaries prevents retrieval of the appropriate sentiment, which is unapproachable and can be prone to annotator bias. We propose a novel affective biomedical concept-based encoding via sentic computing and neural networks. The main contributions include four aspects. First, a biomedical embedding, in which a medical entity is defined, normalized, and synthesized from a text, is built using online patient narratives after being combined with label propagation from a widely used comprehensive biomedical vocabulary. Second, considering the dependence on biomedical definitions, drug reaction sample selection based on general matching is suggested. These feature settings are then used to build and recognize affective semantics and sentics based on an extreme learning machine. Finally, a semisupervised LSTM-BiLSTM model for biomedical sentiment analysis is constructed. There was a massive influx of patient self-reports related to the COVID-19 pandemic. A study was conducted in this direction, and we tested the validity, medical language familiarity, and transferability of our approach by analyzing millions of COVID-19 tweets. Comparisons to affective lexicons also indicate that integrating extreme learning machine cognitive capabilities has advantages over biomedical sentiment analysis. By considering sentics vectors on top of the formed embeddings, our semisupervised LSTM-BiLSTM achieved an accuracy of 87.5%. The evaluations of unsupervised learning approximated the results of the previous model when dealing with a serious loss of biomedical data. 
In this paper, we demonstrate the effectiveness of integrating deep-learning-based cognitive capabilities for both enhancing distributed biomedical definitions and inferring sentiment compositions from many patient self-reports on social networks. The relevant encoding of affective information conveyed regarding medication subjects clearly reveals defined roles and expectations that can have a positive impact on public health.
Affiliation(s)
- Hanane Grissette
  - LISAC Laboratory, Faculty of Sciences Dhar EL Mahraz, Sidi Mohamed Ben Abdellah University, Fez, Morocco
- El Habib Nfaoui
  - LISAC Laboratory, Faculty of Sciences Dhar EL Mahraz, Sidi Mohamed Ben Abdellah University, Fez, Morocco
10
Dobbie S, Strafford H, Pickrell WO, Fonferko-Shadrach B, Jones C, Akbari A, Thompson S, Lacey A. Markup: A Web-Based Annotation Tool Powered by Active Learning. Front Digit Health 2021; 3:598916. [PMID: 34713086 PMCID: PMC8521860 DOI: 10.3389/fdgth.2021.598916] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/25/2020] [Accepted: 06/16/2021] [Indexed: 11/13/2022] Open
Abstract
Across various domains, such as health and social care, law, news, and social media, there are increasing quantities of unstructured texts being produced. These potential data sources often contain rich information that could be used for domain-specific and research purposes. However, the unstructured nature of free-text data poses a significant challenge for its utilisation due to the necessity of substantial manual intervention from domain-experts to label embedded information. Annotation tools can assist with this process by providing functionality that enables the accurate capture and transformation of unstructured texts into structured annotations, which can be used individually, or as part of larger Natural Language Processing (NLP) pipelines. We present Markup (https://www.getmarkup.com/) an open-source, web-based annotation tool that is undergoing continued development for use across all domains. Markup incorporates NLP and Active Learning (AL) technologies to enable rapid and accurate annotation using custom user configurations, predictive annotation suggestions, and automated mapping suggestions to both domain-specific ontologies, such as the Unified Medical Language System (UMLS), and custom, user-defined ontologies. We demonstrate a real-world use case of how Markup has been used in a healthcare setting to annotate structured information from unstructured clinic letters, where captured annotations were used to build and test NLP applications.
Affiliation(s)
- Samuel Dobbie
  - Health Data Research UK, Swansea University Medical School, Swansea University, Swansea, United Kingdom
  - Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Huw Strafford
  - Health Data Research UK, Swansea University Medical School, Swansea University, Swansea, United Kingdom
  - Swansea University Medical School, Swansea University, Swansea, United Kingdom
- W. Owen Pickrell
  - Swansea University Medical School, Swansea University, Swansea, United Kingdom
  - Neurology Department, Morriston Hospital, Swansea Bay University Health Board, Swansea, United Kingdom
- Carys Jones
  - Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Ashley Akbari
  - Health Data Research UK, Swansea University Medical School, Swansea University, Swansea, United Kingdom
  - Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Simon Thompson
  - Health Data Research UK, Swansea University Medical School, Swansea University, Swansea, United Kingdom
  - Swansea University Medical School, Swansea University, Swansea, United Kingdom
- Arron Lacey
  - Health Data Research UK, Swansea University Medical School, Swansea University, Swansea, United Kingdom
  - Swansea University Medical School, Swansea University, Swansea, United Kingdom
11
Hobbs ET, Goralski SM, Mitchell A, Simpson A, Leka D, Kotey E, Sekira M, Munro JB, Nadendla S, Jackson R, Gonzalez-Aguirre A, Krallinger M, Giglio M, Erill I. ECO-CollecTF: A Corpus of Annotated Evidence-Based Assertions in Biomedical Manuscripts. Front Res Metr Anal 2021; 6:674205. [PMID: 34327299 PMCID: PMC8313968 DOI: 10.3389/frma.2021.674205] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/28/2021] [Accepted: 06/28/2021] [Indexed: 11/20/2022] Open
Abstract
Analysis of high-throughput experiments in the life sciences frequently relies upon standardized information about genes, gene products, and other biological entities. To provide this information, expert curators are increasingly relying on text mining tools to identify, extract and harmonize statements from biomedical journal articles that discuss findings of interest. For determining reliability of the statements, curators need the evidence used by the authors to support their assertions. It is important to annotate the evidence directly used by authors to qualify their findings rather than simply annotating mentions of experimental methods without the context of what findings they support. Text mining tools require tuning and adaptation to achieve accurate performance. Many annotated corpora exist to enable developing and tuning text mining tools; however, none currently provides annotations of evidence based on the extensive and widely used Evidence and Conclusion Ontology. We present the ECO-CollecTF corpus, a novel, freely available, biomedical corpus of 84 documents that captures high-quality, evidence-based statements annotated with the Evidence and Conclusion Ontology.
Affiliation(s)
- Elizabeth T Hobbs
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Stephen M Goralski
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Ashley Mitchell
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Andrew Simpson
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Dorjan Leka
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Emmanuel Kotey
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Matt Sekira
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- James B Munro
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Suvarna Nadendla
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Rebecca Jackson
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Martin Krallinger
- Barcelona Supercomputing Center (BSC), Barcelona, Spain; Centro Nacional de Investigaciones Oncológicas (CNIO), Madrid, Spain
- Michelle Giglio
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Ivan Erill
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
|
12
|
Maadi M, Akbarzadeh Khorshidi H, Aickelin U. A Review on Human-AI Interaction in Machine Learning and Insights for Medical Applications. International Journal of Environmental Research and Public Health 2021; 18:ijerph18042121. [PMID: 33671609 PMCID: PMC7926732 DOI: 10.3390/ijerph18042121] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 12/30/2020] [Revised: 02/08/2021] [Accepted: 02/12/2021] [Indexed: 11/19/2022]
Abstract
Objective: To provide a human–Artificial Intelligence (AI) interaction review for Machine Learning (ML) applications to inform how to best combine both human domain expertise and computational power of ML methods. The review focuses on the medical field, as the medical ML application literature highlights a special necessity of medical experts collaborating with ML approaches. Methods: A scoping literature review is performed on Scopus and Google Scholar using the terms “human in the loop”, “human in the loop machine learning”, and “interactive machine learning”. Peer-reviewed papers published from 2015 to 2020 are included in our review. Results: We design four questions to investigate and describe human–AI interaction in ML applications. These questions are “Why should humans be in the loop?”, “Where does human–AI interaction occur in the ML processes?”, “Who are the humans in the loop?”, and “How do humans interact with ML in Human-In-the-Loop ML (HILML)?”. To answer the first question, we describe three main reasons regarding the importance of human involvement in ML applications. To address the second question, human–AI interaction is investigated in three main algorithmic stages: 1. data producing and pre-processing; 2. ML modelling; and 3. ML evaluation and refinement. The importance of the expertise level of the humans in human–AI interaction is described to answer the third question. The number of human interactions in HILML is grouped into three categories to address the fourth question. We conclude the paper by offering a discussion on open opportunities for future research in HILML.
|
13
|
Neves M, Ševa J. An extensive review of tools for manual annotation of documents. Brief Bioinform 2021; 22:146-163. [PMID: 31838514 PMCID: PMC7820865 DOI: 10.1093/bib/bbz130] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 10/05/2019] [Indexed: 12/16/2022] Open
Abstract
MOTIVATION Annotation tools are applied to build training and test corpora, which are essential for the development and evaluation of new natural language processing algorithms. Further, annotation tools are also used to extract new information for a particular use case. However, owing to the high number of existing annotation tools, finding the one that best fits particular needs is a demanding task that requires searching the scientific literature followed by installing and trying various tools. METHODS We searched for annotation tools and selected a subset of them according to five requirements with which they should comply, such as being Web-based or supporting the definition of a schema. We installed the selected tools (when necessary), carried out hands-on experiments and evaluated them using 26 criteria that covered functional and technical aspects. We defined each criterion on three levels of matches and a score for the final evaluation of the tools. RESULTS We evaluated 78 tools and selected the following 15 for a detailed evaluation: BioQRator, brat, Catma, Djangology, ezTag, FLAT, LightTag, MAT, MyMiner, PDFAnno, prodigy, tagtog, TextAE, WAT-SL and WebAnno. Full compliance with our 26 criteria ranged from only 9 up to 20 criteria, which demonstrated that some tools are comprehensive and mature enough to be used on most annotation projects. The highest score of 0.81 was obtained by WebAnno (of a maximum value of 1.0).
Affiliation(s)
- Mariana Neves
- German Centre for the Protection of Laboratory Animals (BfR), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Jurica Ševa
- German Centre for the Protection of Laboratory Animals (BfR), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
|
14
|
Rahman P, Nandi A, Hebert C. Amplifying Domain Expertise in Clinical Data Pipelines. JMIR Med Inform 2020; 8:e19612. [PMID: 33151150 PMCID: PMC7677017 DOI: 10.2196/19612] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/24/2020] [Revised: 07/07/2020] [Accepted: 07/22/2020] [Indexed: 11/28/2022] Open
Abstract
Digitization of health records has allowed the health care domain to adopt data-driven algorithms for decision support. There are multiple people involved in this process: a data engineer who processes and restructures the data, a data scientist who develops statistical models, and a domain expert who informs the design of the data pipeline and consumes its results for decision support. Although there are multiple data interaction tools for data scientists, few exist to allow domain experts to interact with data meaningfully. Designing systems for domain experts requires careful thought because they have different needs and characteristics from other end users. There should be an increased emphasis on the system to optimize the experts' interaction by directing them to high-impact data tasks and reducing the total task completion time. We refer to this optimization as amplifying domain expertise. Although there is active research in making machine learning models more explainable and usable, it focuses on the final outputs of the model. However, in the clinical domain, expert involvement is needed at every pipeline step: curation, cleaning, and analysis. To this end, we review literature from the database, human-computer information, and visualization communities to demonstrate the challenges and solutions at each of the data pipeline stages. Next, we present a taxonomy of expertise amplification, which can be applied when building systems for domain experts. This includes summarization, guidance, interaction, and acceleration. Finally, we demonstrate the use of our taxonomy with a case study.
Affiliation(s)
- Arnab Nandi
- The Ohio State University, Columbus, OH, United States
|
15
|
Islamaj R, Kwon D, Kim S, Lu Z. TeamTat: a collaborative text annotation tool. Nucleic Acids Res 2020; 48:W5-W11. [PMID: 32383756 DOI: 10.1093/nar/gkaa333] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 03/03/2020] [Revised: 04/16/2020] [Accepted: 04/22/2020] [Indexed: 12/20/2022] Open
Abstract
Manually annotated data is key to developing text-mining and information-extraction algorithms. However, human annotation requires considerable time, effort and expertise. Given the rapid growth of biomedical literature, it is paramount to build tools that facilitate speed and maintain expert quality. While existing text annotation tools may provide user-friendly interfaces to domain experts, limited support is available for figure display, project management, and multi-user team annotation. In response, we developed TeamTat (https://www.teamtat.org), a web-based annotation tool (local setup available), equipped to manage team annotation projects engagingly and efficiently. TeamTat is a novel tool for managing multi-user, multi-label document annotation, reflecting the entire production life cycle. Project managers can specify annotation schema for entities and relations and select annotator(s) and distribute documents anonymously to prevent bias. Document input format can be plain text, PDF or BioC (uploaded locally or automatically retrieved from PubMed/PMC), and output format is BioC with inline annotations. TeamTat displays figures from the full text for the annotator's convenience. Multiple users can work on the same document independently in their workspaces, and the team manager can track task completion. TeamTat provides corpus quality assessment via inter-annotator agreement statistics, and a user-friendly interface convenient for annotation review and inter-annotator disagreement resolution to improve corpus quality.
Affiliation(s)
- Rezarta Islamaj
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Dongseop Kwon
- School of Software Convergence, Myongji University, Seoul 03674, South Korea
- Sun Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
|
16
|
ADDI: Recommending alternatives for drug-drug interactions with negative health effects. Comput Biol Med 2020; 125:103969. [PMID: 32836102 DOI: 10.1016/j.compbiomed.2020.103969] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/21/2020] [Revised: 08/09/2020] [Accepted: 08/09/2020] [Indexed: 11/21/2022]
Abstract
Investigating the interactions among various drugs is an indispensable issue in the field of computational biology. Scientific literature represents a rich source for the retrieval of knowledge about the interactions between drugs. Predicting drug-drug interaction (DDI) types will help biologists to evade hazardous drug interactions and support them in discovering potential alternatives that increase therapeutic efficacy and reduce toxicity. In this paper, we propose a general-purpose method called ADDI (standing for Alternative Drug-Drug Interaction) that applies deep learning on PubMed abstracts to predict interaction types among drugs. As an application, ADDI recommends alternatives for drug-drug interactions (DDIs) which have Negative Health Effects Types (NHETs). ADDI clearly outperforms state-of-the-art methods, on average by 13%, with respect to accuracy by using only the textual content of the online PubMed papers. Additionally, manual evaluation of ADDI indicates high precision in recommending alternatives for DDIs with NHETs.
|
17
|
Garcia-Pelaez J, Rodriguez D, Medina-Molina R, Garcia-Rivas G, Jerjes-Sánchez C, Trevino V. PubTerm: a web tool for organizing, annotating and curating genes, diseases, molecules and other concepts from PubMed records. Database: The Journal of Biological Databases and Curation 2019; 2019:5280306. [PMID: 30624653 PMCID: PMC6323318 DOI: 10.1093/database/bay137] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 10/14/2018] [Accepted: 12/02/2018] [Indexed: 11/13/2022]
Abstract
Background and objective: Analysis, annotation and curation of biomedical scientific literature is a recurrent task in biomedical research, database curation and clinics. Commonly, the reading is centered on concepts such as genes, diseases or molecules. Database curators may also need to annotate published abstracts related to a specific topic. However, few free and intuitive tools exist to assist users in this context. Therefore, we developed PubTerm, a web tool to organize, categorize, curate and annotate a large number of PubMed abstracts related to biological entities such as genes, diseases, chemicals, species, sequence variants and other related information. Methods: A variety of interfaces were implemented to facilitate curation and annotation, including the organization of abstracts by terms, by the co-occurrence of terms or by specific phrases. Information includes statistics on the occurrence of terms. The abstracts, terms and other related information can be annotated and categorized using user-defined categories. The session information can be saved and restored, and the data can be exported to other formats. Results: The pipeline in PubTerm starts by specifying a PubMed query or list of PubMed identifiers. Then, the user can specify three lists of categories and specify what information will be highlighted in which colors. The user then utilizes the 'term view' to organize the abstracts by gene, disease, species or other information to facilitate the annotation and categorization of terms or abstracts. Other views also facilitate the exploration of abstracts and connections between terms. We have used PubTerm to quickly and efficiently curate collections of more than 400 abstracts that mention more than 350 genes to generate revised lists of susceptibility genes for diseases. An example is provided for pulmonary arterial hypertension.
Conclusions: PubTerm saves time for literature revision by assisting with annotation organization and knowledge acquisition.
Affiliation(s)
- José Garcia-Pelaez
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México
- David Rodriguez
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México
- Roberto Medina-Molina
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México
- Gerardo Garcia-Rivas
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México; Centro de Investigación Biomédica, Hospital Zambrano-Hellion, Tec Salud, Tecnologico de Monterrey, Batallón San Patricio 112 Col. Real de San Agustín, San Pedro Garza García, N.L., México
- Carlos Jerjes-Sánchez
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México; Centro de Investigación Biomédica, Hospital Zambrano-Hellion, Tec Salud, Tecnologico de Monterrey, Batallón San Patricio 112 Col. Real de San Agustín, San Pedro Garza García, N.L., México
- Victor Trevino
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México
|