1
Zhang G, Jin Q, Jered McInerney D, Chen Y, Wang F, Cole CL, Yang Q, Wang Y, Malin BA, Peleg M, Wallace BC, Lu Z, Weng C, Peng Y. Leveraging generative AI for clinical evidence synthesis needs to ensure trustworthiness. J Biomed Inform 2024; 153:104640. [PMID: 38608915 PMCID: PMC11217921 DOI: 10.1016/j.jbi.2024.104640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Revised: 04/08/2024] [Accepted: 04/09/2024] [Indexed: 04/14/2024]
Abstract
Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.
Affiliation(s)
- Gongbo Zhang
  - Columbia University, Department of Biomedical Informatics, New York, 10032, USA
- Qiao Jin
  - National Institutes of Health, National Library of Medicine, National Center for Biotechnology Information, Bethesda, 20894, USA
- Yong Chen
  - University of Pennsylvania, Department of Biostatistics, Epidemiology and Informatics, Philadelphia 19104, USA
- Fei Wang
  - Weill Cornell Medicine, Department of Population Health Sciences, New York 10065, USA
  - Weill Cornell Medicine, Institute of AI for Digital Health, New York 10065, USA
- Curtis L Cole
  - Weill Cornell Medicine, Department of Population Health Sciences, New York 10065, USA
  - Weill Cornell Medicine, Department of Medicine, New York 10065, USA
- Qian Yang
  - Cornell University, Computing and Information Science, Ithaca 14853, USA
- Yanshan Wang
  - University of Pittsburgh, Department of Health Information Management, Pittsburgh 15260, USA
- Bradley A Malin
  - Vanderbilt University Medical Center, Department of Biomedical Informatics, Nashville 37203, USA
  - Vanderbilt University Medical Center, Department of Biostatistics, Nashville 37203, USA
  - Vanderbilt University, Department of Computer Science, Nashville 37212, USA
- Mor Peleg
  - University of Haifa, Department of Information Systems, Haifa 3498838, Israel
- Byron C Wallace
  - Northeastern University, the Khoury College of Computer Sciences, Boston 02115, USA
- Zhiyong Lu
  - National Institutes of Health, National Library of Medicine, National Center for Biotechnology Information, Bethesda, 20894, USA
- Chunhua Weng
  - Columbia University, Department of Biomedical Informatics, New York, 10032, USA
- Yifan Peng
  - Weill Cornell Medicine, Department of Population Health Sciences, New York 10065, USA
2
Irrera O, Marchesin S, Silvello G. MetaTron: advancing biomedical annotation empowering relation annotation and collaboration. BMC Bioinformatics 2024; 25:112. [PMID: 38486137 PMCID: PMC10941452 DOI: 10.1186/s12859-024-05730-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 03/04/2024] [Indexed: 03/17/2024] Open
Abstract
BACKGROUND: The constant growth of biomedical data is accompanied by the need for new methodologies to effectively and efficiently extract machine-readable knowledge for training and testing purposes. A crucial aspect in this regard is creating large annotated corpora, often built manually or semi-manually, which are vital for developing effective and efficient methods for tasks like relation extraction, topic recognition, and entity linking. However, manual annotation is expensive and time-consuming, especially if not assisted by interactive, intuitive, and collaborative computer-aided tools. To support healthcare experts in the annotation process and foster annotated corpora creation, we present MetaTron. MetaTron is an open-source and free-to-use web-based annotation tool to annotate biomedical data interactively and collaboratively; it supports both mention-level and document-level annotations and integrates automatic built-in predictions. Moreover, MetaTron enables relation annotation with the support of ontologies, functionalities often overlooked by off-the-shelf annotation tools.
RESULTS: We conducted a qualitative analysis to compare MetaTron with a set of manual annotation tools including TeamTat, INCEpTION, LightTag, MedTAG, and brat, on three sets of criteria: technical, data, and functional. A quantitative evaluation allowed us to assess MetaTron's performance in terms of time and number of clicks to annotate a set of documents. The results indicated that MetaTron fulfills almost all the selected criteria and achieves the best performance.
CONCLUSIONS: MetaTron stands out as one of the few annotation tools targeting the biomedical domain that supports the annotation of relations and is fully customizable with documents in several formats, PDF included, as well as abstracts retrieved from PubMed, Semantic Scholar, and OpenAIRE. To meet any user need, we released MetaTron both as an online instance and as a locally deployable Docker image.
Affiliation(s)
- Ornella Irrera
  - Department of Information Engineering, University of Padova, Padua, Italy
- Stefano Marchesin
  - Department of Information Engineering, University of Padova, Padua, Italy
- Gianmaria Silvello
  - Department of Information Engineering, University of Padova, Padua, Italy
3
Maraver P, Tecuatl C, Ascoli GA. Automatic identification of scientific publications describing digital reconstructions of neural morphology. Brain Inform 2023; 10:23. [PMID: 37684527 PMCID: PMC10491540 DOI: 10.1186/s40708-023-00202-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Accepted: 08/06/2023] [Indexed: 09/10/2023] Open
Abstract
The increasing number of peer-reviewed publications constitutes a challenge for biocuration. For example, NeuroMorpho.Org, a sharing platform for digital reconstructions of neural morphology, must evaluate more than 6000 potentially relevant articles per year to identify data of interest. Here, we describe a tool that uses natural language processing and deep learning to assess the likelihood that a publication is relevant to the project. The tool automatically identifies articles describing digitally reconstructed neural morphologies with high accuracy. Its processing rate of 900 publications per hour is not only amply sufficient to autonomously track new research, but also allowed the successful evaluation of older publications backlogged due to limited human resources. Since the tool was launched, the number of bio-entities found has almost doubled, while manual labor has been greatly reduced. The classification tool is open source, configurable, and simple to use, making it extensible to other biocuration projects.
Affiliation(s)
- Patricia Maraver
  - Bioengineering Department; College of Engineering and Computing, George Mason University, Fairfax, VA, USA
  - Center for Neural Informatics, Structures, & Plasticity; Krasnow Institute for Advanced Study, George Mason University, Fairfax, VA, USA
- Carolina Tecuatl
  - Bioengineering Department; College of Engineering and Computing, George Mason University, Fairfax, VA, USA
  - Center for Neural Informatics, Structures, & Plasticity; Krasnow Institute for Advanced Study, George Mason University, Fairfax, VA, USA
- Giorgio A Ascoli
  - Bioengineering Department; College of Engineering and Computing, George Mason University, Fairfax, VA, USA
  - Center for Neural Informatics, Structures, & Plasticity; Krasnow Institute for Advanced Study, George Mason University, Fairfax, VA, USA
4
Arighi CN. Hagit Shatkay-Reshef 1965-2022. Bioinformatics Advances 2022; 2:vbac012. [PMID: 36699359 PMCID: PMC9710649 DOI: 10.1093/bioadv/vbac012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2022] [Indexed: 01/28/2023]
Affiliation(s)
- Cecilia N Arighi
  - Department of Computer and Information Sciences, Ammon-Pinizzotto Biopharmaceutical Innovation Building, Newark, DE 19713, USA
5
Thielmann A, Weisser C, Krenz A, Säfken B. Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling. J Appl Stat 2021; 50:574-591. [PMID: 36819086 PMCID: PMC9930816 DOI: 10.1080/02664763.2021.1919063] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 09/30/2022]
Abstract
Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually by humans, which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach also either requires human labelling of all of the data or fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. Unsupervised one-class document classification with the integration of out-of-domain training data is achieved and >80% of the target data is correctly classified. The proposed method thus even outperforms common machine learning classifiers and is validated on multiple data sets.
Affiliation(s)
- Anton Thielmann
  - Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany
- Christoph Weisser
  - Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany
  - Campus-Institut Data Science (CIDAS), Göttingen, Germany
- Astrid Krenz
  - Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany
  - Digital Futures at Work Research Centre, University of Sussex, Brighton, UK
- Benjamin Säfken
  - Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany
  - Campus-Institut Data Science (CIDAS), Göttingen, Germany
6
Jiang X, Li P, Kadin J, Blake JA, Ringwald M, Shatkay H. Integrating image caption information into biomedical document classification in support of biocuration. Database: The Journal of Biological Databases and Curation 2021; 2020:5819650. [PMID: 32294192 PMCID: PMC7159034 DOI: 10.1093/database/baaa024] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/31/2019] [Revised: 02/10/2020] [Accepted: 03/11/2020] [Indexed: 01/12/2023]
Abstract
Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation. We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012–2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task. 
Moreover, our classifier’s performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation. Database URL:
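As a quick consistency check on the figures quoted in this abstract, the sketch below recomputes the F-measure from the reported precision and recall and the share of relevant documents from the reported class counts. It assumes the reported f-measure is the balanced F1 (the harmonic mean of precision and recall); the function name `f_measure` is illustrative and is not taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    # Guard against the degenerate case where both inputs are zero.
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Figures quoted in the abstract (already rounded to three decimals there).
precision, recall = 0.698, 0.784
relevant, irrelevant = 5469, 52866

f1 = f_measure(precision, recall)
positive_share = relevant / (relevant + irrelevant)

print(f"F1 recomputed from rounded P and R: {f1:.4f}")
print(f"Share of documents labeled relevant: {positive_share:.1%}")
```

The recomputed value agrees with the reported 0.738 to within rounding of the inputs, and the positive class makes up roughly 9% of the dataset, which is the imbalance the abstract refers to. The Matthews correlation coefficient cannot be recomputed this way, since it needs the full confusion matrix of the evaluation split.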
Affiliation(s)
- Xiangying Jiang
  - The Computational Biomedicine and Machine Learning Lab, Department of Computer & Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716, USA
- Pengyuan Li
  - The Computational Biomedicine and Machine Learning Lab, Department of Computer & Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716, USA
- James Kadin
  - The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA
- Judith A Blake
  - The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA
- Martin Ringwald
  - The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA
- Hagit Shatkay
  - The Computational Biomedicine and Machine Learning Lab, Department of Computer & Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716, USA
7
Nasir IM, Khan MA, Yasmin M, Shah JH, Gabryel M, Scherer R, Damaševičius R. Pearson Correlation-Based Feature Selection for Document Classification Using Balanced Training. Sensors 2020; 20:s20236793. [PMID: 33261136 PMCID: PMC7730850 DOI: 10.3390/s20236793] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 10/27/2020] [Revised: 11/15/2020] [Accepted: 11/25/2020] [Indexed: 11/18/2022]
Abstract
Documents are stored in digital form across several organizations. Printing this volume of data and filing it in folders, instead of storing it digitally, is impractical, uneconomical, and ecologically unsound. An efficient way of retrieving data from digitally stored documents is also required. This article presents a real-time supervised learning technique for document classification based on a deep convolutional neural network (DCNN), which aims to reduce the impact of adverse document image issues such as signatures, marks, logos, and handwritten notes. The proposed technique's major steps include data augmentation, feature extraction using pre-trained neural network models, feature fusion, and feature selection. We propose a novel data augmentation technique, which normalizes the imbalanced dataset using the secondary dataset RVL-CDIP. The DCNN features are extracted using the VGG19 and AlexNet networks. The extracted features are fused, and the fused feature vector is optimized by applying a Pearson correlation coefficient-based technique that selects the most informative features while removing redundant ones. The proposed technique is tested on the Tobacco3482 dataset, where it achieves a classification accuracy of 93.1% using a cubic support vector machine classifier, supporting the validity of the proposed technique.
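The abstract does not spell out how the Pearson correlation coefficient is used to prune the fused feature vector. The sketch below shows one common generic variant, not the authors' procedure: features are visited in order, and a feature is dropped when its absolute correlation with any already-kept feature reaches a threshold. The function names, the greedy ordering, and the 0.95 threshold are all assumptions made for illustration.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant feature carries no correlation signal
    return cov / (norm_x * norm_y)


def select_features(columns: Sequence[Sequence[float]],
                    threshold: float = 0.95) -> List[int]:
    """Greedily keep feature indices that are not redundant with kept ones.

    `columns[j]` holds the values of feature j across all samples.
    The threshold is a hypothetical choice, not a value from the paper.
    """
    kept: List[int] = []
    for j, column in enumerate(columns):
        if all(abs(pearson(column, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept


# Toy fused feature matrix: feature 1 is a scaled copy of feature 0.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, -1.0, 1.0, -1.0],
]
selected = select_features(columns)
print(selected)
```

On the toy matrix the scaled copy is discarded and the two weakly correlated features are retained. A real pipeline would apply the same idea to the concatenated VGG19 and AlexNet feature vectors, most likely with a vectorized correlation routine instead of pure Python loops.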
Affiliation(s)
- Inzamam Mashood Nasir
  - Department of Computer Science, HITEC University, Taxila 47080, Pakistan
- Muhammad Attique Khan
  - Department of Computer Science, HITEC University, Taxila 47080, Pakistan
- Mussarat Yasmin
  - Department of Computer Science, COMSATS University Islamabad, Wah Campus, Wah Cantonment 47040, Pakistan
- Jamal Hussain Shah
  - Department of Computer Science, COMSATS University Islamabad, Wah Campus, Wah Cantonment 47040, Pakistan
- Marcin Gabryel
  - Department of Intelligent Computer Systems, Częstochowa University of Technology, 42-200 Częstochowa, Poland
- Rafał Scherer
  - Department of Intelligent Computer Systems, Częstochowa University of Technology, 42-200 Częstochowa, Poland
- Robertas Damaševičius
  - Faculty of Applied Mathematics, Silesian University of Technology, 44-100 Gliwice, Poland
8
Eicher T, Kinnebrew G, Patt A, Spencer K, Ying K, Ma Q, Machiraju R, Mathé EA. Metabolomics and Multi-Omics Integration: A Survey of Computational Methods and Resources. Metabolites 2020; 10:E202. [PMID: 32429287 PMCID: PMC7281435 DOI: 10.3390/metabo10050202] [Citation(s) in RCA: 61] [Impact Index Per Article: 15.3] [Received: 04/15/2020] [Revised: 05/07/2020] [Accepted: 05/13/2020] [Indexed: 02/06/2023] Open
Abstract
As researchers are increasingly able to collect data on a large scale from multiple clinical and omics modalities, multi-omics integration is becoming a critical component of metabolomics research. This introduces a need for increased understanding by the metabolomics researcher of computational and statistical analysis methods relevant to multi-omics studies. In this review, we discuss common types of analyses performed in multi-omics studies and the computational and statistical methods that can be used for each type of analysis. We pinpoint the caveats and considerations for analysis methods, including required parameters, sample size and data distribution requirements, sources of a priori knowledge, and techniques for the evaluation of model accuracy. Finally, for the types of analyses discussed, we provide examples of the applications of corresponding methods to clinical and basic research. We intend that our review may be used as a guide for metabolomics researchers to choose effective techniques for multi-omics analyses relevant to their field of study.
Affiliation(s)
- Tara Eicher
  - Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
  - Computer Science and Engineering Department, The Ohio State University College of Engineering, Columbus, OH 43210, USA
- Garrett Kinnebrew
  - Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
  - Comprehensive Cancer Center, The Ohio State University and James Cancer Hospital, Columbus, OH 43210, USA
  - Bioinformatics Shared Resource Group, The Ohio State University, Columbus, OH 43210, USA
- Andrew Patt
  - Division of Preclinical Innovation, National Center for Advancing Translational Sciences, NIH, 9800 Medical Center Dr., Rockville, MD, 20892, USA
  - Biomedical Sciences Graduate Program, The Ohio State University, Columbus, OH 43210, USA
- Kyle Spencer
  - Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
  - Biomedical Sciences Graduate Program, The Ohio State University, Columbus, OH 43210, USA
  - Nationwide Children’s Research Hospital, Columbus, OH 43210, USA
- Kevin Ying
  - Comprehensive Cancer Center, The Ohio State University and James Cancer Hospital, Columbus, OH 43210, USA
  - Molecular, Cellular and Developmental Biology Program, The Ohio State University, Columbus, OH 43210, USA
- Qin Ma
  - Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
- Raghu Machiraju
  - Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
  - Computer Science and Engineering Department, The Ohio State University College of Engineering, Columbus, OH 43210, USA
  - Department of Pathology, Wexner Medical Center, The Ohio State University, Columbus, OH 43210, USA
  - Translational Data Analytics Institute, The Ohio State University, Columbus, OH 43210, USA
- Ewy A. Mathé
  - Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
  - Division of Preclinical Innovation, National Center for Advancing Translational Sciences, NIH, 9800 Medical Center Dr., Rockville, MD, 20892, USA
9
Teodoro D, Knafou J, Naderi N, Pasche E, Gobeill J, Arighi CN, Ruch P. UPCLASS: a deep learning-based classifier for UniProtKB entry publications. Database: The Journal of Biological Databases and Curation 2020; 2020:5822772. [PMID: 32367111 PMCID: PMC7198315 DOI: 10.1093/database/baaa026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/31/2019] [Revised: 02/19/2020] [Accepted: 03/11/2020] [Indexed: 12/20/2022]
Abstract
In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession. Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/.
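The abstract reports both a micro and a macro F1-score, which can differ substantially when categories are unevenly represented. The sketch below illustrates the two averages for a multi-label task on made-up data. The category names are borrowed from the abstract's examples, but the gold and predicted label sets are invented for illustration and do not reflect the UPCLASS data.

```python
from typing import Sequence, Set, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 expressed in terms of true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold: Sequence[Set[str]],
                   pred: Sequence[Set[str]],
                   labels: Sequence[str]) -> Tuple[float, float]:
    """Return (micro F1, macro F1) for multi-label predictions."""
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn] per label
    for gold_set, pred_set in zip(gold, pred):
        for label in labels:
            if label in pred_set and label in gold_set:
                counts[label][0] += 1
            elif label in pred_set:
                counts[label][1] += 1
            elif label in gold_set:
                counts[label][2] += 1
    # Macro: average the per-label F1 scores, so each label weighs the same.
    macro = sum(f1_from_counts(*counts[label]) for label in labels) / len(labels)
    # Micro: pool the counts first, so frequent labels dominate.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = f1_from_counts(tp, fp, fn)
    return micro, macro


labels = ["function", "interaction", "expression"]
# Hypothetical gold and predicted category sets for four publications.
gold = [{"function"}, {"function", "interaction"}, {"function"}, {"expression"}]
pred = [{"function"}, {"function"}, {"function"}, {"interaction"}]

micro, macro = micro_macro_f1(gold, pred, labels)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the frequent category is predicted perfectly while the two rare ones are missed, so the micro average comes out well above the macro average. The same direction of gap appears in the reported scores (0.72 micro against 0.62 macro), which is consistent with, though not proof of, weaker performance on less frequent categories.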
Affiliation(s)
- Douglas Teodoro
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Julien Knafou
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Nona Naderi
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Emilie Pasche
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Julien Gobeill
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Cecilia N Arighi
  - Center of Bioinformatics and Computational Biology, 15 Innovation Way, 19711, Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Patrick Ruch
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
10
Arnaboldi V, Raciti D, Van Auken K, Chan JN, Müller HM, Sternberg PW. Text mining meets community curation: a newly designed curation platform to improve author experience and participation at WormBase. Database (Oxford) 2020; 2020:baaa006. [PMID: 32185395 PMCID: PMC7078066 DOI: 10.1093/database/baaa006] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 10/29/2019] [Revised: 01/08/2020] [Accepted: 01/14/2020] [Indexed: 01/17/2023]
Abstract
Biological knowledgebases rely on expert biocuration of the research literature to maintain up-to-date collections of data organized in machine-readable form. To enter information into knowledgebases, curators need to follow three steps: (i) identify papers containing relevant data, a process called triaging; (ii) recognize named entities; and (iii) extract and curate data in accordance with the underlying data models. WormBase (WB), the authoritative repository for research data on Caenorhabditis elegans and other nematodes, uses text mining (TM) to semi-automate its curation pipeline. In addition, WB engages its community, via an Author First Pass (AFP) system, to help recognize entities and classify data types in their recently published papers. In this paper, we present a new WB AFP system that combines TM and AFP into a single application to enhance community curation. The system employs string-searching algorithms and statistical methods (e.g. support vector machines (SVMs)) to extract biological entities and classify data types, and it presents the results to authors in a web form where they validate the extracted information, rather than enter it de novo as the previous form required. With this new system, we lessen the burden for authors, while at the same time receive valuable feedback on the performance of our TM tools. The new user interface also links out to specific structured data submission forms, e.g. for phenotype or expression pattern data, giving the authors the opportunity to contribute a more detailed curation that can be incorporated into WB with minimal curator review. Our approach is generalizable and could be applied to additional knowledgebases that would like to engage their user community in assisting with the curation. In the five months succeeding the launch of the new system, the response rate has been comparable with that of the previous AFP version, but the quality and quantity of the data received has greatly improved.
Affiliation(s)
- Valerio Arnaboldi
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Daniela Raciti
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Kimberly Van Auken
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Juancarlos N Chan
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Hans-Michael Müller
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Paul W Sternberg
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
11
PGxMine: Text mining for curation of PharmGKB. Pacific Symposium on Biocomputing 2020; 25:611-622. [PMID: 31797632 PMCID: PMC6917032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
Precision medicine tailors treatment to individuals' personal data, including differences in their genome. The Pharmacogenomics Knowledgebase (PharmGKB) provides highly curated information on the effect of genetic variation on drug response and side effects for a wide range of drugs. PharmGKB's scientific curators triage, review and annotate a large number of papers each year, but the task is challenging. We present the PGxMine resource, a text-mined resource of pharmacogenomic associations from all accessible published literature to assist in the curation of PharmGKB. We developed a supervised machine learning pipeline to extract associations between a variant (DNA and protein changes, star alleles and dbSNP identifiers) and a chemical. PGxMine covers 452 chemicals and 2,426 variants and contains 19,930 mentions of pharmacogenomic associations across 7,170 papers. An evaluation by PharmGKB curators found that 57 of the top 100 associations not found in PharmGKB led to 83 curatable papers and a further 24 associations would likely lead to curatable papers through citations. The results can be viewed at https://pgxmine.pharmgkb.org/ and code can be downloaded at https://github.com/jakelever/pgxmine.