Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Lee K, Famiglietti ML, McMahon A, Wei CH, MacArthur JAL, Poux S, Breuza L, Bridge A, Cunningham F, Xenarios I, Lu Z. Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLoS Comput Biol 2018;14:e1006390. [PMID: 30102703 PMCID: PMC6107285 DOI: 10.1371/journal.pcbi.1006390] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 04/05/2018] [Revised: 08/23/2018] [Accepted: 07/24/2018] [Indexed: 11/18/2022] Open

Cited by Other Article(s)

Gabud R, Lapitan P, Mariano V, Mendoza E, Pampolina N, Clariño MAA, Batista-Navarro R. Unsupervised literature mining approaches for extracting relationships pertaining to habitats and reproductive conditions of plant species. Front Artif Intell 2024;7:1371411. [PMID: 38845683 PMCID: PMC11153722 DOI: 10.3389/frai.2024.1371411] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Accepted: 05/10/2024] [Indexed: 06/09/2024] Open

Abstract

Introduction

Fine-grained, descriptive information on habitats and reproductive conditions of plant species is crucial in forest restoration and rehabilitation efforts. Precise timing of fruit collection and knowledge of species' habitat preferences and reproductive status are necessary especially for tropical plant species that have short-lived recalcitrant seeds, and those that exhibit complex reproductive patterns, e.g., species with supra-annual mass flowering events that may occur at irregular intervals. Understanding plant regeneration in the way of planning for effective reforestation can be aided by providing access to structured information, e.g., in knowledge bases, that spans years if not decades as well as covering a wide range of geographic locations. The content of such a resource can be enriched with literature-derived information on species' time-sensitive reproductive conditions and location-specific habitats.

Methods

We sought to develop unsupervised approaches to extract relationships pertaining to habitats and their locations, and reproductive conditions of plant species and corresponding temporal information. Firstly, we handcrafted rules for a traditional rule-based pattern matching approach. We then developed a relation extraction approach building upon transformer models, i.e., the Text-to-Text Transfer Transformer (T5), casting the relation extraction problem as a question answering and natural language inference task. We then propose a novel unsupervised hybrid approach that combines our rule-based and transformer-based approaches.

Results

Evaluation of our hybrid approach on an annotated corpus of biodiversity-focused documents demonstrated an improvement of up to 15 percentage points in recall and the best performance over solely rule-based and transformer-based methods, with F1-scores ranging from 89.61% to 96.75% for reproductive condition - temporal expression relations, and from 85.39% to 89.90% for habitat - geographic location relations. Our work shows that even without training models on any domain-specific labeled dataset, we are able to extract relationships between biodiversity concepts from literature with satisfactory performance.

Yousef M, Allmer J. Deep learning in bioinformatics. Turk J Biol 2023;47:366-382. [PMID: 38681776 PMCID: PMC11045206 DOI: 10.55730/1300-0152.2671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Revised: 12/28/2023] [Accepted: 12/18/2023] [Indexed: 05/01/2024] Open

Arabi-Jeshvaghani F, Javadi-Zarnaghi F, Löchel HF, Martin R, Heider D. LAMPPrimerBank, a manually curated database of experimentally validated loop-mediated isothermal amplification primers for detection of respiratory pathogens. Infection 2023;51:1809-1818. [PMID: 37828369 DOI: 10.1007/s15010-023-02100-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Accepted: 09/13/2023] [Indexed: 10/14/2023]

Abstract

PURPOSE AND METHODS

The emergence of coronavirus disease 2019 (COVID-19) has once again affirmed the significant threat of respiratory infections to global public health and the utmost importance of prompt diagnosis in managing and mitigating any pandemic. The nucleic acid amplification test (NAAT) is the primary detection method for most pathogens. Loop-mediated isothermal amplification (LAMP) is a rapid, simple, sensitive, and specific epitome of isothermal NAAT performed using a set of four to six primers. Primer design is a fundamental step in LAMP assays, with several complexities and experimental screening requirements. To address this challenge, an online database is presented here. Its workflow comprises three steps: literature aggregation, data curation, and database and website implementation.

RESULTS

LAMPPrimerBank ( https://lampprimerbank.mathematik.uni-marburg.de ) is a manually curated database dedicated to experimentally validated LAMP primers, their peculiarities of assays, and accompanying literature, with a primary emphasis on respiratory pathogens. LAMPPrimerBank, with its user-friendly web interface and an open application programming interface, enables the accelerated and facile exploration, comparison, and exportation of LAMP primer sequences and their respective information from the massively scattered literature. LAMPPrimerBank currently comprises LAMP primers for diagnosing viral, bacterial, and fungal respiratory pathogens. Additionally, to address the challenge of false-positive results generated by nonspecific amplifications, LAMPPrimerBank computationally predicted and visualized the sizes of LAMP products for recorded primer sets in the database.

CONCLUSION

LAMPPrimerBank, as a pioneering database in the rapidly expanding field of isothermal NAAT, endeavors to confront the two challenges of the LAMP: primer design and discrimination of false-positive results.

Maraver P, Tecuatl C, Ascoli GA. Automatic identification of scientific publications describing digital reconstructions of neural morphology. Brain Inform 2023;10:23. [PMID: 37684527 PMCID: PMC10491540 DOI: 10.1186/s40708-023-00202-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Accepted: 08/06/2023] [Indexed: 09/10/2023] Open

Long E, Wan P, Chen Q, Lu Z, Choi J. From function to translation: Decoding genetic susceptibility to human diseases via artificial intelligence. Cell Genomics 2023;3:100320. [PMID: 37388909 PMCID: PMC10300605 DOI: 10.1016/j.xgen.2023.100320] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 07/01/2023]

Knafou J, Haas Q, Borissov N, Counotte M, Low N, Imeri H, Ipekci AM, Buitrago-Garcia D, Heron L, Amini P, Teodoro D. Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature. Syst Rev 2023;12:94. [PMID: 37277872 DOI: 10.1186/s13643-023-02247-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2022] [Accepted: 04/24/2023] [Indexed: 06/07/2023] Open

Abstract

BACKGROUND

The COVID-19 pandemic has led to an unprecedented amount of scientific publications, growing at a pace never seen before. Multiple living systematic reviews have been developed to assist professionals with up-to-date and trustworthy health information, but it is increasingly challenging for systematic reviewers to keep up with the evidence in electronic databases. We aimed to investigate deep learning-based machine learning algorithms to classify COVID-19-related publications to help scale up the epidemiological curation process.

METHODS

In this retrospective study, five different pre-trained deep learning-based language models were fine-tuned on a dataset of 6365 publications manually classified into two classes, three subclasses, and 22 sub-subclasses relevant for epidemiological triage purposes. In a k-fold cross-validation setting, each standalone model was assessed on a classification task and compared against an ensemble, which takes the standalone model predictions as input and uses different strategies to infer the optimal article class. A ranking task was also considered, in which the model outputs a ranked list of sub-subclasses associated with the article.

RESULTS

The ensemble model significantly outperformed the standalone classifiers, achieving an F1-score of 89.2 at the class level of the classification task. The difference between the standalone and ensemble models increases at the sub-subclass level, where the ensemble reaches a micro F1-score of 70% against 67% for the best-performing standalone model. For the ranking task, the ensemble obtained the highest recall@3, with a performance of 89%. Using a unanimity voting rule, the ensemble can provide predictions with higher confidence on a subset of the data, achieving detection of original papers with an F1-score up to 97% on a subset of 80% of the collection instead of 93% on the whole dataset.

CONCLUSION

This study shows the potential of using deep learning language models to perform triage of COVID-19 references efficiently and support epidemiological curation and review. The ensemble consistently and significantly outperforms any standalone model. Fine-tuning the voting strategy thresholds is an interesting alternative to annotate a subset with higher predictive confidence.
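The unanimity voting rule mentioned in this abstract can be illustrated with a minimal sketch. It assumes each standalone model emits a single class label per article and that the ensemble abstains whenever the models disagree; the function name `unanimity_vote` and the sample labels are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def unanimity_vote(predictions: Sequence[str]) -> Optional[str]:
    """Return the shared label if every model agrees, otherwise None (abstain)."""
    if not predictions:
        return None
    first = predictions[0]
    return first if all(p == first for p in predictions) else None


# Hypothetical labels from five standalone models for three articles.
articles = [
    ["original", "original", "original", "original", "original"],
    ["original", "review", "original", "original", "original"],
    ["review", "review", "review", "review", "review"],
]

decisions = [unanimity_vote(p) for p in articles]

# Fraction of articles on which the ensemble commits to a label.
coverage = sum(d is not None for d in decisions) / len(decisions)
print(decisions, coverage)
```

Abstaining on disagreement trades coverage for confidence, which matches the abstract's report of a higher F1-score on a subset of the collection than on the whole dataset.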

Maraver P, Tecuatl C, Ascoli GA. Automatic identification of scientific publications describing digital reconstructions of neural morphology. bioRxiv 2023:2023.02.14.527522. [PMID: 36824882 PMCID: PMC9949101 DOI: 10.1101/2023.02.14.527522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/17/2023]

Luo L, Wei CH, Lai PT, Chen Q, Islamaj R, Lu Z. Assigning species information to corresponding genes by a sequence labeling framework. Database (Oxford) 2022;2022:6760187. [PMID: 36227127 PMCID: PMC9558450 DOI: 10.1093/database/baac090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2022] [Revised: 08/26/2022] [Accepted: 10/11/2022] [Indexed: 01/24/2023]

de Crécy-Lagard V, Amorin de Hegedus R, Arighi C, Babor J, Bateman A, Blaby I, Blaby-Haas C, Bridge AJ, Burley SK, Cleveland S, Colwell LJ, Conesa A, Dallago C, Danchin A, de Waard A, Deutschbauer A, Dias R, Ding Y, Fang G, Friedberg I, Gerlt J, Goldford J, Gorelik M, Gyori BM, Henry C, Hutinet G, Jaroch M, Karp PD, Kondratova L, Lu Z, Marchler-Bauer A, Martin MJ, McWhite C, Moghe GD, Monaghan P, Morgat A, Mungall CJ, Natale DA, Nelson WC, O’Donoghue S, Orengo C, O’Toole KH, Radivojac P, Reed C, Roberts RJ, Rodionov D, Rodionova IA, Rudolf JD, Saleh L, Sheynkman G, Thibaud-Nissen F, Thomas PD, Uetz P, Vallenet D, Carter EW, Weigele PR, Wood V, Wood-Charlson EM, Xu J. A roadmap for the functional annotation of protein families: a community perspective. Database (Oxford) 2022;2022:baac062. [PMID: 35961013 PMCID: PMC9374478 DOI: 10.1093/database/baac062] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 06/07/2022] [Revised: 06/28/2022] [Accepted: 08/03/2022] [Indexed: 12/23/2022]

Affiliation(s)

Valérie de Crécy-Lagard Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
Rocio Amorin de Hegedus Genetics Institute, University of Florida, Gainesville, FL 32611, USA
Cecilia Arighi Department of Computer and Information Sciences, University of Delaware, Newark, DE 19713, USA
Jill Babor Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
Alex Bateman European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
Ian Blaby US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Crysten Blaby-Haas Biology Department, Brookhaven National Laboratory, Upton, NY 11973, USA
Alan J Bridge Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva 4 CH-1211, Switzerland
Stephen K Burley RCSB Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Stacey Cleveland Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
Lucy J Colwell Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
Ana Conesa Spanish National Research Council, Institute for Integrative Systems Biology, Paterna, Valencia 46980, Spain
Christian Dallago TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, i12, Boltzmannstr. 3, Garching/Munich 85748, Germany
Antoine Danchin School of Biomedical Sciences, Li KaShing Faculty of Medicine, The University of Hong Kong, 21 Sassoon Road, Pokfulam, SAR Hong Kong 999077, China
Anita de Waard Research Collaboration Unit, Elsevier, Jericho, VT 05465, USA
Adam Deutschbauer Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Raquel Dias Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
Yousong Ding Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, University of Florida, Gainesville, FL 32610, USA
Gang Fang NYU-Shanghai, Shanghai 200120, China
Iddo Friedberg Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
John Gerlt Institute for Genomic Biology and Departments of Biochemistry and Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
Joshua Goldford Physics of Living Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Mark Gorelik Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
Benjamin M Gyori Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
Christopher Henry Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
Geoffrey Hutinet Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
Marshall Jaroch Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
Peter D Karp Bioinformatics Research Group, SRI International, Menlo Park, CA 94025, USA
Liudmyla Kondratova Genetics Institute, University of Florida, Gainesville, FL 32611, USA
Zhiyong Lu National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
Aron Marchler-Bauer National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
Maria-Jesus Martin European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
Claire McWhite Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
Gaurav D Moghe Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, USA
Paul Monaghan Department of Agricultural Education and Communication, University of Florida, Gainesville, FL 32611, USA
Anne Morgat Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva 4 CH-1211, Switzerland
Christopher J Mungall Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Darren A Natale Georgetown University Medical Center, Washington, DC 20007, USA
William C Nelson Biological Sciences Division, Pacific Northwest National Laboratories, Richland, WA 99354, USA
Seán O’Donoghue School of Biotechnology and Biomolecular Sciences, University of NSW, Sydney, NSW 2052, Australia
Christine Orengo Department of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
Katherine H O’Toole New England Biolabs, Ipswich, MA 01938, USA
Predrag Radivojac Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
Colbie Reed Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
Richard J Roberts New England Biolabs, Ipswich, MA 01938, USA
Dmitri Rodionov Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA 92037, USA
Irina A Rodionova Department of Bioengineering, Division of Engineering, University of California at San Diego, La Jolla, CA 92093-0412, USA
Jeffrey D Rudolf Department of Chemistry, University of Florida, Gainesville, FL 32611, USA
Lana Saleh New England Biolabs, Ipswich, MA 01938, USA
Gloria Sheynkman Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
Francoise Thibaud-Nissen National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
Paul D Thomas Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90033, USA
Peter Uetz Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA 23284, USA
David Vallenet LABGeM, Génomique Métabolique, CEA, Genoscope, Institut François Jacob, Université d’Évry, Université Paris-Saclay, CNRS, Evry 91057, France
Erica Watson Carter Department of Plant Pathology, University of Florida Citrus Research and Education Center, 700 Experiment Station Rd., Lake Alfred, FL 33850, USA
Peter R Weigele New England Biolabs, Ipswich, MA 01938, USA
Valerie Wood Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK
Elisha M Wood-Charlson Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Jin Xu Department of Plant Pathology, University of Florida Citrus Research and Education Center, 700 Experiment Station Rd., Lake Alfred, FL 33850, USA

Tejani AS, Ng YS, Xi Y, Fielding JR, Browning TG, Rayan JC. Performance of Multiple Pretrained BERT Models to Automate and Accelerate Data Annotation for Large Datasets. Radiol Artif Intell 2022;4:e220007. [PMID: 35923377 PMCID: PMC9344209 DOI: 10.1148/ryai.220007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2022] [Revised: 06/08/2022] [Accepted: 06/14/2022] [Indexed: 06/15/2023]

Abstract

PURPOSE

To develop and evaluate domain-specific and pretrained bidirectional encoder representations from transformers (BERT) models in a transfer learning task on varying training dataset sizes to annotate a larger overall dataset.

MATERIALS AND METHODS

The authors retrospectively reviewed 69 095 anonymized adult chest radiograph reports (reports dated April 2020-March 2021). From the overall cohort, 1004 reports were randomly selected and labeled for the presence or absence of each of the following devices: endotracheal tube (ETT), enterogastric tube (NGT, or Dobhoff tube), central venous catheter (CVC), and Swan-Ganz catheter (SGC). Pretrained transformer models (BERT, PubMedBERT, DistilBERT, RoBERTa, and DeBERTa) were trained, validated, and tested on 60%, 20%, and 20%, respectively, of these reports through fivefold cross-validation. Additional training involved varying dataset sizes with 5%, 10%, 15%, 20%, and 40% of the 1004 reports. The best-performing epochs were used to assess area under the receiver operating characteristic curve (AUC) and determine run time on the overall dataset.

RESULTS

The highest average AUCs from fivefold cross-validation were 0.996 for ETT (RoBERTa), 0.994 for NGT (RoBERTa), 0.991 for CVC (PubMedBERT), and 0.98 for SGC (PubMedBERT). DeBERTa demonstrated the highest AUC for each support device trained on 5% of the training set. PubMedBERT showed a higher AUC with a decreasing training set size compared with BERT. Training and validation time was shortest for DistilBERT at 3 minutes 39 seconds on the annotated cohort.

CONCLUSION

Pretrained and domain-specific transformer models required small training datasets and short training times to create a highly accurate final model that expedites autonomous annotation of large datasets.

Keywords: Informatics, Named Entity Recognition, Transfer Learning

Supplemental material is available for this article. © RSNA, 2022. See also the commentary by Zech in this issue.

Gabella C, Duvaud S, Durinx C. Managing the life cycle of a portfolio of open data resources at the SIB Swiss Institute of Bioinformatics. Brief Bioinform 2022;23:bbab478. [PMID: 34850820 PMCID: PMC8769900 DOI: 10.1093/bib/bbab478] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2021] [Revised: 10/15/2021] [Accepted: 10/18/2021] [Indexed: 11/23/2022] Open

McMahon A, Lewis E, Buniello A, Cerezo M, Hall P, Sollis E, Parkinson H, Hindorff LA, Harris LW, MacArthur JA. Sequencing-based genome-wide association studies reporting standards. Cell Genomics 2021;1:100005. [PMID: 34870259 PMCID: PMC8637874 DOI: 10.1016/j.xgen.2021.100005] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 12/30/2022]

Allot A, Lee K, Chen Q, Luo L, Lu Z. LitSuggest: a web-based system for literature recommendation and curation using machine learning. Nucleic Acids Res 2021;49:W352-W358. [PMID: 33950204 DOI: 10.1093/nar/gkab326] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 03/01/2021] [Revised: 04/16/2021] [Accepted: 04/20/2021] [Indexed: 01/02/2023] Open

Lee K, Wei CH, Lu Z. Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. Brief Bioinform 2021;22:bbaa142. [PMID: 32770181 PMCID: PMC8138883 DOI: 10.1093/bib/bbaa142] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/01/2020] [Revised: 06/07/2020] [Accepted: 06/25/2020] [Indexed: 12/28/2022] Open

Ede JM. Deep learning in electron microscopy. Machine Learning: Science and Technology 2021. [DOI: 10.1088/2632-2153/abd614] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Indexed: 12/12/2022] Open

Jiang X, Li P, Kadin J, Blake JA, Ringwald M, Shatkay H. Integrating image caption information into biomedical document classification in support of biocuration. Database (Oxford) 2021;2020:5819650. [PMID: 32294192 PMCID: PMC7159034 DOI: 10.1093/database/baaa024] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/31/2019] [Revised: 02/10/2020] [Accepted: 03/11/2020] [Indexed: 01/12/2023]

Abstract

Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation.

We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012–2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task. Moreover, our classifier’s performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation.
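As a quick consistency check on the figures quoted in this abstract, the sketch below recomputes the F-measure from the reported precision and recall and the share of relevant documents from the reported counts. It assumes "f-measure" means the balanced F1 (the harmonic mean of precision and recall); the function name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Values reported in the abstract.
precision, recall = 0.698, 0.784
relevant, irrelevant = 5469, 52866

f1 = f_measure(precision, recall)
positive_rate = relevant / (relevant + irrelevant)

print(f"F1 from reported precision/recall: {f1:.4f}")
print(f"Share of relevant documents: {positive_rate:.1%}")
```

The recomputed F1 agrees with the reported 0.738 to within the rounding of the inputs, and fewer than one in ten documents is labeled relevant, which is the class imbalance the abstract refers to.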

Database URL:

Chen Q, Allot A, Lu Z. LitCovid: an open database of COVID-19 literature. Nucleic Acids Res 2021;49:D1534-D1540. [PMID: 33166392 PMCID: PMC7778958 DOI: 10.1093/nar/gkaa952] [Citation(s) in RCA: 137] [Impact Index Per Article: 45.7] [Received: 08/15/2020] [Revised: 10/02/2020] [Accepted: 10/08/2020] [Indexed: 12/22/2022] Open

Moosavi S, Jablonka KM, Smit B. The Role of Machine Learning in the Understanding and Design of Materials. J Am Chem Soc 2020;142:20273-20287. [PMID: 33170678 PMCID: PMC7716341 DOI: 10.1021/jacs.0c09105] [Citation(s) in RCA: 89] [Impact Index Per Article: 22.3] [Received: 08/24/2020] [Indexed: 12/21/2022]

Gobeill J, Caucheteur D, Michel PA, Mottin L, Pasche E, Ruch P. SIB Literature Services: RESTful customizable search engines in biomedical literature, enriched with automatically mapped biomedical concepts. Nucleic Acids Res 2020;48:W12-W16. [PMID: 32379317 PMCID: PMC7319474 DOI: 10.1093/nar/gkaa328] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/10/2020] [Revised: 04/09/2020] [Accepted: 04/22/2020] [Indexed: 01/05/2023] Open

Gynura divaricata exerts hypoglycemic effects by regulating the PI3K/AKT signaling pathway and fatty acid metabolism signaling pathway. Nutr Diabetes 2020;10:31. [PMID: 32796820 PMCID: PMC7427804 DOI: 10.1038/s41387-020-00134-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/16/2019] [Revised: 07/26/2020] [Accepted: 08/05/2020] [Indexed: 12/25/2022] Open

Abstract

OBJECTIVES

The study aimed to examine the anti-diabetic effects of Gynura divaricata (GD) and the underlying mechanism.

METHODS

Information about the chemical compositions of GD was obtained from extensive literature reports. Potential target genes were predicted using PharmMapper and analyzed using Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO). To validate the results from bioinformatics analyses, an aqueous extract of GD was administered to type 2 diabetic rats established by feeding a high-fat and high-sugar diet followed by STZ injection. Key proteins of the PI3K/AKT signaling pathway and fatty acid metabolism signaling pathway were investigated by immunoblotting.

RESULTS

The blood glucose of the rats in the GD treatment group was significantly reduced compared with the model group without treatment. GD also showed activities in reducing the levels of alanine aminotransferase (ALT), aspartate aminotransferase (AST), blood urea nitrogen (BUN), and creatinine (CREA). The levels of urine sugar (U-GLU) and urine creatinine (U-CREA) were also lowered after treatment with GD. Bioinformatics analysis showed that some pathways including metabolic pathways, insulin resistance, insulin signaling pathway, PPAR signaling pathway, bile secretion, purine metabolism, etc. may be regulated by GD. Furthermore, GD significantly increased the protein expression levels of PKM1/2, p-AKT, PI3K p85, and GLUT4 in the rat liver. In addition, the expression levels of key proteins in the fatty acid metabolism signaling pathway including AMPK, p-AMPK, PPARα, and CPT1α were significantly upregulated. The anti-apoptotic protein BCL-2/BAX expression ratio in rats was significantly upregulated after GD intervention. These results were consistent with the bioinformatics analysis results.

CONCLUSIONS

Our study suggests that GD can exert hypoglycemic effects in vivo by regulating the genes at the key nodes of the PI3K/AKT signaling pathway and fatty acid metabolism signaling pathway.

Cirillo D, Catuara-Solarz S, Morey C, Guney E, Subirats L, Mellino S, Gigante A, Valencia A, Rementeria MJ, Chadha AS, Mavridis N. Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare. NPJ Digit Med 2020;3:81. [PMID: 32529043 PMCID: PMC7264169 DOI: 10.1038/s41746-020-0288-5] [Citation(s) in RCA: 153] [Impact Index Per Article: 38.3] [Received: 07/18/2019] [Accepted: 04/28/2020] [Indexed: 01/10/2023] Open

Wei CH, Allot A, Leaman R, Lu Z. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res 2020;47:W587-W593. [PMID: 31114887 DOI: 10.1093/nar/gkz389] [Citation(s) in RCA: 206] [Impact Index Per Article: 51.5] [Received: 02/17/2019] [Revised: 04/08/2019] [Accepted: 04/30/2019] [Indexed: 11/12/2022] Open

Teodoro D, Knafou J, Naderi N, Pasche E, Gobeill J, Arighi CN, Ruch P. UPCLASS: a deep learning-based classifier for UniProtKB entry publications. Database (Oxford) 2020;2020:5822772. [PMID: 32367111 PMCID: PMC7198315 DOI: 10.1093/database/baaa026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/31/2019] [Revised: 02/19/2020] [Accepted: 03/11/2020] [Indexed: 12/20/2022]

Chen Q, Lee K, Yan S, Kim S, Wei CH, Lu Z. BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale. PLoS Comput Biol 2020;16:e1007617. [PMID: 32324731 PMCID: PMC7237030 DOI: 10.1371/journal.pcbi.1007617] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 06/06/2019] [Revised: 05/19/2020] [Accepted: 12/19/2019] [Indexed: 12/14/2022] Open

Abstract

A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. Capturing the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings, which involve learning vector representations of concepts using machine learning models, have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is one of the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which is, to the best of our knowledge, by far the most comprehensive evaluation. The intrinsic evaluation results demonstrate that BioConceptVec consistently outperforms existing concept embeddings by a large margin in identifying similar and related concepts. More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec.

Capturing the semantics of related biological concepts, such as genes and mutations, is of significant importance to many research tasks in computational biology, such as protein-protein interaction detection, gene-drug association prediction, and biomedical literature-based discovery. Here, we propose to leverage state-of-the-art text mining tools and machine learning models to learn the semantics via vector representations (a.k.a. embeddings) of over 400,000 biological concepts mentioned across all PubMed abstracts. Our learned embeddings, namely BioConceptVec, can capture related concepts based on their surrounding contextual information in the literature, which goes beyond exact term match or co-occurrence-based methods. BioConceptVec has been thoroughly evaluated in multiple bioinformatics tasks consisting of over 25 million instances from nine different biological datasets. The evaluation results demonstrate that BioConceptVec has better performance than existing methods in all tasks. Finally, BioConceptVec is made freely available to the research community and general public.
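
As an illustration of how vector representations can be used to identify related concepts, the sketch below ranks concepts by cosine similarity to a query concept. It is a minimal sketch under stated assumptions, not the BioConceptVec pipeline: the abstract does not name the similarity measure, cosine similarity is assumed here because it is a common choice for embeddings, and the concept names and three-dimensional vectors are hand-made toy values rather than real BioConceptVec entries.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_related(query, embeddings, top_k=2):
    """Return the top_k (concept, similarity) pairs closest to the query concept."""
    query_vec = embeddings[query]
    scored = [
        (concept, cosine_similarity(query_vec, vec))
        for concept, vec in embeddings.items()
        if concept != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings: names and values are invented for illustration only.
toy_embeddings = {
    "Gene_A": [0.9, 0.1, 0.0],
    "Gene_B": [0.8, 0.2, 0.1],
    "Drug_X": [0.1, 0.9, 0.2],
    "Disease_Y": [0.0, 0.2, 0.9],
}

ranking = most_related("Gene_A", toy_embeddings)
for concept, score in ranking:
    print(f"{concept}: {score:.3f}")
```

In this toy example, Gene_B ranks first because its vector points in nearly the same direction as Gene_A's, which is the sense in which embeddings capture relatedness beyond exact term matching.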

Burns GA, Li X, Peng N. Building deep learning models for evidence classification from the open access biomedical literature. Database (Oxford) 2019;2019:baz034. [PMID: 30938776 PMCID: PMC6449534 DOI: 10.1093/database/baz034] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 11/01/2018] [Revised: 01/03/2019] [Accepted: 02/20/2019] [Indexed: 11/13/2022]

Hsu YY, Clyne M, Wei CH, Khoury MJ, Lu Z. Using deep learning to identify translational research in genomic medicine beyond bench to bedside. Database (Oxford) 2019;2019:5309020. [PMID: 30753477 PMCID: PMC6367517 DOI: 10.1093/database/baz010] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/18/2018] [Accepted: 01/15/2019] [Indexed: 12/18/2022]

Britan A, Cusin I, Hinard V, Mottin L, Pasche E, Gobeill J, Rech de Laval V, Gleizes A, Teixeira D, Michel PA, Ruch P, Gaudet P. Accelerating annotation of articles via automated approaches: evaluation of the neXtA5 curation-support tool by neXtProt. Database (Oxford) 2018;2018:5255187. [PMID: 30576492 PMCID: PMC6301339 DOI: 10.1093/database/bay129] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 07/21/2018] [Revised: 10/04/2018] [Accepted: 11/09/2018] [Indexed: 11/14/2022]

Abstract

The development of efficient text-mining tools promises to boost the curation workflow by significantly reducing the time needed to process the literature into biological databases. We have developed a curation support tool, neXtA5, that provides a search engine coupled with an annotation system directly integrated into a biocuration workflow. neXtA5 assists curation with modules optimized for the various curation tasks: document triage, entity recognition and information extraction. Here, we describe the evaluation of neXtA5 by expert curators. We first assessed the annotations of two independent curators to provide a baseline for comparison. To evaluate the performance of neXtA5, we submitted requests and compared the neXtA5 results with the manual curation. The analysis focuses on the usability of neXtA5 to support the curation of two types of data: biological processes (BPs) and diseases (Ds). We evaluated the relevance of the papers proposed as well as the recall and precision of the suggested annotations. The evaluation of the precision of document triage by neXtA5 showed that both curators agree with neXtA5 for 67% (BP) and 63% (D) of abstracts, while curators agree on accepting or rejecting an abstract ~80% of the time. Hence, the precision of the triage system is satisfactory. For concept extraction, curators approved 35% (BP) and 25% (D) of the neXtA5 annotations. Conversely, neXtA5 successfully annotated up to 36% (BP) and 68% (D) of the terms identified by curators. The user feedback obtained in these tests highlighted the need for improvement in the ranking function of neXtA5 annotations. Therefore, we transformed the information extraction component into an annotation ranking system. This improvement results in a top precision (precision at first rank) of 59% (D) and 63% (BP). These results suggest that when considering only the first extracted entity, the current system achieves a precision comparable with expert biocurators.
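
The metrics named in this abstract (precision, recall, and precision at first rank) can be illustrated with a small sketch. It assumes the standard set-based definitions and a simple per-abstract average for precision at first rank; the abstract does not state the exact formulas or aggregation used in the neXtA5 evaluation, and the function names, abstract identifiers, and annotation terms below are hypothetical.

```python
def precision_recall(suggested, gold):
    """Return (precision, recall) of suggested annotations against curator-approved ones."""
    suggested_set, gold_set = set(suggested), set(gold)
    true_positives = len(suggested_set & gold_set)
    precision = true_positives / len(suggested_set) if suggested_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


def precision_at_first_rank(ranked_by_abstract, gold_by_abstract):
    """Fraction of abstracts whose top-ranked suggestion is a curator-approved term.

    Abstracts with no suggestions are skipped (an assumption of this sketch).
    """
    scored = [a for a, ranked in ranked_by_abstract.items() if ranked]
    if not scored:
        return 0.0
    hits = sum(
        1 for a in scored
        if ranked_by_abstract[a][0] in gold_by_abstract.get(a, set())
    )
    return hits / len(scored)


# Hypothetical toy data: ranked tool suggestions and curator annotations per abstract.
ranked = {
    "abstract_1": ["apoptosis", "cell cycle", "autophagy"],
    "abstract_2": ["diabetes", "asthma"],
}
gold = {
    "abstract_1": {"apoptosis", "autophagy", "DNA repair"},
    "abstract_2": {"asthma"},
}

p, r = precision_recall(ranked["abstract_1"], gold["abstract_1"])
top_precision = precision_at_first_rank(ranked, gold)
print(f"precision={p:.2f} recall={r:.2f} precision@1={top_precision:.2f}")
```

Under these definitions, precision corresponds to the share of tool annotations that curators approved, and recall to the share of curator-identified terms that the tool also produced, which matches the two directions of comparison described in the abstract.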

Affiliation(s)

Aurore Britan: Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
Isabelle Cusin: Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
Valérie Hinard: Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
Luc Mottin: Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland; SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
Emilie Pasche: Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland; SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
Julien Gobeill: Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland; SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
Valentine Rech de Laval: Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
Anne Gleizes: Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
Daniel Teixeira: Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
Pierre-André Michel: Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
Patrick Ruch: Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland; SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
Pascale Gaudet: Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
