Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Comeau DC, Wei CH, Islamaj Doğan R, Lu Z. PMC text mining subset in BioC: about three million full-text articles and growing. Bioinformatics 2020;35:3533-3535. [PMID: 30715220 DOI: 10.1093/bioinformatics/btz070] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2018] [Revised: 01/17/2018] [Accepted: 01/28/2019] [Indexed: 12/19/2022] Open

For:	Comeau DC, Wei CH, Islamaj Doğan R, Lu Z. PMC text mining subset in BioC: about three million full-text articles and growing. Bioinformatics 2020;35:3533-3535. [PMID: 30715220 DOI: 10.1093/bioinformatics/btz070] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2018] [Revised: 01/17/2018] [Accepted: 01/28/2019] [Indexed: 12/19/2022] Open

Number

Cited by Other Article(s)

Vollmar M, Tirunagari S, Harrus D, Armstrong D, Gáborová R, Gupta D, Afonso MQL, Evans G, Velankar S. Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature. Sci Data 2024;11:1032. [PMID: 39333508 PMCID: PMC11436914 DOI: 10.1038/s41597-024-03841-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2024] [Accepted: 08/29/2024] [Indexed: 09/29/2024] Open

Nastou K, Mehryary F, Ohta T, Luoma J, Pyysalo S, Jensen LJ. RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature. Database (Oxford) 2024;2024:baae095. [PMID: 39265993 PMCID: PMC11394941 DOI: 10.1093/database/baae095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2024] [Revised: 07/31/2024] [Accepted: 08/16/2024] [Indexed: 09/14/2024]

Mehryary F, Nastou K, Ohta T, Jensen LJ, Pyysalo S. STRING-ing together protein complexes: corpus and methods for extracting physical protein interactions from the biomedical literature. BIOINFORMATICS (OXFORD, ENGLAND) 2024;40:btae552. [PMID: 39276156 PMCID: PMC11441320 DOI: 10.1093/bioinformatics/btae552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/01/2024] [Revised: 07/01/2024] [Accepted: 09/12/2024] [Indexed: 09/16/2024]

Nastou K, Koutrouli M, Pyysalo S, Jensen LJ. Improving dictionary-based named entity recognition with deep learning. Bioinformatics 2024;40:ii45-ii52. [PMID: 39230709 PMCID: PMC11373323 DOI: 10.1093/bioinformatics/btae402] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/05/2024] Open

Islamaj R, Wei CH, Lai PT, Luo L, Coss C, Gokal Kochar P, Miliaras N, Rodionov O, Sekiya K, Trinh D, Whitman D, Lu Z. The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop. Database (Oxford) 2024;2024:baae071. [PMID: 39126204 PMCID: PMC11315767 DOI: 10.1093/database/baae071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 06/03/2024] [Accepted: 07/09/2024] [Indexed: 08/12/2024]

Abstract

The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing the participants the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, gene/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them which are disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most of the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies, namely, diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to Single Nucleotide Polymorphism Database. Finally, each annotated relationship is categorized as 'novel' depending on whether it is a novel finding or experimental verification in the publication it is expressed in. This distinction helps differentiate novel findings from other relationships in the same text that provides known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of newly published 400 articles to serve as the test data for the challenge. All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.

Collapse

Bibal A, Salem NM, Cardon R, White EK, Acuna DE, Burke R, Hunter LE. RecSOI: recommending research directions using statements of ignorance. J Biomed Semantics 2024;15:2. [PMID: 38650032 PMCID: PMC11034121 DOI: 10.1186/s13326-024-00304-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2023] [Accepted: 03/23/2024] [Indexed: 04/25/2024] Open

Chandrabhatla AS, Narahari AK, Horgan TM, Patel PD, Sturek JM, Davis CL, Jackson PEH, Bell TD. Machine Learning-based Analysis of Publications Funded by the National Institutes of Health's Initial COVID-19 Pandemic Response. Open Forum Infect Dis 2024;11:ofae156. [PMID: 38659624 PMCID: PMC11041405 DOI: 10.1093/ofid/ofae156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2024] [Accepted: 03/14/2024] [Indexed: 04/26/2024] Open

Abstract

Background

The National Institutes of Health (NIH) mobilized more than $4 billion in extramural funding for the COVID-19 pandemic. Assessing the research output from this effort is crucial to understanding how the scientific community leveraged federal funding and responded to this public health crisis.

Methods

NIH-funded COVID-19 grants awarded between January 2020 and December 2021 were identified from NIH Research Portfolio Online Reporting Tools Expenditures and Results using the "COVID-19 Response" filter. PubMed identifications of publications under these grants were collected and the NIH iCite tool was used to determine citation counts and focus (eg, clinical, animal). iCite and the NIH's LitCOVID database were used to identify publications directly related to COVID-19. Publication titles and Medical Subject Heading terms were used as inputs to a machine learning-based model built to identify common topics/themes within the publications.

Results and Conclusions

We evaluated 2401 grants that resulted in 14 654 publications. The majority of these papers were published in peer-reviewed journals, though 483 were published to preprint servers. In total, 2764 (19%) papers were directly related to COVID-19 and generated 252 029 citations. These papers were mostly clinically focused (62%), followed by cell/molecular (32%), and animal focused (6%). Roughly 60% of preprint publications were cell/molecular-focused, compared with 26% of nonpreprint publications. The machine learning-based model identified the top 3 research topics to be clinical trials and outcomes research (8.5% of papers), coronavirus-related heart and lung damage (7.3%), and COVID-19 transmission/epidemiology (7.2%). This study provides key insights regarding how researchers leveraged federal funding to study the COVID-19 pandemic during its initial phase.

Collapse

Nievas Offidani MA, Delrieux CA. Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990-2023). Data Brief 2024;52:110008. [PMID: 38235175 PMCID: PMC10792687 DOI: 10.1016/j.dib.2023.110008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2023] [Revised: 12/19/2023] [Accepted: 12/20/2023] [Indexed: 01/19/2024] Open

Bramley R, Howe S, Marmanis H. Notes on the data quality of bibliographic records from the MEDLINE database. Database (Oxford) 2023;2023:baad070. [PMID: 37935584 PMCID: PMC10630407 DOI: 10.1093/database/baad070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2023] [Revised: 09/01/2023] [Accepted: 10/04/2023] [Indexed: 11/09/2023]

Yang X, Saha S, Venkatesan A, Tirunagari S, Vartak V, McEntyre J. Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms. Sci Data 2023;10:722. [PMID: 37857688 PMCID: PMC10587067 DOI: 10.1038/s41597-023-02617-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2023] [Accepted: 10/03/2023] [Indexed: 10/21/2023] Open

Wang L, Ambite JL, Appaji A, Bijsterbosch J, Dockes J, Herrick R, Kogan A, Lander H, Marcus D, Moore SM, Poline JB, Rajasekar A, Sahoo SS, Turner MD, Wang X, Wang Y, Turner JA. NeuroBridge: a prototype platform for discovery of the long-tail neuroimaging data. Front Neuroinform 2023;17:1215261. [PMID: 37720825 PMCID: PMC10500076 DOI: 10.3389/fninf.2023.1215261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2023] [Accepted: 08/01/2023] [Indexed: 09/19/2023] Open

Abstract

Introduction

Open science initiatives have enabled sharing of large amounts of already collected data. However, significant gaps remain regarding how to find appropriate data, including underutilized data that exist in the long tail of science. We demonstrate the NeuroBridge prototype and its ability to search PubMed Central full-text papers for information relevant to neuroimaging data collected from schizophrenia and addiction studies.

Methods

The NeuroBridge architecture contained the following components: (1) Extensible ontology for modeling study metadata: subject population, imaging techniques, and relevant behavioral, cognitive, or clinical data. Details are described in the companion paper in this special issue; (2) A natural-language based document processor that leveraged pre-trained deep-learning models on a small-sample document corpus to establish efficient representations for each article as a collection of machine-recognized ontological terms; (3) Integrated search using ontology-driven similarity to query PubMed Central and NeuroQuery, which provides fMRI activation maps along with PubMed source articles.

Results

The NeuroBridge prototype contains a corpus of 356 papers from 2018 to 2021 describing schizophrenia and addiction neuroimaging studies, of which 186 were annotated with the NeuroBridge ontology. The search portal on the NeuroBridge website https://neurobridges.org/ provides an interactive Query Builder, where the user builds queries by selecting NeuroBridge ontology terms to preserve the ontology tree structure. For each return entry, links to the PubMed abstract as well as to the PMC full-text article, if available, are presented. For each of the returned articles, we provide a list of clinical assessments described in the Section "Methods" of the article. Articles returned from NeuroQuery based on the same search are also presented.

Conclusion

The NeuroBridge prototype combines ontology-based search with natural-language text-mining approaches to demonstrate that papers relevant to a user's research question can be identified. The NeuroBridge prototype takes a first step toward identifying potential neuroimaging data described in full-text papers. Toward the overall goal of discovering "enough data of the right kind," ongoing work includes validating the document processor with a larger corpus, extending the ontology to include detailed imaging data, and extracting information regarding data availability from the returned publications and incorporating XNAT-based neuroimaging databases to enhance data accessibility.

Collapse

Affiliation(s)

Lei Wang Psychiatry and Behavioral Health Department, The Ohio State University Wexner Medical Center, Columbus, OH, United States
José Luis Ambite Information Sciences Institute and Computer Science, University of Southern California, Los Angeles, CA, United States
Abhishek Appaji Department of Medical Electronics Engineering, BMS College of Engineering, Bangalore, India
Janine Bijsterbosch Department of Radiology, Washington University in St. Louis, St. Louis, MO, United States
Jerome Dockes Department of Neurology and Neurosurgery, McGill University, Montreal, QC, Canada
Rick Herrick Department of Radiology, Washington University in St. Louis, St. Louis, MO, United States
Alex Kogan Psychiatry and Behavioral Health Department, The Ohio State University Wexner Medical Center, Columbus, OH, United States
Howard Lander Renaissance Computing Institute, Chapel Hill, NC, United States
Daniel Marcus Department of Radiology, Washington University in St. Louis, St. Louis, MO, United States
Stephen M. Moore Department of Radiology, Washington University in St. Louis, St. Louis, MO, United States
Jean-Baptiste Poline Department of Neurology and Neurosurgery, McGill University, Montreal, QC, Canada
Arcot Rajasekar Renaissance Computing Institute, Chapel Hill, NC, United States School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
Satya S. Sahoo Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, United States
Matthew D. Turner Psychiatry and Behavioral Health Department, The Ohio State University Wexner Medical Center, Columbus, OH, United States
Xiaochen Wang College of Information Sciences and Technology, Pennsylvania State University, State College, PA, United States
Yue Wang School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
Jessica A. Turner Psychiatry and Behavioral Health Department, The Ohio State University Wexner Medical Center, Columbus, OH, United States

Collapse

Lin M, Hou B, Mishra S, Yao T, Huo Y, Yang Q, Wang F, Shih G, Peng Y. Enhancing thoracic disease detection using chest X-rays from PubMed Central Open Access. Comput Biol Med 2023;159:106962. [PMID: 37094464 PMCID: PMC10349296 DOI: 10.1016/j.compbiomed.2023.106962] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2023] [Revised: 03/26/2023] [Accepted: 04/18/2023] [Indexed: 04/26/2023]

Islamaj R, Leaman R, Cissel D, Coss C, Denicola J, Fisher C, Guzman R, Kochar PG, Miliaras N, Punske Z, Sekiya K, Trinh D, Whitman D, Schmidt S, Lu Z. NLM-Chem-BC7: manually annotated full-text resources for chemical entity annotation and indexing in biomedical articles. Database (Oxford) 2022;2022:baac102. [PMID: 36458799 PMCID: PMC9716560 DOI: 10.1093/database/baac102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2022] [Revised: 10/17/2022] [Accepted: 11/28/2022] [Indexed: 12/03/2022]

Abstract

The automatic recognition of chemical names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The task is even more challenging when considering the identification of these entities in the article's full text and, furthermore, the identification of candidate substances for that article's metadata [Medical Subject Heading (MeSH) article indexing]. The National Library of Medicine (NLM)-Chem track at BioCreative VII aimed to foster the development of algorithms that can predict with high quality the chemical entities in the biomedical literature and further identify the chemical substances that are candidates for article indexing. As a result of this challenge, the NLM-Chem track produced two comprehensive, manually curated corpora annotated with chemical entities and indexed with chemical substances: the chemical identification corpus and the chemical indexing corpus. The NLM-Chem BioCreative VII (NLM-Chem-BC7) Chemical Identification corpus consists of 204 full-text PubMed Central (PMC) articles, fully annotated for chemical entities by 12 NLM indexers for both span (i.e. named entity recognition) and normalization (i.e. entity linking) using MeSH. This resource was used for the training and testing of the Chemical Identification task to evaluate the accuracy of algorithms in predicting chemicals mentioned in recently published full-text articles. The NLM-Chem-BC7 Chemical Indexing corpus consists of 1333 recently published PMC articles, equipped with chemical substance indexing by manual experts at the NLM. This resource was used for the evaluation of the Chemical Indexing task, which evaluated the accuracy of algorithms in predicting the chemicals that should be indexed, i.e. appear in the listing of MeSH terms for the document. This set was further enriched after the challenge in two ways: (i) 11 NLM indexers manually verified each of the candidate terms appearing in the prediction results of the challenge participants, but not in the MeSH indexing, and the chemical indexing terms appearing in the MeSH indexing list, but not in the prediction results, and (ii) the challenge organizers algorithmically merged the chemical entity annotations in the full text for all predicted chemical entities and used a statistical approach to keep those with the highest degree of confidence. As a result, the NLM-Chem-BC7 Chemical Indexing corpus is a gold-standard corpus for chemical indexing of journal articles and a silver-standard corpus for chemical entity identification in full-text journal articles. Together, these resources are currently the most comprehensive resources for chemical entity recognition, and we demonstrate improvements in the chemical entity recognition algorithms. We detail the characteristics of these novel resources and make them available for the community. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/NLM-Chem-BC7-corpus/.

Collapse

Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. A reproducible experimental survey on biomedical sentence similarity: A string-based method sets the state of the art. PLoS One 2022;17:e0276539. [PMID: 36409715 PMCID: PMC9678326 DOI: 10.1371/journal.pone.0276539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2022] [Accepted: 10/08/2022] [Indexed: 11/22/2022] Open

Abstract

This registered report introduces the largest, and for the first time, reproducible experimental survey on biomedical sentence similarity with the following aims: (1) to elucidate the state of the art of the problem; (2) to solve some reproducibility problems preventing the evaluation of most current methods; (3) to evaluate several unexplored sentence similarity methods; (4) to evaluate for the first time an unexplored benchmark, called Corpus-Transcriptional-Regulation (CTR); (5) to carry out a study on the impact of the pre-processing stages and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (6) to bridge the lack of software and data reproducibility resources for methods and experiments in this line of research. Our reproducible experimental survey is based on a single software platform, which is provided with a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results. In addition, we introduce a new aggregated string-based sentence similarity method, called LiBlock, together with eight variants of current ontology-based methods, and a new pre-trained word embedding model trained on the full-text articles in the PMC-BioC corpus. Our experiments show that our novel string-based measure establishes the new state of the art in sentence similarity analysis in the biomedical domain and significantly outperforms all the methods evaluated herein, with the only exception of one ontology-based method. Likewise, our experiments confirm that the pre-processing stages, and the choice of the NER tool for ontology-based methods, have a very significant impact on the performance of the sentence similarity methods. We also detail some drawbacks and limitations of current methods, and highlight the need to refine the current benchmarks. Finally, a notable finding is that our new string-based method significantly outperforms all state-of-the-art Machine Learning (ML) models evaluated herein.

Collapse

Feng Z, Shen Z, Li H, Li S. e-TSN: an interactive visual exploration platform for target-disease knowledge mapping from literature. Brief Bioinform 2022;23:bbac465. [PMID: 36347537 PMCID: PMC9677481 DOI: 10.1093/bib/bbac465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2022] [Revised: 09/20/2022] [Accepted: 09/27/2022] [Indexed: 11/10/2022] Open

Kim W, Yeganova L, Comeau DC, Wilbur WJ, Lu Z. Towards a unified search: Improving PubMed retrieval with full text. J Biomed Inform 2022;134:104211. [PMID: 36152950 PMCID: PMC9561061 DOI: 10.1016/j.jbi.2022.104211] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2022] [Revised: 09/12/2022] [Accepted: 09/15/2022] [Indexed: 10/14/2022]

Abstract

OBJECTIVE

A significant number of recent articles in PubMed have full text available in PubMed Central®, and the availability of full texts has been consistently growing. However, it is not currently possible for a user to simultaneously query the contents of both databases and receive a single integrated search result. In this study, we investigate how to score full text articles given a multitoken query and how to combine those full text article scores with scores originating from abstracts and achieve an overall improved retrieval performance.

MATERIALS AND METHODS

For scoring full text articles, we propose a method to combine information coming from different sections by converting the traditionally used BM25 scores into log odds ratio scores which can be treated uniformly. We further propose a method that successfully combines scores from two heterogenous retrieval sources - full text articles and abstract only articles - by balancing the contributions of their respective scores through a probabilistic transformation. We use PubMed click data that consists of queries sampled from PubMed user logs along with a subset of retrieved and clicked documents to train the probabilistic functions and to evaluate retrieval effectiveness.

RESULTS AND CONCLUSIONS

Random ranking achieves 0.579 MAP score on our PubMed click data. BM25 ranking on PubMed abstracts improves the MAP by 10.6%. For full text documents, experiments confirm that BM25 section scores are of different value depending on the section type and are not directly comparable. Naïvely using the body text of articles along with abstract text degrades the overall quality of the search. The proposed log odds ratio scores normalize and combine the contributions of occurrences of query tokens in different sections. By including full text where available, we gain another 0.67%, or 7% relative improvement over abstract alone. We find an advantage in the more accurate estimate of the value of BM25 scores depending on the section from which they were produced. Taking the sum of top three section scores performs the best.

Collapse

Artificial Intelligence-Based Medical Data Mining. J Pers Med 2022;12:jpm12091359. [PMID: 36143144 PMCID: PMC9501106 DOI: 10.3390/jpm12091359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2022] [Revised: 08/02/2022] [Accepted: 08/17/2022] [Indexed: 11/17/2022] Open

Grissa D, Junge A, Oprea TI, Jensen LJ. Diseases 2.0: a weekly updated database of disease–gene associations from text mining and data integration. Database (Oxford) 2022;2022:6554833. [PMID: 35348648 PMCID: PMC9216524 DOI: 10.1093/database/baac019] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2021] [Revised: 02/14/2022] [Accepted: 03/11/2022] [Indexed: 12/04/2022]

Beck T, Shorter T, Hu Y, Li Z, Sun S, Popovici CM, McQuibban NAR, Makraduli F, Yeung CS, Rowlands T, Posma JM. Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature. Front Digit Health 2022;4:788124. [PMID: 35243479 PMCID: PMC8885717 DOI: 10.3389/fdgth.2022.788124] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2021] [Accepted: 01/21/2022] [Indexed: 11/18/2022] Open

Abstract

To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardized. The BioC format is a community-driven simple data structure for sharing text and annotations, however there is limited access to biomedical literature in BioC format and a lack of bioinformatics tools to convert online publication HTML formats to BioC. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC. To standardize the description of heterogenous publication sections, the Information Artifact Ontology is used to annotate each section within the BioC output. Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. The BioC specification does not include a data structure for representing publication table data, so we present a JSON format for sharing table content and metadata. Inline tables within full-text HTML files and linked tables within separate HTML files are processed and converted to machine-interpretable table JSON format. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates an abbreviation with the full definition. This abbreviation collection supports text mining tasks such as named entity recognition by including abbreviations unique to individual publications that are not contained within standard bio-ontologies and dictionaries. The Auto-CORPus package is freely available with detailed instructions from GitHub at: https://github.com/omicsNLP/Auto-CORPus.

Collapse

Affiliation(s)

Tim Beck Department of Genetics and Genome Biology, University of Leicester, Leicester, United Kingdom Health Data Research UK (HDR UK), London, United Kingdom *Correspondence: Tim Beck
Tom Shorter Department of Genetics and Genome Biology, University of Leicester, Leicester, United Kingdom
Yan Hu Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, United Kingdom Department of Surgery and Cancer, Imperial College London, London, United Kingdom
Zhuoyu Li Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, United Kingdom
Shujian Sun Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, United Kingdom
Casiana M. Popovici Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, United Kingdom Department of Surgery and Cancer, Imperial College London, London, United Kingdom
Nicholas A. R. McQuibban Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, United Kingdom Centre for Integrative Systems Biology and Bioinformatics (CISBIO), Department of Life Sciences, Imperial College London, London, United Kingdom
Filip Makraduli Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, United Kingdom
Cheng S. Yeung Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, United Kingdom
Thomas Rowlands Department of Genetics and Genome Biology, University of Leicester, Leicester, United Kingdom
Joram M. Posma Health Data Research UK (HDR UK), London, United Kingdom Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, United Kingdom Joram M. Posma

Collapse

Nicholson DN, Rubinetti V, Hu D, Thielk M, Hunter LE, Greene CS. Examining linguistic shifts between preprints and publications. PLoS Biol 2022;20:e3001470. [PMID: 35104289 PMCID: PMC8806061 DOI: 10.1371/journal.pbio.3001470] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2021] [Accepted: 11/05/2021] [Indexed: 11/19/2022] Open

Lastra-Díaz JJ, Lara-Clares A, Garcia-Serrano A. HESML: a real-time semantic measures library for the biomedical domain with a reproducible survey. BMC Bioinformatics 2022;23:23. [PMID: 34991460 PMCID: PMC8734250 DOI: 10.1186/s12859-021-04539-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2020] [Accepted: 12/15/2021] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Ontology-based semantic similarity measures based on SNOMED-CT, MeSH, and Gene Ontology are being extensively used in many applications in biomedical text mining and genomics respectively, which has encouraged the development of semantic measures libraries based on the aforementioned ontologies. However, current state-of-the-art semantic measures libraries have some performance and scalability drawbacks derived from their ontology representations based on relational databases, or naive in-memory graph representations. Likewise, a recent reproducible survey on word similarity shows that one hybrid IC-based measure which integrates a shortest-path computation sets the state of the art in the family of ontology-based semantic measures. However, the lack of an efficient shortest-path algorithm for their real-time computation prevents both their practical use in any application and the use of any other path-based semantic similarity measure.

RESULTS

To bridge the two aforementioned gaps, this work introduces for the first time an updated version of the HESML Java software library especially designed for the biomedical domain, which implements the most efficient and scalable ontology representation reported in the literature, together with a new method for the approximation of the Dijkstra's algorithm for taxonomies, called Ancestors-based Shortest-Path Length (AncSPL), which allows the real-time computation of any path-based semantic similarity measure.

CONCLUSIONS

We introduce a set of reproducible benchmarks showing that HESML outperforms by several orders of magnitude the current state-of-the-art libraries in the three aforementioned biomedical ontologies, as well as the real-time performance and approximation quality of the new AncSPL shortest-path algorithm. Likewise, we show that AncSPL linearly scales regarding the dimension of the common ancestor subgraph regardless of the ontology size. Path-based measures based on the new AncSPL algorithm are up to six orders of magnitude faster than their exact implementation in large ontologies like SNOMED-CT and GO. Finally, we provide a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.

Collapse

Software review: The JATSdecoder package-extract metadata, abstract and sectioned text from NISO-JATS coded XML documents; Insights to PubMed central's open access database. Scientometrics 2021;126:9585-9601. [PMID: 34720253 PMCID: PMC8542361 DOI: 10.1007/s11192-021-04162-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/19/2021] [Accepted: 09/08/2021] [Indexed: 11/17/2022]

Bolt T, Nomi JS, Bzdok D, Uddin LQ. Educating the future generation of researchers: A cross-disciplinary survey of trends in analysis methods. PLoS Biol 2021;19:e3001313. [PMID: 34324488 PMCID: PMC8321514 DOI: 10.1371/journal.pbio.3001313] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2020] [Accepted: 06/07/2021] [Indexed: 12/20/2022] Open

Abstract

Methods for data analysis in the biomedical, life, and social (BLS) sciences are developing at a rapid pace. At the same time, there is increasing concern that education in quantitative methods is failing to adequately prepare students for contemporary research. These trends have led to calls for educational reform to undergraduate and graduate quantitative research method curricula. We argue that such reform should be based on data-driven insights into within- and cross-disciplinary use of analytic methods. Our survey of peer-reviewed literature analyzed approximately 1.3 million openly available research articles to monitor the cross-disciplinary mentions of analytic methods in the past decade. We applied data-driven text mining analyses to the "Methods" and "Results" sections of a large subset of this corpus to identify trends in analytic method mentions shared across disciplines, as well as those unique to each discipline. We found that the t test, analysis of variance (ANOVA), linear regression, chi-squared test, and other classical statistical methods have been and remain the most mentioned analytic methods in biomedical, life science, and social science research articles. However, mentions of these methods have declined as a percentage of the published literature between 2009 and 2020. On the other hand, multivariate statistical and machine learning approaches, such as artificial neural networks (ANNs), have seen a significant increase in the total share of scientific publications. We also found unique groupings of analytic methods associated with each BLS science discipline, such as the use of structural equation modeling (SEM) in psychology, survival models in oncology, and manifold learning in ecology. We discuss the implications of these findings for education in statistics and research methods, as well as within- and cross-disciplinary collaboration.

Collapse

Luo L, Yan S, Lai PT, Veltri D, Oler A, Xirasagar S, Ghosh R, Similuk M, Robinson PN, Lu Z. PhenoTagger: a hybrid method for phenotype concept recognition using human phenotype ontology. Bioinformatics 2021;37:1884-1890. [PMID: 33471061 PMCID: PMC11025364 DOI: 10.1093/bioinformatics/btab019] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2020] [Revised: 11/20/2020] [Accepted: 01/11/2021] [Indexed: 11/14/2022] Open

Abstract

MOTIVATION

Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research. Previous works that address the task typically use dictionary-based matching methods, which can achieve high precision but suffer from lower recall. Recently, machine learning-based methods have been proposed to identify biomedical concepts, which can recognize more unseen concept synonyms by automatic feature learning. However, most methods require large corpora of manually annotated data for model training, which is difficult to obtain due to the high cost of human annotation.

RESULTS

In this article, we propose PhenoTagger, a hybrid method that combines both dictionary and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary, which is then used to automatically build a distantly supervised training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase (n-gram from input sentence) into a corresponding concept label. Finally, the dictionary and machine learning-based prediction results are combined for improved performance. Our method is validated with two HPO corpora, and the results show that PhenoTagger compares favorably to previous methods. In addition, to demonstrate the generalizability of our method, we retrained PhenoTagger using the disease ontology MEDIC for disease concept recognition to investigate the effect of training on different ontologies. Experimental results on the NCBI disease corpus show that PhenoTagger without requiring manually annotated training data achieves competitive performance as compared with state-of-the-art supervised methods.

AVAILABILITYAND IMPLEMENTATION

The source code, API information and data for PhenoTagger are freely available at https://github.com/ncbi-nlp/PhenoTagger.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Collapse

Affiliation(s)

Ling Luo National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
Shankai Yan National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
Po-Ting Lai National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
Daniel Veltri Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 209892, USA
Andrew Oler Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 209892, USA
Sandhya Xirasagar Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 209892, USA
Rajarshi Ghosh Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 209892, USA
Morgan Similuk Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 209892, USA
Peter N Robinson The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
Zhiyong Lu National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA

Collapse

Su J, Wu Y, Ting HF, Lam TW, Luo R. RENET2: high-performance full-text gene-disease relation extraction with iterative training data expansion. NAR Genom Bioinform 2021;3:lqab062. [PMID: 34235433 PMCID: PMC8256824 DOI: 10.1093/nargab/lqab062] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2021] [Revised: 06/16/2021] [Accepted: 06/23/2021] [Indexed: 01/06/2023] Open

Lee K, Wei CH, Lu Z. Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. Brief Bioinform 2021;22:bbaa142. [PMID: 32770181 PMCID: PMC8138883 DOI: 10.1093/bib/bbaa142] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2020] [Revised: 06/07/2020] [Accepted: 06/25/2020] [Indexed: 12/28/2022] Open

Islamaj R, Leaman R, Kim S, Kwon D, Wei CH, Comeau DC, Peng Y, Cissel D, Coss C, Fisher C, Guzman R, Kochar PG, Koppel S, Trinh D, Sekiya K, Ward J, Whitman D, Schmidt S, Lu Z. NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature. Sci Data 2021;8:91. [PMID: 33767203 PMCID: PMC7994842 DOI: 10.1038/s41597-021-00875-1] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2020] [Accepted: 01/19/2021] [Indexed: 11/13/2022] Open

Affiliation(s)

Rezarta Islamaj National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Robert Leaman National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Sun Kim National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Dongseop Kwon National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Chih-Hsuan Wei National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Donald C Comeau National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Yifan Peng National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
David Cissel National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Cathleen Coss National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Carol Fisher National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Rob Guzman National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Preeti Gokal Kochar National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Stella Koppel National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Dorothy Trinh National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Keiko Sekiya National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Janice Ward National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Deborah Whitman National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Susan Schmidt National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Zhiyong Lu National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.

Collapse

Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. Protocol for a reproducible experimental survey on biomedical sentence similarity. PLoS One 2021;16:e0248663. [PMID: 33760855 PMCID: PMC7990182 DOI: 10.1371/journal.pone.0248663] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2020] [Accepted: 03/02/2021] [Indexed: 11/28/2022] Open

Abstract

Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, the proposal of sentence similarity methods for the biomedical domain has attracted a lot of attention in recent years. However, most sentence similarity methods and experimental results reported in the biomedical domain cannot be reproduced for multiple reasons as follows: the copying of previous results without confirmation, the lack of source code and data to replicate both methods and experiments, and the lack of a detailed definition of the experimental setup, among others. As a consequence of this reproducibility gap, the state of the problem can be neither elucidated nor new lines of research be soundly set. On the other hand, there are other significant gaps in the literature on biomedical sentence similarity as follows: (1) the evaluation of several unexplored sentence similarity methods which deserve to be studied; (2) the evaluation of an unexplored benchmark on biomedical sentence similarity, called Corpus-Transcriptional-Regulation (CTR); (3) a study on the impact of the pre-processing stage and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (4) the lack of software and data resources for the reproducibility of methods and experiments in this line of research. Identified these open problems, this registered report introduces a detailed experimental setup, together with a categorization of the literature, to develop the largest, updated, and for the first time, reproducible experimental survey on biomedical sentence similarity. Our aforementioned experimental survey will be based on our own software replication and the evaluation of all methods being studied on the same software platform, which will be specially developed for this work, and it will become the first publicly available software library for biomedical sentence similarity. Finally, we will provide a very detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.

Collapse

Text Mining Gene Selection to Understand Pathological Phenotype Using Biological Big Data. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] Open

Peng Y, Tang Y, Lee S, Zhu Y, Summers RM, Lu Z. COVID-19-CT-CXR: A Freely Accessible and Weakly Labeled Chest X-Ray and CT Image Collection on COVID-19 From Biomedical Literature. IEEE TRANSACTIONS ON BIG DATA 2021;7:3-12. [PMID: 33997112 PMCID: PMC8117951 DOI: 10.1109/tbdata.2020.3035935] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/29/2020] [Revised: 10/09/2020] [Accepted: 10/19/2020] [Indexed: 05/06/2023]

Abstract

The latest threat to global health is the COVID-19 outbreak. Although there exist large datasets of chest X-rays (CXR) and computed tomography (CT) scans, few COVID-19 image collections are currently available due to patient privacy. At the same time, there is a rapid growth of COVID-19-relevant articles in the biomedical literature, including those that report findings on radiographs. Here, we present COVID-19-CT-CXR, a public database of COVID-19 CXR and CT images, which are automatically extracted from COVID-19-relevant articles from the PubMed Central Open Access (PMC-OA) Subset. We extracted figures, associated captions, and relevant figure descriptions in the article and separated compound figures into subfigures. Because a large portion of figures in COVID-19 articles are not CXR or CT, we designed a deep-learning model to distinguish them from other figure types and to classify them accordingly. The final database includes 1,327 CT and 263 CXR images (as of May 9, 2020) with their relevant text. To demonstrate the utility of COVID-19-CT-CXR, we conducted four case studies. (1) We show that COVID-19-CT-CXR, when used as additional training data, is able to contribute to improved deep-learning (DL) performance for the classification of COVID-19 and non-COVID-19 CT. (2) We collected CT images of influenza, another common infectious respiratory illness that may present similarly to COVID-19, and fine-tuned a baseline deep neural network to distinguish a diagnosis of COVID-19, influenza, or normal or other types of diseases on CT. (3) We fine-tuned an unsupervised one-class classifier from non-COVID-19 CXR and performed anomaly detection to detect COVID-19 CXR. (4) From text-mined captions and figure descriptions, we compared 15 clinical symptoms and 20 clinical findings of COVID-19 versus those of influenza to demonstrate the disease differences in the scientific publications. Our database is unique, as the figures are retrieved along with relevant text with fine-grained descriptions, and it can be extended easily in the future. We believe that our work is complementary to existing resources and hope that it will contribute to medical image analysis of the COVID-19 pandemic. The dataset, code, and DL models are publicly available at https://github.com/ncbi-nlp/COVID-19-CT-CXR.

Collapse

Islamaj R, Kwon D, Kim S, Lu Z. TeamTat: a collaborative text annotation tool. Nucleic Acids Res 2020;48:W5-W11. [PMID: 32383756 DOI: 10.1093/nar/gkaa333] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2020] [Revised: 04/16/2020] [Accepted: 04/22/2020] [Indexed: 12/20/2022] Open

Gobeill J, Caucheteur D, Michel PA, Mottin L, Pasche E, Ruch P. SIB Literature Services: RESTful customizable search engines in biomedical literature, enriched with automatically mapped biomedical concepts. Nucleic Acids Res 2020;48:W12-W16. [PMID: 32379317 PMCID: PMC7319474 DOI: 10.1093/nar/gkaa328] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2020] [Revised: 04/09/2020] [Accepted: 04/22/2020] [Indexed: 01/05/2023] Open

Dai S, You R, Lu Z, Huang X, Mamitsuka H, Zhu S. FullMeSH: improving large-scale MeSH indexing with full text. Bioinformatics 2020;36:1533-1541. [PMID: 31596475 DOI: 10.1093/bioinformatics/btz756] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2019] [Revised: 08/29/2019] [Accepted: 10/03/2019] [Indexed: 01/10/2023] Open

Abstract

MOTIVATION

With the rapidly growing biomedical literature, automatically indexing biomedical articles by Medical Subject Heading (MeSH), namely MeSH indexing, has become increasingly important for facilitating hypothesis generation and knowledge discovery. Over the past years, many large-scale MeSH indexing approaches have been proposed, such as Medical Text Indexer, MeSHLabeler, DeepMeSH and MeSHProbeNet. However, the performance of these methods is hampered by using limited information, i.e. only the title and abstract of biomedical articles.

RESULTS

We propose FullMeSH, a large-scale MeSH indexing method taking advantage of the recent increase in the availability of full text articles. Compared to DeepMeSH and other state-of-the-art methods, FullMeSH has three novelties: (i) Instead of using a full text as a whole, FullMeSH segments it into several sections with their normalized titles in order to distinguish their contributions to the overall performance. (ii) FullMeSH integrates the evidence from different sections in a 'learning to rank' framework by combining the sparse and deep semantic representations. (iii) FullMeSH trains an Attention-based Convolutional Neural Network for each section, which achieves better performance on infrequent MeSH headings. FullMeSH has been developed and empirically trained on the entire set of 1.4 million full-text articles in the PubMed Central Open Access subset. It achieved a Micro F-measure of 66.76% on a test set of 10 000 articles, which was 3.3% and 6.4% higher than DeepMeSH and MeSHLabeler, respectively. Furthermore, FullMeSH demonstrated an average improvement of 4.7% over DeepMeSH for indexing Check Tags, a set of most frequently indexed MeSH headings.

AVAILABILITY AND IMPLEMENTATION

The software is available upon request.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Collapse

Wang CCN, Jin J, Chang JG, Hayakawa M, Kitazawa A, Tsai JJP, Sheu PCY. Identification of most influential co-occurring gene suites for gastrointestinal cancer using biomedical literature mining and graph-based influence maximization. BMC Med Inform Decis Mak 2020;20:208. [PMID: 32883271 PMCID: PMC7469322 DOI: 10.1186/s12911-020-01227-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2020] [Accepted: 08/20/2020] [Indexed: 12/02/2022] Open

Weber L, Thobe K, Migueles Lozano OA, Wolf J, Leser U. PEDL: extracting protein-protein associations using deep language models and distant supervision. Bioinformatics 2020;36:i490-i498. [PMID: 32657389 PMCID: PMC7355289 DOI: 10.1093/bioinformatics/btaa430] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/04/2022] Open

Allot A, Chen Q, Kim S, Vera Alvarez R, Comeau DC, Wilbur WJ, Lu Z. LitSense: making sense of biomedical literature at sentence level. Nucleic Acids Res 2020;47:W594-W599. [PMID: 31020319 DOI: 10.1093/nar/gkz289] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2019] [Revised: 04/05/2019] [Accepted: 04/10/2019] [Indexed: 11/15/2022] Open

Wei CH, Allot A, Leaman R, Lu Z. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res 2020;47:W587-W593. [PMID: 31114887 DOI: 10.1093/nar/gkz389] [Citation(s) in RCA: 206] [Impact Index Per Article: 51.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2019] [Revised: 04/08/2019] [Accepted: 04/30/2019] [Indexed: 11/12/2022] Open

Using Manual and Computer-Based Text-Mining to Uncover Research Trends for Apis mellifera. Vet Sci 2020;7:vetsci7020061. [PMID: 32384687 PMCID: PMC7356030 DOI: 10.3390/vetsci7020061] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2020] [Revised: 04/30/2020] [Accepted: 05/02/2020] [Indexed: 12/21/2022] Open

Wang LL, Lo K, Chandrasekhar Y, Reas R, Yang J, Burdick D, Eide D, Funk K, Katsis Y, Kinney R, Li Y, Liu Z, Merrill W, Mooney P, Murdick D, Rishi D, Sheehan J, Shen Z, Stilson B, Wade AD, Wang K, Wang NXR, Wilhelm C, Xie B, Raymond D, Weld DS, Etzioni O, Kohlmeier S. CORD-19: The Covid-19 Open Research Dataset. ARXIV 2020:arXiv:2004.10706v4. [PMID: 32510522 PMCID: PMC7251955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Figures] [Subscribe] [Scholar Register] [Revised: 07/10/2020] [Indexed: 06/11/2023]

Junge A, Jensen LJ. CoCoScore: context-aware co-occurrence scoring for text mining applications using distant supervision. Bioinformatics 2020;36:264-271. [PMID: 31199464 PMCID: PMC6956794 DOI: 10.1093/bioinformatics/btz490] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2018] [Revised: 05/30/2019] [Accepted: 06/10/2019] [Indexed: 12/18/2022] Open