1
Sheng J, Gero Z, Ho JC. PubMed Author-assigned Keyword Extraction (PubMedAKE) Benchmark. PROCEEDINGS OF THE ... ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT. ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT 2022; 2022:4470-4474. [PMID: 36382341 PMCID: PMC9652778 DOI: 10.1145/3511808.3557675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
With the ever-increasing abundance of biomedical articles, improving the accuracy of keyword search results becomes crucial for ensuring reproducible research. However, keyword extraction for biomedical articles is hard due to the existence of obscure keywords and the lack of a comprehensive benchmark. PubMedAKE is an author-assigned keyword extraction dataset that contains the title, abstract, and keywords of over 843,269 articles from the PubMed open access subset database. This dataset, publicly available on Zenodo, is the largest keyword extraction benchmark with sufficient samples to train neural networks. Experimental results using state-of-the-art baseline methods illustrate the need for developing automatic keyword extraction methods for biomedical literature.
Affiliation(s)
- Jiasheng Sheng
  - Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
2
Kim W, Yeganova L, Comeau DC, Wilbur WJ, Lu Z. Towards a unified search: Improving PubMed retrieval with full text. J Biomed Inform 2022; 134:104211. [PMID: 36152950 PMCID: PMC9561061 DOI: 10.1016/j.jbi.2022.104211] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/07/2022] [Revised: 09/12/2022] [Accepted: 09/15/2022] [Indexed: 10/14/2022]
Abstract
OBJECTIVE: A significant number of recent articles in PubMed have full text available in PubMed Central®, and the availability of full texts has been consistently growing. However, it is not currently possible for a user to simultaneously query the contents of both databases and receive a single integrated search result. In this study, we investigate how to score full text articles given a multitoken query and how to combine those full text article scores with scores originating from abstracts to achieve an overall improved retrieval performance. MATERIALS AND METHODS: For scoring full text articles, we propose a method to combine information coming from different sections by converting the traditionally used BM25 scores into log odds ratio scores, which can be treated uniformly. We further propose a method that successfully combines scores from two heterogeneous retrieval sources - full text articles and abstract only articles - by balancing the contributions of their respective scores through a probabilistic transformation. We use PubMed click data that consists of queries sampled from PubMed user logs along with a subset of retrieved and clicked documents to train the probabilistic functions and to evaluate retrieval effectiveness. RESULTS AND CONCLUSIONS: Random ranking achieves a MAP score of 0.579 on our PubMed click data. BM25 ranking on PubMed abstracts improves the MAP by 10.6%. For full text documents, experiments confirm that BM25 section scores are of different value depending on the section type and are not directly comparable. Naïvely using the body text of articles along with abstract text degrades the overall quality of the search. The proposed log odds ratio scores normalize and combine the contributions of occurrences of query tokens in different sections. By including full text where available, we gain another 0.67%, or 7% relative improvement over abstract alone. We find an advantage in the more accurate estimate of the value of BM25 scores depending on the section from which they were produced. Taking the sum of the top three section scores performs the best.
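As an illustration of the MAP metric reported in this abstract, the following is a minimal sketch of mean average precision (MAP). It assumes each query is represented by a ranked list of booleans (True for a clicked, and therefore relevant, document) and that every relevant document appears somewhere in the ranking. The function names and the input format are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average precision for one query.

    Averages precision@k over the ranks k that hold a relevant document.
    Assumption: all relevant documents are present in the ranked list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)


if __name__ == "__main__":
    # Two hypothetical queries with made-up click patterns.
    example = [[True, False, True], [False, True]]
    print(mean_average_precision(example))
```

For the first hypothetical query, relevant documents sit at ranks 1 and 3, so the average precision is the mean of 1/1 and 2/3.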
Affiliation(s)
- Won Kim
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Lana Yeganova
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Donald C Comeau
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- W John Wilbur
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
3
Papageorgiou L, Alkenaris H, Zervou MI, Vlachakis D, Matalliotakis I, Spandidos DA, Bertsias G, Goulielmos GN, Eliopoulos E. Epione application: An integrated web-toolkit of clinical genomics and personalized medicine in systemic lupus erythematosus. Int J Mol Med 2021; 49:8. [PMID: 34791504 PMCID: PMC8612305 DOI: 10.3892/ijmm.2021.5063] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/29/2021] [Accepted: 11/02/2021] [Indexed: 12/16/2022] Open
Abstract
Genome-wide association studies (GWAS) have identified autoimmune disease-associated loci, a number of which are involved in numerous disease-associated pathways. However, many of the underlying genetic and pathophysiological mechanisms remain to be elucidated. Systemic lupus erythematosus (SLE) is a chronic, highly heterogeneous autoimmune disease, characterized by differences in autoantibody profile, serum cytokines and multi-system involvement. This study presents the Epione application, an integrated bioinformatics web-toolkit, designed to assist medical experts and researchers in more accurately diagnosing SLE. The application aims to identify the most credible gene variants and single nucleotide polymorphisms (SNPs) associated with SLE susceptibility, by using the patient's genomic data to aid the medical expert in SLE diagnosis. The application contains useful knowledge from >70,000 SLE-related publications that have been analyzed, using data mining and semantic techniques, to extract the SLE-related genes and the corresponding SNPs. Probable genes associated with the patient's genomic profile are visualized with several graphs, including chromosome ideograms, statistic bars and regulatory networks through data mining studies with relative publications, to obtain a representative number of the most credible candidate genes and biological pathways associated with SLE. Furthermore, an evaluation study was performed on a patient diagnosed with SLE and is presented herein. Epione has also been expanded in family-related candidate patients to evaluate its predictive power. All the recognized gene variants that were previously considered to be associated with SLE were accurately identified in the output profile of the patient, and by comparing the results, novel findings have emerged. The Epione application may assist in and facilitate early-stage diagnosis by using the patient's genomic profile to compare against the list of the most predictable candidate gene variants related to SLE. Its diagnosis-oriented output presents the user with a structured set of results on variant association, position in genome and links to specific bibliography and gene network associations. The overall aim of the present study was to provide a reliable tool for the most effective study of SLE. This novel and accessible webserver tool for SLE is available at http://geneticslab.aua.gr/epione/.
Affiliation(s)
- Louis Papageorgiou
  - Laboratory of Genetics, Department of Biotechnology, Agricultural University of Athens, 11855 Athens, Greece
- Haris Alkenaris
  - Laboratory of Genetics, Department of Biotechnology, Agricultural University of Athens, 11855 Athens, Greece
- Maria I Zervou
  - Section of Molecular Pathology and Human Genetics, Department of Internal Medicine, School of Medicine, University of Crete, 71003 Heraklion, Greece
- Dimitrios Vlachakis
  - Laboratory of Genetics, Department of Biotechnology, Agricultural University of Athens, 11855 Athens, Greece
- Ioannis Matalliotakis
  - Department of Obstetrics and Gynecology, Venizeleio and Pananio General Hospital of Heraklion, 71409 Heraklion, Greece
- Demetrios A Spandidos
  - Laboratory of Clinical Virology, School of Medicine, University of Crete, 71003 Heraklion, Greece
- George Bertsias
  - Department of Rheumatology and Clinical Immunology, School of Medicine, University of Crete, 71003 Heraklion, Greece
- George N Goulielmos
  - Section of Molecular Pathology and Human Genetics, Department of Internal Medicine, School of Medicine, University of Crete, 71003 Heraklion, Greece
- Elias Eliopoulos
  - Laboratory of Genetics, Department of Biotechnology, Agricultural University of Athens, 11855 Athens, Greece
4
Papageorgiou L, Zervou MI, Vlachakis D, Matalliotakis M, Matalliotakis I, Spandidos DA, Goulielmos GN, Eliopoulos E. Demetra Application: An integrated genotype analysis web server for clinical genomics in endometriosis. Int J Mol Med 2021; 47:115. [PMID: 33907838 PMCID: PMC8083807 DOI: 10.3892/ijmm.2021.4948] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/19/2021] [Accepted: 04/15/2021] [Indexed: 12/15/2022] Open
Abstract
Demetra Application is a holistic, integrated and scalable bioinformatics web-based tool designed to assist medical experts and researchers in the process of diagnosing endometriosis. The application identifies the most prominent gene variants and single nucleotide polymorphisms (SNPs) causing endometriosis using the genomic data provided for the patient by a medical expert. The present study analyzed >28,000 endometriosis-related publications using data mining and semantic techniques aimed at extracting the endometriosis-related genes and SNPs. The extracted knowledge was filtered, evaluated, annotated, classified, and stored in the Demetra Application Database (DAD). Moreover, an updated gene regulatory network with the genes implicated in endometriosis was established. This was followed by the design and development of the Demetra Application, in which the generated datasets and results were included. The application was tested and is presented herein with whole-exome sequencing data from seven related patients with endometriosis. Endometriosis-related SNPs and variants identified in genome-wide association studies (GWAS), whole-genome (WGS), whole-exome (WES), or targeted sequencing information were classified, annotated and analyzed in a consolidated patient profile with clinical significance information. Probable genes associated with the patient's genomic profile were visualized using several graphs, including chromosome ideograms, statistic bars and regulatory networks through data mining studies with relative publications, in an effort to obtain a representative number of the most credible candidate genes and biological pathways associated with endometriosis. An evaluation analysis was performed on seven patients from a three-generation family with endometriosis. All the recognized gene variants that were previously considered to be associated with endometriosis were properly identified in the output profile per patient, and by comparing the results, novel findings emerged. This novel and accessible webserver tool for endometriosis, intended to assist medical experts in the clinical genomics and precision medicine procedure, is available at http://geneticslab.aua.gr/.
Affiliation(s)
- Louis Papageorgiou
  - Laboratory of Genetics, Department of Biotechnology, Agricultural University of Athens, 11855 Athens, Greece
- Maria I Zervou
  - Section of Molecular Pathology and Human Genetics, Department of Internal Medicine, School of Medicine, University of Crete, 71003 Heraklion, Greece
- Dimitrios Vlachakis
  - Laboratory of Genetics, Department of Biotechnology, Agricultural University of Athens, 11855 Athens, Greece
- Michail Matalliotakis
  - Section of Molecular Pathology and Human Genetics, Department of Internal Medicine, School of Medicine, University of Crete, 71003 Heraklion, Greece
- Ioannis Matalliotakis
  - Department of Obstetrics and Gynecology, 'Venizeleio and Pananio' General Hospital of Heraklion, 71409 Heraklion, Greece
- Demetrios A Spandidos
  - Laboratory of Clinical Virology, School of Medicine, University of Crete, 71003 Heraklion, Greece
- George N Goulielmos
  - Section of Molecular Pathology and Human Genetics, Department of Internal Medicine, School of Medicine, University of Crete, 71003 Heraklion, Greece
- Elias Eliopoulos
  - Laboratory of Genetics, Department of Biotechnology, Agricultural University of Athens, 11855 Athens, Greece
5
Lee JTH, Patikas N, Kiselev VY, Hemberg M. Fast searches of large collections of single-cell data using scfind. Nat Methods 2021; 18:262-271. [PMID: 33649586 PMCID: PMC7116898 DOI: 10.1038/s41592-021-01076-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/14/2019] [Accepted: 01/20/2021] [Indexed: 01/30/2023]
Abstract
Single-cell technologies have made it possible to profile millions of cells, but for these resources to be useful they must be easy to query and access. To facilitate interactive and intuitive access to single-cell data we have developed scfind, a single-cell analysis tool that facilitates fast search of biologically or clinically relevant marker genes in cell atlases. Using transcriptome data from six mouse cell atlases, we show how scfind can be used to evaluate marker genes, perform in silico gating, and identify both cell-type-specific and housekeeping genes. Moreover, we have developed a subquery optimization routine to ensure that long and complex queries return meaningful results. To make scfind more user friendly, we use indices of PubMed abstracts and techniques from natural language processing to allow for arbitrary queries. Finally, we show how scfind can be used for multi-omics analyses by combining single-cell ATAC-seq data with transcriptome data.
Affiliation(s)
- Nikolaos Patikas
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
  - UK Dementia Research Institute, Department of Clinical Neurosciences, University of Cambridge, Cambridge, UK
- Martin Hemberg
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
  - Evergrande Center for Immunologic Disease, Harvard Medical School and Brigham and Women's Hospital, Boston, MA, USA
6
Yuan C, Wang Y, Shang N, Li Z, Zhao R, Weng C. A graph-based method for reconstructing entities from coordination ellipsis in medical text. J Am Med Inform Assoc 2020; 27:1364-1373. [PMID: 32719840 DOI: 10.1093/jamia/ocaa109] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/08/2019] [Revised: 04/21/2020] [Accepted: 05/12/2020] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE: Coordination ellipsis is a linguistic phenomenon abundant in medical text and is challenging for concept normalization because of the difficulty in accurately recognizing elliptical expressions referencing 2 or more entities. To resolve this bottleneck, we aim to contribute a generalizable method to reconstruct concepts from medical coordinated elliptical expressions in a variety of biomedical corpora. MATERIALS AND METHODS: We proposed a graph-based representation model and built a pipeline to reconstruct concepts from coordinated elliptical expressions in medical text (RECEEM). The pipeline has 4 modules: (1) identify all possible candidate conjunct pairs from original coordinated elliptical expressions, (2) calculate coefficients for candidate conjuncts using the embedding model, (3) select the most appropriate decompositions by global optimization, and (4) rebuild concepts based on a pathfinding algorithm. We evaluated the pipeline's performance on 2658 coordinated elliptical expressions from 3 different medical corpora (ie, biomedical literature, clinical narratives, and eligibility criteria from clinical trials). Precision, recall, and F1 score were calculated. RESULTS: The F1 scores for biomedical publications, clinical narratives, and research eligibility criteria were 0.862, 0.721, and 0.870, respectively. RECEEM outperformed 2 previously released methods. By incorporating RECEEM into 2 existing NLP tools, the F1 scores increased from 0.248 to 0.460 and from 0.287 to 0.630 on concept mapping of 1125 coordination ellipses. CONCLUSIONS: RECEEM improves concept normalization for medical coordinated elliptical expressions in a variety of biomedical corpora. It outperformed existing methods and significantly enhanced the performance of 2 notable NLP systems for mapping coordination ellipses in the evaluation. The algorithm is open sourced online (https://github.com/chiyuan1126/RECEEM).
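The precision, recall, and F1 metrics named in this abstract can be sketched as follows. This is a minimal, generic illustration that assumes predicted and gold-standard concepts are compared as exact-match sets; the function name, the matching rule, and the sample strings are hypothetical and do not reproduce the RECEEM evaluation code.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 under exact matching (an assumption)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up reconstructed concepts for a made-up elliptical expression.
    predicted = {"liver failure", "kidney failure", "heart disease"}
    gold = {"liver failure", "kidney failure", "renal disease", "hepatic disease"}
    print(precision_recall_f1(predicted, gold))
```

Exact set matching is the simplest possible choice here; an evaluation that credits partial matches would need a different overlap rule.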
Affiliation(s)
- Chi Yuan
  - Department of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Yongli Wang
  - Department of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China
- Ning Shang
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Ziran Li
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Ruxin Zhao
  - Department of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
7
Abstract
INTRODUCTION: Artificial intelligence (AI) technologies have continued to attract interest from a broad range of disciplines in recent years, including health. The increase in computer hardware and software applications in medicine, as well as the digitization of health-related data, together fuel progress in the development and use of AI in medicine. This progress provides new opportunities and challenges, as well as directions for the future of AI in health. OBJECTIVE: The goals of this survey are to review the current state of AI in health, along with opportunities, challenges, and practical implications. This review highlights recent developments over the past five years and directions for the future. METHODS: Publications over the past five years reporting the use of AI in health in clinical and biomedical informatics journals, as well as computer science conferences, were selected according to Google Scholar citations. Publications were then categorized into five different classes, according to the type of data analyzed. RESULTS: The major data types identified were multi-omics, clinical, behavioral, environmental and pharmaceutical research and development (R&D) data. The current state of AI related to each data type is described, followed by associated challenges and practical implications that have emerged over the last several years. Opportunities and future directions based on these advances are discussed. CONCLUSION: Technologies have enabled the development of AI-assisted approaches to healthcare. However, there remain challenges. Work is currently underway to address multi-modal data integration, balancing quantitative algorithm performance and qualitative model interpretability, protection of model security, federated learning, and model bias.
Affiliation(s)
- Fei Wang
  - Division of Health Informatics, Department of Healthcare Policy and Research, Weill Cornell Medicine, Cornell University, NY, USA
8
Gero Z, Ho J. PMCVec: Distributed phrase representation for biomedical text processing. J Biomed Inform 2019; 100S:100047. [DOI: 10.1016/j.yjbinx.2019.100047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2018] [Revised: 06/28/2019] [Accepted: 07/05/2019] [Indexed: 10/26/2022]
9
A reference set of curated biomedical data and metadata from clinical case reports. Sci Data 2018; 5:180258. [PMID: 30457569 PMCID: PMC6244181 DOI: 10.1038/sdata.2018.258] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 06/18/2018] [Accepted: 09/27/2018] [Indexed: 12/30/2022] Open
Abstract
Clinical case reports (CCRs) provide an important means of sharing clinical experiences about atypical disease phenotypes and new therapies. However, published case reports contain largely unstructured and heterogeneous clinical data, posing a challenge to mining relevant information. Current indexing approaches generally concern document-level features and have not been specifically designed for CCRs. To address this disparity, we developed a standardized metadata template and identified text corresponding to medical concepts within 3,100 curated CCRs spanning 15 disease groups and more than 750 reports of rare diseases. We also prepared a subset of metadata on reports on selected mitochondrial diseases and assigned ICD-10 diagnostic codes to each. The resulting resource, Metadata Acquired from Clinical Case Reports (MACCRs), contains text associated with high-level clinical concepts, including demographics, disease presentation, treatments, and outcomes for each report. Our template and MACCR set render CCRs more findable, accessible, interoperable, and reusable (FAIR) while serving as valuable resources for key user groups, including researchers, physician investigators, clinicians, data scientists, and those shaping government policies for clinical trials.