1
Tian S, Jin Q, Yeganova L, Lai PT, Zhu Q, Chen X, Yang Y, Chen Q, Kim W, Comeau DC, Islamaj R, Kapoor A, Gao X, Lu Z. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform 2023; 25:bbad493. [PMID: 38168838 PMCID: PMC10762511 DOI: 10.1093/bib/bbad493] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 06/28/2023] [Revised: 11/15/2023] [Accepted: 12/06/2023] [Indexed: 01/05/2024]
Abstract
ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically, we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction, and medical education, and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of the biomedical domain present unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in their generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.
Affiliation(s)
- Shubo Tian
- National Library of Medicine, National Institutes of Health
- Qiao Jin
- National Library of Medicine, National Institutes of Health
- Lana Yeganova
- National Library of Medicine, National Institutes of Health
- Po-Ting Lai
- National Library of Medicine, National Institutes of Health
- Qingqing Zhu
- National Library of Medicine, National Institutes of Health
- Xiuying Chen
- King Abdullah University of Science and Technology
- Yifan Yang
- National Library of Medicine, National Institutes of Health
- Qingyu Chen
- National Library of Medicine, National Institutes of Health
- Won Kim
- National Library of Medicine, National Institutes of Health
- Aadit Kapoor
- National Library of Medicine, National Institutes of Health
- Xin Gao
- King Abdullah University of Science and Technology
- Zhiyong Lu
- National Library of Medicine, National Institutes of Health
2
Jin Q, Kim W, Chen Q, Comeau DC, Yeganova L, Wilbur WJ, Lu Z. MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics 2023; 39:btad651. [PMID: 37930897 PMCID: PMC10627406 DOI: 10.1093/bioinformatics/btad651] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/07/2023] [Revised: 09/29/2023] [Indexed: 11/08/2023]
Abstract
MOTIVATION: Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine.
RESULTS: To train MedCPT, we collected 255 million user click logs from PubMed, an unprecedented scale. With such data, we use contrastive learning to train a closely integrated retriever and re-ranker pair. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models, such as GPT-3-sized cpt-text-XL. MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.
AVAILABILITY AND IMPLEMENTATION: The MedCPT code and model are available at https://github.com/ncbi/MedCPT.
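The abstract above mentions contrastive learning on query-article click pairs without giving the objective. As a rough illustration only, the sketch below shows a generic in-batch-negatives contrastive loss (often called InfoNCE). It is not taken from the MedCPT code: the function names, the `temperature` parameter, and the use of plain dot products as the similarity are all assumptions made for this example.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def in_batch_contrastive_loss(query_embs, article_embs, temperature=1.0):
    """Generic in-batch contrastive loss (hypothetical helper, not MedCPT's code).

    article_embs[i] is assumed to be the clicked article for query_embs[i];
    every other article in the batch is treated as a negative for that query.
    """
    losses = []
    for i, q in enumerate(query_embs):
        # Similarity of this query to every article in the batch.
        logits = [dot(q, d) / temperature for d in article_embs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the clicked article.
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)

# Toy batch: each query is aligned with its own article.
aligned = [[10.0, 0.0], [0.0, 10.0]]
loss_aligned = in_batch_contrastive_loss(aligned, aligned)
loss_swapped = in_batch_contrastive_loss(aligned, aligned[::-1])
```

In this toy batch the loss is close to zero when each query matches its own article and large when the pairs are swapped, which is the behaviour a retriever trained this way is meant to exploit.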
Affiliation(s)
- Qiao Jin
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
- Won Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
- Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
- Donald C Comeau
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
- Lana Yeganova
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
- W John Wilbur
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
3
Tian S, Jin Q, Yeganova L, Lai PT, Zhu Q, Chen X, Yang Y, Chen Q, Kim W, Comeau DC, Islamaj R, Kapoor A, Gao X, Lu Z. Opportunities and Challenges for ChatGPT and Large Language Models in Biomedicine and Health. ArXiv 2023:arXiv:2306.10070v2. [PMID: 37904734 PMCID: PMC10614979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically, we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction, and medical education, and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of the biomedical domain present unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in their generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.
Affiliation(s)
- Shubo Tian
- National Library of Medicine, National Institutes of Health
- Qiao Jin
- National Library of Medicine, National Institutes of Health
- Lana Yeganova
- National Library of Medicine, National Institutes of Health
- Po-Ting Lai
- National Library of Medicine, National Institutes of Health
- Qingqing Zhu
- National Library of Medicine, National Institutes of Health
- Xiuying Chen
- King Abdullah University of Science and Technology
- Yifan Yang
- National Library of Medicine, National Institutes of Health
- Qingyu Chen
- National Library of Medicine, National Institutes of Health
- Won Kim
- National Library of Medicine, National Institutes of Health
- Aadit Kapoor
- National Library of Medicine, National Institutes of Health
- Xin Gao
- King Abdullah University of Science and Technology
- Zhiyong Lu
- National Library of Medicine, National Institutes of Health
4
Kim W, Yeganova L, Comeau DC, Wilbur WJ, Lu Z. Towards a unified search: Improving PubMed retrieval with full text. J Biomed Inform 2022; 134:104211. [PMID: 36152950 PMCID: PMC9561061 DOI: 10.1016/j.jbi.2022.104211] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/07/2022] [Revised: 09/12/2022] [Accepted: 09/15/2022] [Indexed: 10/14/2022]
Abstract
OBJECTIVE: A significant number of recent articles in PubMed have full text available in PubMed Central®, and the availability of full texts has been consistently growing. However, it is not currently possible for a user to simultaneously query the contents of both databases and receive a single integrated search result. In this study, we investigate how to score full text articles given a multitoken query and how to combine those full text article scores with scores originating from abstracts to achieve overall improved retrieval performance.
MATERIALS AND METHODS: For scoring full text articles, we propose a method to combine information coming from different sections by converting the traditionally used BM25 scores into log odds ratio scores, which can be treated uniformly. We further propose a method that successfully combines scores from two heterogeneous retrieval sources (full text articles and abstract-only articles) by balancing the contributions of their respective scores through a probabilistic transformation. We use PubMed click data that consists of queries sampled from PubMed user logs along with a subset of retrieved and clicked documents to train the probabilistic functions and to evaluate retrieval effectiveness.
RESULTS AND CONCLUSIONS: Random ranking achieves a MAP score of 0.579 on our PubMed click data. BM25 ranking on PubMed abstracts improves the MAP by 10.6%. For full text documents, experiments confirm that BM25 section scores are of different value depending on the section type and are not directly comparable. Naïvely using the body text of articles along with abstract text degrades the overall quality of the search. The proposed log odds ratio scores normalize and combine the contributions of occurrences of query tokens in different sections. By including full text where available, we gain another 0.67%, or 7% relative improvement over abstract alone. We find an advantage in the more accurate estimate of the value of BM25 scores depending on the section from which they were produced. Taking the sum of top three section scores performs the best.
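The abstract describes turning per-section BM25 scores into log odds ratio scores and summing the top three. The sketch below illustrates only that last combination step, under loud assumptions: the per-section click probabilities are made-up stand-ins for whatever the trained probabilistic functions would output, the prior click probability `p_prior` is hypothetical, and the function names are invented here. It is not the authors' implementation.

```python
import math

def log_odds_ratio(p_click, p_prior):
    """Log of the click odds given the section evidence, relative to prior odds."""
    odds = p_click / (1.0 - p_click)
    prior_odds = p_prior / (1.0 - p_prior)
    return math.log(odds / prior_odds)

def combine_sections(section_probs, p_prior, top_k=3):
    """Sum the top_k per-section log odds ratio scores (hypothetical helper).

    section_probs maps a section name to an assumed click probability that
    some calibrated function of the section's BM25 score would produce.
    """
    scores = sorted(
        (log_odds_ratio(p, p_prior) for p in section_probs.values()),
        reverse=True,
    )
    return sum(scores[:top_k])

# Made-up probabilities for one article and one query.
example = {"title": 0.8, "abstract": 0.7, "methods": 0.5, "references": 0.2}
score = combine_sections(example, p_prior=0.5)
```

Because every section score is on the same log odds scale, a weak match in a low-value section (here `references`) falls outside the top three instead of diluting the stronger evidence.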
Affiliation(s)
- Won Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Lana Yeganova
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Donald C Comeau
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- W John Wilbur
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
5
Sayers EW, Bolton EE, Brister JR, Canese K, Chan J, Comeau DC, Connor R, Funk K, Kelly C, Kim S, Madej T, Marchler-Bauer A, Lanczycki C, Lathrop S, Lu Z, Thibaud-Nissen F, Murphy T, Phan L, Skripchenko Y, Tse T, Wang J, Williams R, Trawick BW, Pruitt KD, Sherry ST. Database resources of the national center for biotechnology information. Nucleic Acids Res 2021; 50:D20-D26. [PMID: 34850941 DOI: 10.1093/nar/gkab1112] [Citation(s) in RCA: 711] [Impact Index Per Article: 237.0] [Received: 09/18/2021] [Revised: 10/20/2021] [Accepted: 11/18/2021] [Indexed: 11/14/2022]
Abstract
The National Center for Biotechnology Information (NCBI) produces a variety of online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, RefSeq, SRA, Virus, dbSNP, dbVar, ClinicalTrials.gov, MMDB, iCn3D and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
Affiliation(s)
- Eric W Sayers
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Evan E Bolton
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- J Rodney Brister
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Kathi Canese
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Jessica Chan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Ryan Connor
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Kathryn Funk
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Chris Kelly
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Sunghwan Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Tom Madej
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Aron Marchler-Bauer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Christopher Lanczycki
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Stacy Lathrop
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Francoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Terence Murphy
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Lon Phan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Yuri Skripchenko
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Tony Tse
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Jiyao Wang
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Rebecca Williams
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Barton W Trawick
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Kim D Pruitt
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Stephen T Sherry
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
6
Islamaj R, Leaman R, Kim S, Kwon D, Wei CH, Comeau DC, Peng Y, Cissel D, Coss C, Fisher C, Guzman R, Kochar PG, Koppel S, Trinh D, Sekiya K, Ward J, Whitman D, Schmidt S, Lu Z. NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature. Sci Data 2021; 8:91. [PMID: 33767203 PMCID: PMC7994842 DOI: 10.1038/s41597-021-00875-1] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 11/26/2020] [Accepted: 01/19/2021] [Indexed: 11/13/2022]
Abstract
Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects and interactions with diseases, genes and other chemicals. We therefore present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations, mapped to ~2000 MeSH identifiers. We also describe a substantially improved chemical entity tagger, with automated annotations for all of PubMed and PMC freely accessible through the PubTator web-based interface and API. The NLM-Chem corpus is freely available.
Affiliation(s)
- Rezarta Islamaj
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Robert Leaman
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Sun Kim
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Dongseop Kwon
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Chih-Hsuan Wei
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Donald C Comeau
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Yifan Peng
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- David Cissel
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Cathleen Coss
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Carol Fisher
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Rob Guzman
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Preeti Gokal Kochar
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Stella Koppel
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Dorothy Trinh
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Keiko Sekiya
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Janice Ward
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Deborah Whitman
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Susan Schmidt
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Zhiyong Lu
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
7
Sayers EW, Beck J, Bolton EE, Bourexis D, Brister JR, Canese K, Comeau DC, Funk K, Kim S, Klimke W, Marchler-Bauer A, Landrum M, Lathrop S, Lu Z, Madden TL, O’Leary N, Phan L, Rangwala SH, Schneider VA, Skripchenko Y, Wang J, Ye J, Trawick BW, Pruitt KD, Sherry ST. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2021; 49:D10-D17. [PMID: 33095870 PMCID: PMC7778943 DOI: 10.1093/nar/gkaa892] [Citation(s) in RCA: 410] [Impact Index Per Article: 136.7] [Received: 09/15/2020] [Revised: 09/25/2020] [Accepted: 10/08/2020] [Indexed: 11/14/2022]
Abstract
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 34 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Custom implementations of the BLAST program provide sequence-based searching of many specialized datasets. New resources released in the past year include a new PubMed interface and NCBI datasets. Additional resources that were updated in the past year include PMC, Bookshelf, Genome Data Viewer, SRA, ClinVar, dbSNP, dbVar, Pathogen Detection, BLAST, Primer-BLAST, IgBLAST, iCn3D and PubChem. All of these resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
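Since the abstract names the E-utilities as the programming interface to Entrez, a small example may help. The sketch below only builds an ESearch request URL and sends nothing over the network. The base URL and the `db`, `term`, `retmax` and `retmode` parameters are written from memory of the public E-utilities interface and should be checked against the current NCBI documentation; the helper name `esearch_url` is invented for this example.

```python
from urllib.parse import urlencode

# Assumed base URL of the public E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for an Entrez database (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode escapes the query term, for example spaces become '+'.
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

url = esearch_url("pubmed", "chemical entity recognition", retmax=5)
```

Fetching `url` with any HTTP client would be the next step. NCBI publishes usage guidelines for these services, so a real client should follow them rather than issue unthrottled requests.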
Affiliation(s)
- Eric W Sayers
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Jeffrey Beck
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Evan E Bolton
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Devon Bourexis
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- James R Brister
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Kathi Canese
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Kathryn Funk
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Sunghwan Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- William Klimke
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Aron Marchler-Bauer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Melissa Landrum
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Stacy Lathrop
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Thomas L Madden
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Nuala O’Leary
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Lon Phan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Sanjida H Rangwala
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Valerie A Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Yuri Skripchenko
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Jiyao Wang
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Jian Ye
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Barton W Trawick
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Kim D Pruitt
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Stephen T Sherry
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
8
Comeau DC, Wei CH, Islamaj Doğan R, Lu Z. PMC text mining subset in BioC: about three million full-text articles and growing. Bioinformatics 2020; 35:3533-3535. [PMID: 30715220 DOI: 10.1093/bioinformatics/btz070] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 08/03/2018] [Revised: 01/17/2018] [Accepted: 01/28/2019] [Indexed: 12/19/2022]
Abstract
MOTIVATION: Interest in text mining full-text biomedical research articles is growing. To facilitate automated processing of nearly 3 million full-text articles (in PubMed Central® Open Access and Author Manuscript subsets) and to improve interoperability, we convert these articles to BioC, a community-driven simple data structure in either XML or JavaScript Object Notation format for conveniently sharing text and annotations.
RESULTS: The resultant articles can be downloaded via both File Transfer Protocol for bulk access and a Web API for updates or a more focused collection. Since the availability of the Web API in 2017, our BioC collection has been widely used by the research community.
AVAILABILITY AND IMPLEMENTATION: https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/.
Affiliation(s)
- Donald C Comeau
- National Center for Biotechnology Information (NCBI), U.S. Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), U.S. Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Rezarta Islamaj Doğan
- National Center for Biotechnology Information (NCBI), U.S. Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), U.S. Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
|
9
|
Allot A, Chen Q, Kim S, Vera Alvarez R, Comeau DC, Wilbur WJ, Lu Z. LitSense: making sense of biomedical literature at sentence level. Nucleic Acids Res 2020; 47:W594-W599. [PMID: 31020319 DOI: 10.1093/nar/gkz289] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 02/08/2019] [Revised: 04/05/2019] [Accepted: 04/10/2019] [Indexed: 11/15/2022]
Abstract
Literature search is a routine practice for scientific studies as new discoveries build on knowledge from the past. Current tools (e.g. PubMed, PubMed Central), however, generally require significant effort in query formulation and optimization (especially in searching the full-length articles) and do not allow direct retrieval of specific statements, which is key for tasks such as comparing/validating new findings with previous knowledge and performing evidence attribution in biocuration. Thus, we introduce LitSense, which is the first web-based system that specializes in sentence retrieval for biomedical literature. LitSense provides unified access to PubMed and PMC content with over a half-billion sentences in total. Given a query, LitSense returns best-matching sentences using both a traditional term-weighting approach that up-weights sentences that contain more of the rare terms in the user query as well as a novel neural embedding approach that enables the retrieval of semantically relevant results without explicit keyword match. LitSense provides a user-friendly interface that assists its users to quickly browse the returned sentences in context and/or further filter search results by section or publication date. LitSense also employs PubTator to highlight biomedical entities (e.g. gene/proteins) in the sentences for better result visualization. LitSense is freely available at https://www.ncbi.nlm.nih.gov/research/litsense.
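To make the term-weighting idea concrete, here is a minimal sketch that ranks sentences by summing an inverse-document-frequency weight for each query term they contain, so that rarer terms contribute more. It is not LitSense's actual scoring function: the smoothed IDF formula, the whitespace tokenizer, the function names and the toy sentences are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def idf_table(sentences):
    """Inverse document frequency per term, treating each sentence as a document."""
    n = len(sentences)
    df = Counter(term for s in sentences for term in set(s.lower().split()))
    return {term: math.log((n + 1) / (count + 1)) + 1.0 for term, count in df.items()}


def score(query, sentence, idf):
    """Sum the IDF of query terms present in the sentence, so rare terms count more."""
    words = set(sentence.lower().split())
    return sum(idf.get(t, 0.0) for t in set(query.lower().split()) if t in words)


def rank(query, sentences):
    """Return sentences sorted from best to worst match for the query."""
    idf = idf_table(sentences)
    return sorted(sentences, key=lambda s: score(query, s, idf), reverse=True)


# Toy sentences, made up for this example.
SENTENCES = [
    "BRCA1 mutations increase the risk of breast cancer",
    "the risk of infection was low",
    "the study enrolled adult patients",
]
ranked = rank("BRCA1 risk", SENTENCES)
```

In this toy corpus "BRCA1" occurs in one sentence and "risk" in two, so the sentence containing both ranks first. A neural embedding approach would instead compare dense vectors, which is what allows matches without shared keywords.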
Affiliation(s)
- Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Sun Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Roberto Vera Alvarez
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Donald C Comeau
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - W John Wilbur
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
10
|
Sayers EW, Beck J, Brister JR, Bolton EE, Canese K, Comeau DC, Funk K, Ketter A, Kim S, Kimchi A, Kitts PA, Kuznetsov A, Lathrop S, Lu Z, McGarvey K, Madden TL, Murphy TD, O'Leary N, Phan L, Schneider VA, Thibaud-Nissen F, Trawick BW, Pruitt KD, Ostell J. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2020; 48:D9-D16. [PMID: 31602479 DOI: 10.1093/nar/gkz899] [Citation(s) in RCA: 267] [Impact Index Per Article: 66.8] [Received: 09/16/2019] [Accepted: 10/09/2019] [Indexed: 11/14/2022]
Abstract
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts published in life science journals. The Entrez system provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Custom implementations of the BLAST program provide sequence-based searching of many specialized datasets. New resources released in the past year include a new PubMed interface, a sequence database search and a gene orthologs page. Additional resources that were updated in the past year include PMC, Bookshelf, My Bibliography, Assembly, RefSeq, viral genomes, the prokaryotic genome annotation pipeline, Genome Workbench, dbSNP, BLAST, Primer-BLAST, IgBLAST and PubChem. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
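For readers unfamiliar with the E-utilities, the sketch below shows how a search request against an Entrez database might be assembled. The abstract does not list endpoints or parameters; the base URL and the `db`, `term`, `retmax` and `retmode` parameters reflect the public ESearch documentation as best understood here, and the helper name `esearch_url` is hypothetical. The code only builds the URL and does not contact the service.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities as publicly documented; treat it as an assumption here.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch request URL that asks for matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


url = esearch_url("pubmed", "BioC text mining")
```

Keeping URL construction separate from the HTTP call makes it easy to log or rate-limit requests before sending them.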
Affiliation(s)
- Eric W Sayers
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Jeff Beck
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - J Rodney Brister
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Evan E Bolton
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Kathi Canese
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Kathryn Funk
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Anne Ketter
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Sunghwan Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Avi Kimchi
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Paul A Kitts
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Anatoliy Kuznetsov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Stacy Lathrop
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Kelly McGarvey
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Thomas L Madden
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Terence D Murphy
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Nuala O'Leary
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Lon Phan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Valerie A Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Bart W Trawick
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Kim D Pruitt
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - James Ostell
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
|
11
|
Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Wei CH, Comeau DC, Antunes R, Matos S, Chen Q, Elangovan A, Panyam NC, Verspoor K, Liu H, Wang Y, Liu Z, Altinel B, Hüsünbeyi ZM, Özgür A, Fergadis A, Wang CK, Dai HJ, Tran T, Kavuluru R, Luo L, Steppi A, Zhang J, Qu J, Lu Z. Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine. Database (Oxford) 2019; 2019:5303240. [PMID: 30689846 PMCID: PMC6348314 DOI: 10.1093/database/bay147] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 03/02/2018] [Accepted: 12/19/2018] [Indexed: 12/16/2022]
Abstract
The Precision Medicine Initiative is a multicenter effort aiming at formulating personalized treatments leveraging on individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of BioCreative VI Precision Medicine Track was to extract this particular type of information and was organized in two tasks: (i) document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations and (ii) relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods. 
Given the level of participation and the individual team results we find the precision medicine track to be successful in engaging the text-mining research community. In the meantime, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPI affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.
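The evaluation figures in this abstract can be related to one another with the usual F-score formula. The sketch below assumes that "F-score" means the balanced F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name is illustrative.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Best triage precision and best triage recall quoted in the abstract, in percent.
combined = f_score(62.89, 98.0)
```

Under that assumption, a single run with both the best precision and the best recall would score about 76.6%, which is above the best reported F-score of 69.06%. This suggests that the three "best" triage figures come from different submissions rather than from one system.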
Affiliation(s)
- Rezarta Islamaj Dogan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | | | - Chih-Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Rui Antunes
- Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
| | - Sérgio Matos
- Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
| | - Qingyu Chen
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Aparna Elangovan
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Nagesh C Panyam
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Hongfang Liu
- Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
| | - Yanshan Wang
- Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
| | - Zhuang Liu
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Berna Altinel
- Department of Computer Engineering, Marmara University, Istanbul, Turkey
| | | | | | - Aris Fergadis
- School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, Athens, Greece
| | - Chen-Kai Wang
- Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan
| | - Hong-Jie Dai
- Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
| | - Tung Tran
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
| | - Ramakanth Kavuluru
- Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
| | - Ling Luo
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Albert Steppi
- Department of Statistics, Florida State University, Florida, USA
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Florida, USA
| | - Jinchan Qu
- Department of Statistics, Florida State University, Florida, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
|
12
|
Kim S, Yeganova L, Comeau DC, Wilbur WJ, Lu Z. PubMed Phrases, an open set of coherent phrases for searching biomedical literature. Sci Data 2018; 5:180104. [PMID: 29893755 PMCID: PMC5996850 DOI: 10.1038/sdata.2018.104] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 09/22/2017] [Accepted: 04/06/2018] [Indexed: 11/09/2022]
Abstract
In biomedicine, key concepts are often expressed by multiple words (e.g., ‘zinc finger protein’). Previous work has shown treating a sequence of words as a meaningful unit, where applicable, is not only important for human understanding but also beneficial for automatic information seeking. Here we present a collection of PubMed® Phrases that are beneficial for information retrieval and human comprehension. We define these phrases as coherent chunks that are logically connected. To collect the phrase set, we apply the hypergeometric test to detect segments of consecutive terms that are likely to appear together in PubMed. These text segments are then filtered using the BM25 ranking function to ensure that they are beneficial from an information retrieval perspective. Thus, we obtain a set of 705,915 PubMed Phrases. We evaluate the quality of the set by investigating PubMed user click data and manually annotating a sample of 500 randomly selected noun phrases. We also analyze and discuss the usage of these PubMed Phrases in literature search.
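The hypergeometric test mentioned here can be sketched in a few lines. The abstract does not spell out how the counts are defined, so the formulation below (documents containing each word, and documents containing both) and the toy numbers are assumptions for illustration only; the BM25 filtering step is not shown.

```python
from math import comb


def hypergeom_tail(N: int, K: int, n: int, k: int) -> float:
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


# Toy counts (made up): 1000 documents, 'zinc' in 50, 'finger' in 40, both in 30.
p_value = hypergeom_tail(N=1000, K=50, n=40, k=30)
```

With these counts the expected overlap by chance is only 2 documents, so an observed overlap of 30 yields a very small tail probability, which is the kind of evidence that would mark the two words as a candidate phrase.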
Affiliation(s)
- Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Lana Yeganova
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|
13
|
Yeganova L, Kim W, Comeau DC, Wilbur WJ, Lu Z. A Field Sensor: computing the composition and intent of PubMed queries. Database (Oxford) 2018; 2018:5053191. [PMID: 30010750 PMCID: PMC6044290 DOI: 10.1093/database/bay052] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 12/29/2017] [Revised: 04/19/2018] [Accepted: 05/17/2018] [Indexed: 11/13/2022]
Abstract
PubMed® is a search engine providing access to a collection of over 27 million biomedical bibliographic records as of 2017. PubMed processes millions of queries a day, and understanding these queries is one of the main building blocks for successful information retrieval. In this work, we present Field Sensor, a domain-specific tool for understanding the composition and predicting the user intent of PubMed queries. Given a query, the Field Sensor infers a field for each token or sequence of tokens in the query in a multi-step process that includes syntactic chunking, rule-based tagging and probabilistic field prediction. In this work, the fields of interest are those associated with (meta-)data elements of each PubMed record such as article title, abstract, author name(s), journal title, volume, issue, page and date. We evaluate the accuracy of our algorithm on a human-annotated corpus of 10 000 PubMed queries, as well as a new machine-annotated set of 103 000 PubMed queries. The Field Sensor achieves accuracies of 93% and 91% on the two corpora, respectively, and finds that nearly half of all searches are navigational (e.g. author searches, article title searches etc.) and half are informational (e.g. topical searches). The Field Sensor has been integrated into PubMed since June 2017 to detect informational queries for which results sorted by relevance can be suggested as an alternative to those sorted by the default date sort. In addition, the composition of PubMed queries as computed by the Field Sensor proves to be essential for understanding how users query PubMed.
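A toy version of the rule-based tagging step might look like the sketch below. The actual rules, field inventory and probabilistic model of the Field Sensor are not described in this abstract, so the three regular expressions, the function names and the navigational heuristic are purely hypothetical stand-ins.

```python
import re

# Illustrative rules only; the real tool's rules and field set are not given here.
RULES = [
    ("date", re.compile(r"^(19|20)\d{2}$")),
    ("page", re.compile(r"^\d+-\d+$")),
    ("volume", re.compile(r"^\d{1,3}$")),
]


def tag_tokens(query: str):
    """Assign each whitespace token the first matching field, else 'unknown'."""
    tagged = []
    for token in query.split():
        field = next((name for name, rx in RULES if rx.match(token)), "unknown")
        tagged.append((token, field))
    return tagged


def looks_navigational(tagged) -> bool:
    """Crude heuristic: citation-like fields suggest the user wants a known article."""
    return any(field != "unknown" for _, field in tagged)


tags = tag_tokens("Lu Z 2018 35 3533-3535")
```

Tokens left as "unknown" by the rules are where a probabilistic predictor would take over, for example to decide whether "Lu Z" is an author name or part of a title.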
Affiliation(s)
- Lana Yeganova
- National Center for Biotechnology Information (NCBI) / National Library of Medicine (NLM) at the National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Won Kim
- National Center for Biotechnology Information (NCBI) / National Library of Medicine (NLM) at the National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Donald C Comeau
- National Center for Biotechnology Information (NCBI) / National Library of Medicine (NLM) at the National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - W John Wilbur
- National Center for Biotechnology Information (NCBI) / National Library of Medicine (NLM) at the National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI) / National Library of Medicine (NLM) at the National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
|
14
|
Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Chang CS, Oughtred R, Rust J, Wilbur WJ, Comeau DC, Dolinski K, Tyers M. The BioC-BioGRID corpus: full text articles annotated for curation of protein-protein and genetic interactions. Database (Oxford) 2017; 2017:baw147. [PMID: 28077563 PMCID: PMC5225395 DOI: 10.1093/database/baw147] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 07/30/2016] [Revised: 10/14/2016] [Accepted: 10/18/2016] [Indexed: 11/13/2022]
Abstract
A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein–protein and genetic interactions (PPIs and GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. 
The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report. Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html
Affiliation(s)
- Rezarta Islamaj Dogan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Andrew Chatr-Aryamontri
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada
| | - Christie S Chang
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Rose Oughtred
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Jennifer Rust
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Kara Dolinski
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Mike Tyers
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada; Mount Sinai Hospital, The Lunenfeld-Tanenbaum Research Institute, Canada
|
15
|
Kim S, Islamaj Doğan R, Chatr-Aryamontri A, Chang CS, Oughtred R, Rust J, Batista-Navarro R, Carter J, Ananiadou S, Matos S, Santos A, Campos D, Oliveira JL, Singh O, Jonnagaddala J, Dai HJ, Su ECY, Chang YC, Su YC, Chu CH, Chen CC, Hsu WL, Peng Y, Arighi C, Wu CH, Vijay-Shanker K, Aydın F, Hüsünbeyi ZM, Özgür A, Shin SY, Kwon D, Dolinski K, Tyers M, Wilbur WJ, Comeau DC. BioCreative V BioC track overview: collaborative biocurator assistant task for BioGRID. Database (Oxford) 2016; 2016:baw121. [PMID: 27589962 PMCID: PMC5009341 DOI: 10.1093/database/baw121] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 05/06/2016] [Accepted: 08/02/2016] [Indexed: 11/14/2022]
Abstract
BioC is a simple XML format for text, annotations and relations, and was developed to achieve interoperability for biomedical text processing. Following the success of BioC in BioCreative IV, the BioCreative V BioC track addressed a collaborative task to build an assistant system for BioGRID curation. In this paper, we describe the framework of the collaborative BioC task and discuss our findings based on the user survey. This track consisted of eight subtasks including gene/protein/organism named entity recognition, protein-protein/genetic interaction passage identification and annotation visualization. Using BioC as their data-sharing and communication medium, nine teams, world-wide, participated and contributed either new methods or improvements of existing tools to address different subtasks of the BioC track. Results from different teams were shared in BioC and made available to other teams as they addressed different subtasks of the track. In the end, all submitted runs were merged using a machine learning classifier to produce an optimized output. The biocurator assistant system was evaluated by four BioGRID curators in terms of practical usability. The curators' feedback was overall positive and highlighted the user-friendly design and the convenient gene/protein curation tool based on text mining. Database URL: http://www.biocreative.org/tasks/biocreative-v/track-1-bioc/.
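To show what exchanging text and annotations in BioC XML can look like in practice, the sketch below reads gene annotations from a small hand-written document using only the Python standard library. The sample is trimmed for brevity (a full BioC file carries additional header elements), and the function name and sample content are illustrative assumptions, not material from the track.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed example of the BioC layout: collection > document > passage > annotation.
SAMPLE = """<collection>
  <source>example</source>
  <document>
    <id>12345</id>
    <passage>
      <infon key="type">abstract</infon>
      <offset>0</offset>
      <text>BRCA1 interacts with BARD1.</text>
      <annotation id="T1">
        <infon key="type">Gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text: str):
    """Return (document id, annotation type, offset, length, text) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for ann in document.iter("annotation"):
            ann_type = ann.findtext("infon[@key='type']")
            loc = ann.find("location")
            rows.append((doc_id, ann_type, int(loc.get("offset")),
                         int(loc.get("length")), ann.findtext("text")))
    return rows


rows = read_annotations(SAMPLE)
```

Because every tool reads and writes the same structure, one team's named entity output can serve directly as another team's input, which is the interoperability the track relied on.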
Affiliation(s)
- Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Rezarta Islamaj Doğan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Andrew Chatr-Aryamontri
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada
| | - Christie S Chang
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Rose Oughtred
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Jennifer Rust
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Riza Batista-Navarro
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
| | - Jacob Carter
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
| | - Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
| | - Sérgio Matos
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
| | - André Santos
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
| | - David Campos
- BMD Software, Lda, Rua Calouste Gulbenkian 1, 3810-074 Aveiro, Portugal
| | - José Luís Oliveira
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
| | - Onkar Singh
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
| | - Jitendra Jonnagaddala
- School of Public Health and Community Medicine, University of New South Wales, Kensington NSW 2033, Australia; Prince of Wales Clinical School, University of New South Wales, Kensington NSW 2033, Australia
| | - Hong-Jie Dai
- Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan
| | - Emily Chia-Yu Su
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
| | - Yung-Chun Chang
- Institute of Information Science, Academia Sinica, Taipei, Taiwan; Department of Information Management, National Taiwan University, Taipei, Taiwan
| | - Yu-Chen Su
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
| | - Chun-Han Chu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Chien Chin Chen
- Department of Information Management, National Taiwan University, Taipei, Taiwan
| | - Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Yifan Peng
- Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA
| | - Cecilia Arighi
- Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA; Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE 19716, USA
| | - Cathy H Wu
- Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA; Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE 19716, USA
| | - K Vijay-Shanker
- Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA
| | - Ferhat Aydın
- Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
| | - Zehra Melce Hüsünbeyi
- Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
| | - Arzucan Özgür
- Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
| | - Soo-Yong Shin
- Department of Biomedical Informatics, Asan Medical Center, 138-736 Seoul, South Korea
| | - Dongseop Kwon
- Department of Computer Engineering, Myongji University, 449-728 Yongin, South Korea
| | - Kara Dolinski
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Mike Tyers
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada; The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
16
|
Liu H, Verspoor K, Comeau DC, MacKinlay AD, Wilbur W. Optimizing graph-based patterns to extract biomedical events from the literature. BMC Bioinformatics 2015; 16 Suppl 16:S2. [PMID: 26551594 PMCID: PMC4642081 DOI: 10.1186/1471-2105-16-s16-s2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022] Open
Abstract
In BioNLP-ST 2013: We participated in the BioNLP 2013 shared tasks on event extraction. Our extraction method is based on the search for an approximate subgraph isomorphism between key context dependencies of events and graphs of input sentences. Our system was able to address both the GENIA (GE) task, focusing on 13 molecular biology related event types, and the Cancer Genetics (CG) task, targeting a challenging group of 40 cancer biology related event types with varying arguments concerning 18 kinds of biological entities. In addition to adapting our system to the two tasks, we also attempted to integrate semantics into the graph matching scheme using a distributional similarity model for more events, and evaluated the event extraction impact of using paths of all possible lengths as key context dependencies beyond using only the shortest paths in our system. We achieved a 46.38% F-score in the CG task (ranking 3rd) and a 48.93% F-score in the GE task (ranking 4th). After BioNLP-ST 2013: We explored three ways to further extend our event extraction system in our previously published work: (1) We allowed non-essential nodes to be skipped, and incorporated a node-skipping penalty into the subgraph distance function of our approximate subgraph matching algorithm. (2) Instead of assigning a unified subgraph distance threshold to all patterns of an event type, we learned a customized threshold for each pattern. (3) We implemented the well-known Empirical Risk Minimization (ERM) principle to optimize the event pattern set by balancing prediction errors on training data against regularization. When evaluated on the official GE task test data, these extensions helped to improve the extraction precision from 62% to 65%. However, the overall F-score stayed equivalent to the previous performance due to a 1% drop in recall.
|
17
|
Comeau DC, Batista-Navarro RT, Dai HJ, Doğan RI, Yepes AJ, Khare R, Lu Z, Marques H, Mattingly CJ, Neves M, Peng Y, Rak R, Rinaldi F, Tsai RTH, Verspoor K, Wiegers TC, Wu CH, Wilbur WJ. BioC interoperability track overview. Database (Oxford) 2014; 2014:bau053. [PMID: 24980129 PMCID: PMC4074764 DOI: 10.1093/database/bau053] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 11/13/2022]
Abstract
BioC is a new simple XML format for sharing biomedical text and annotations and libraries to read and write that format. This promotes the development of interoperable tools for natural language processing (NLP) of biomedical text. The interoperability track at the BioCreative IV workshop featured contributions using or highlighting the BioC format. These contributions included additional implementations of BioC, many new corpora in the format, biomedical NLP tools consuming and producing the format and online services using the format. The ease of use, broad support and rapidly growing number of tools demonstrate the need for and value of the BioC format. Database URL: http://bioc.sourceforge.net/
Affiliation(s)
- Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Riza Theresa Batista-Navarro
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Hong-Jie Dai
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Rezarta Islamaj Doğan
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Antonio Jimeno Yepes
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Ritu Khare
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Hernani Marques
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Carolyn J Mattingly
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Mariana Neves
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Yifan Peng
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Rafal Rak
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Fabio Rinaldi
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Richard Tzong-Han Tsai
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Karin Verspoor
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Thomas C Wiegers
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - Cathy H Wu
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, National Centre for Text Mining and School of Computer Science, University of Manchester, Manchester M1 7DN, UK, Graduate Institute of BioMedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C., Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria Australia 3010, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695-7617, USA, WBI, Institute for Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany, Berlin Brandenburg Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin 13353, Germany, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan, R.O.C., Health and Biomedical Informatics Centre, The University of Melbourne, Parkville, Victoria Australia 3010, Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
|
18
|
Abstract
As part of a community-wide effort for evaluating text mining and information extraction systems applied to the biomedical domain, BioC is focused on the goal of interoperability, currently a major barrier to wide-scale adoption of text mining tools. BioC is a simple XML format, specified by DTD, for exchanging data for biomedical natural language processing. With initial implementations in C++ and Java, BioC provides libraries of code for reading and writing BioC text documents and annotations. We extend BioC to Perl, Python, Go and Ruby. We used SWIG to extend the C++ implementation for Perl and one Python implementation. A second Python implementation and the Ruby implementation use native data structures and libraries. BioC is also implemented in the Google language Go. BioC modules are functional in all of these languages, which can facilitate text mining tasks. BioC implementations are freely available through the BioC site: http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net/
Affiliation(s)
- Wanli Liu, Rezarta Islamaj Doğan, Dongseop Kwon, Hernani Marques, Fabio Rinaldi, W John Wilbur, Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA; Department of Computer Engineering, Myongji University, Yongin, Republic of Korea; and Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland
19
Comeau DC, Liu H, Islamaj Doğan R, Wilbur WJ. Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus. Database (Oxford) 2014; 2014:bau056. [PMID: 24935050 PMCID: PMC4058794 DOI: 10.1093/database/bau056] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 02/05/2023]
Abstract
BioC is a new format and associated code libraries for sharing text and annotations. We have implemented BioC natural language preprocessing pipelines in two popular programming languages: C++ and Java. The current implementations interface with the well-known MedPost and Stanford natural language processing tool sets. The pipeline functionality includes sentence segmentation, tokenization, part-of-speech tagging, lemmatization and sentence parsing. These pipelines can be easily integrated along with other BioC programs into any BioC compliant text mining systems. As an application, we converted the NCBI disease corpus to BioC format, and the pipelines have successfully run on this corpus to demonstrate their functionality. Code and data can be downloaded from http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net
Affiliation(s)
- Donald C Comeau, Haibin Liu, Rezarta Islamaj Doğan, W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
20
Islamaj Doğan R, Comeau DC, Yeganova L, Wilbur WJ. Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford) 2014; 2014:bau044. [PMID: 24914232 PMCID: PMC4051513 DOI: 10.1093/database/bau044] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 11/14/2022]
Abstract
BioC is a recently created XML format to share text data and annotations, and an accompanying input/output library to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomedical text information—that of frequent entity name abbreviation. We selected three different abbreviation definition identification modules, and used the publicly available BioC code to convert these independent modules into BioC-compatible components that interact seamlessly with BioC-formatted data, and other BioC-compatible modules. In addition, we consider four manually annotated corpora of abbreviations in biomedical text: the Ab3P corpus of 1250 PubMed abstracts, the BIOADI corpus of 1201 PubMed abstracts, the old MEDSTRACT corpus of 199 PubMed® citations and the Schwartz and Hearst corpus of 1000 PubMed abstracts. Annotations in these corpora have been re-evaluated by four annotators and their consistency and quality levels have been improved. We converted them to BioC-format and described the representation of the annotations. These corpora are used to measure the three abbreviation-finding algorithms and the results are given. The BioC-compatible modules, when compared with their original form, have no difference in their efficiency, running time or any other comparable aspects. They can be conveniently used as a common pre-processing step for larger multi-layered text-mining endeavors. Database URL: Code and data are available for download at the BioC site: http://bioc.sourceforge.net.
Affiliation(s)
- Rezarta Islamaj Doğan, Donald C Comeau, Lana Yeganova, W John Wilbur
- National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
21
Abstract
Log analysis shows that PubMed users frequently use author names in queries for retrieving scientific literature. However, author name ambiguity may lead to irrelevant retrieval results. To improve the PubMed user experience with author name queries, we designed an author name disambiguation system consisting of similarity estimation and agglomerative clustering. A machine-learning method was employed to score the features for disambiguating a pair of papers with ambiguous names. These features enable the computation of pairwise similarity scores to estimate the probability of a pair of papers belonging to the same author, which drives an agglomerative clustering algorithm regulated by 2 factors: name compatibility and probability level. With transitivity violation correction, high precision author clustering is achieved by focusing on minimizing false-positive pairing. Disambiguation performance is evaluated with manual verification of random samples of pairs from clustering results. When compared with a state-of-the-art system, our evaluation shows that among all the pairs the lumping error rate drops from 10.1% to 2.2% for our system, while the splitting error rises from 1.8% to 7.7%. This results in an overall error rate of 9.9%, compared with 11.9% for the state-of-the-art method. Other evaluations based on gold standard data also show the increase in accuracy of our clustering. We attribute the performance improvement to the machine-learning method driven by a large-scale training set and the clustering algorithm regulated by a name compatibility scheme preferring precision. With integration of the author name disambiguation system into the PubMed search engine, the overall click-through-rate of PubMed users on author name query results improved from 34.9% to 36.9%.
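The error figures quoted in this abstract can be checked with simple arithmetic: the overall pairwise error rate is the sum of the lumping and splitting error rates. The following is a minimal sketch of that check only. The function name `overall_error` is hypothetical and is not taken from the paper.

```python
def overall_error(lumping_pct: float, splitting_pct: float) -> float:
    """Overall pairwise error rate (percent) as lumping plus splitting errors."""
    # Round to one decimal place to match the precision quoted in the abstract.
    return round(lumping_pct + splitting_pct, 1)


# Percentages quoted in the abstract for the two systems.
proposed = overall_error(2.2, 7.7)    # the system described in the abstract
baseline = overall_error(10.1, 1.8)   # the state-of-the-art comparison system
print(proposed, baseline)
```

The trade-off is visible in the inputs: the described system accepts more splitting errors in exchange for far fewer lumping errors, which is consistent with its stated preference for precision.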
Affiliation(s)
- Wanli Liu, Rezarta Islamaj Doğan, Sun Kim, Donald C Comeau, Won Kim, Lana Yeganova, Zhiyong Lu, W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894
22
Comeau DC, Islamaj Doğan R, Ciccarese P, Cohen KB, Krallinger M, Leitner F, Lu Z, Peng Y, Rinaldi F, Torii M, Valencia A, Verspoor K, Wiegers TC, Wu CH, Wilbur WJ. BioC: a minimalist approach to interoperability for biomedical text processing. Database (Oxford) 2013; 2013:bat064. [PMID: 24048470 PMCID: PMC3889917 DOI: 10.1093/database/bat064] [Citation(s) in RCA: 100] [Impact Index Per Article: 9.1] [Indexed: 12/31/2022]
Abstract
A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/
Affiliation(s)
- Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, Harvard Medical School, Harvard University, Boston, MA 02115 USA, Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO 80045, USA, Structural and Computational Biology Group, Spanish National Cancer Research Centre, Madrid E-28029, Spain, Center for Bioinformatics and Computational Biology, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, National ICT Australia (NICTA), Victoria Research Laboratory, The University of Melbourne, Parkville VIC 3010, Australia and Department of Biology, North Carolina State University, Raleigh, NC 27695, USA
23
Abstract
Background: There are several humanly defined ontologies relevant to Medline. However, Medline is a fast growing collection of biomedical documents which creates difficulties in updating and expanding these humanly defined ontologies. Automatically identifying meaningful categories of entities in a large text corpus is useful for information extraction, construction of machine learning features, and development of semantic representations. In this paper we describe and compare two methods for automatically learning meaningful biomedical categories in Medline. The first approach is a simple statistical method that uses part-of-speech and frequency information to extract a list of frequent nouns from Medline. The second method implements an alignment-based technique to learn frequent generic patterns that indicate a hyponymy/hypernymy relationship between a pair of noun phrases. We then apply these patterns to Medline to collect frequent hypernyms as potential biomedical categories. Results: We study and compare these two alternative sets of terms to identify semantic categories in Medline. We find that both approaches produce reasonable terms as potential categories. We also find that there is a significant agreement between the two sets of terms. The overlap between the two methods improves our confidence regarding categories predicted by these independent methods. Conclusions: This study is an initial attempt to extract categories that are discussed in Medline. Rather than imposing external ontologies on Medline, our methods allow categories to emerge from the text.
Affiliation(s)
- Lana Yeganova
- National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
24
Kim W, Yeganova L, Comeau DC, Wilbur WJ. Identifying well-formed biomedical phrases in MEDLINE® text. J Biomed Inform 2012; 45:1035-41. [PMID: 22683889 DOI: 10.1016/j.jbi.2012.05.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/06/2012] [Revised: 05/22/2012] [Accepted: 05/25/2012] [Indexed: 11/26/2022]
Abstract
In the modern world people frequently interact with retrieval systems to satisfy their information needs. Humanly understandable well-formed phrases represent a crucial interface between humans and the web, and the ability to index and search with such phrases is beneficial for human-web interactions. In this paper we consider the problem of identifying humanly understandable, well formed, and high quality biomedical phrases in MEDLINE documents. The main approaches used previously for detecting such phrases are syntactic, statistical, and a hybrid approach combining these two. In this paper we propose a supervised learning approach for identifying high quality phrases. First we obtain a set of known well-formed useful phrases from an existing source and label these phrases as positive. We then extract from MEDLINE a large set of multiword strings that do not contain stop words or punctuation. We believe this unlabeled set contains many well-formed phrases. Our goal is to identify these additional high quality phrases. We examine various feature combinations and several machine learning strategies designed to solve this problem. A proper choice of machine learning methods and features identifies in the large collection strings that are likely to be high quality phrases. We evaluate our approach by making human judgments on multiword strings extracted from MEDLINE using our methods. We find that over 85% of such extracted phrase candidates are humanly judged to be of high quality.
Affiliation(s)
- Won Kim
- National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
25
Abstract
Background: The rapid growth of biomedical literature requires accurate text analysis and text processing tools. Detecting abbreviations and identifying their definitions is an important component of such tools. Most existing approaches for the abbreviation definition identification task employ rule-based methods. While achieving high precision, rule-based methods are limited to the rules defined and fail to capture many uncommon definition patterns. Supervised learning techniques, which offer more flexibility in detecting abbreviation definitions, have also been applied to the problem. However, they require manually labeled training data. Methods: In this work, we develop a machine learning algorithm for abbreviation definition identification in text which makes use of what we term naturally labeled data. Positive training examples are naturally occurring potential abbreviation-definition pairs in text. Negative training examples are generated by randomly mixing potential abbreviations with unrelated potential definitions. The machine learner is trained to distinguish between these two sets of examples. Then, the learned feature weights are used to identify the abbreviation full form. This approach does not require manually labeled training data. Results: We evaluate the performance of our algorithm on the Ab3P, BIOADI and Medstract corpora. Our system demonstrated results that compare favourably to the existing Ab3P and BIOADI systems. We achieve an F-measure of 91.36% on the Ab3P corpus and an F-measure of 87.13% on the BIOADI corpus, which are superior to the results reported by the Ab3P and BIOADI systems. Moreover, we outperform these systems in terms of recall, which is one of our goals.
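The "naturally labeled data" idea in this abstract can be illustrated with a short sketch: positives are abbreviation-definition pairs as they co-occur, and negatives are made by pairing each abbreviation with a definition drawn from a different pair. This shows the mixing step only, under stated assumptions. The function name `make_negative_pairs`, the seed parameter and the three sample pairs are hypothetical, and the paper's candidate extraction, features and learner are not reproduced here.

```python
import random


def make_negative_pairs(positives, seed=0):
    """Pair each abbreviation with a definition taken from a different positive pair."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    positive_set = set(positives)
    definitions = [definition for _, definition in positives]
    negatives = []
    for abbr, _ in positives:
        # Exclude any definition that forms a known positive pair with this abbreviation.
        candidates = [d for d in definitions if (abbr, d) not in positive_set]
        if candidates:
            negatives.append((abbr, rng.choice(candidates)))
    return negatives


# Made-up positive pairs for illustration only.
positives = [
    ("NLP", "natural language processing"),
    ("MeSH", "Medical Subject Headings"),
    ("KNN", "K-nearest neighbor"),
]
negatives = make_negative_pairs(positives)
for pair in negatives:
    print(pair)
```

A classifier trained to separate `positives` from `negatives` then needs no manually labeled data, which is the property the abstract emphasizes.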
Affiliation(s)
- Lana Yeganova
- National Center for Biotechnology Information, NLM, NIH, Bethesda, MD, USA.
26
27
Abstract
A significant fraction of queries in PubMed™ are multiterm queries without parsing instructions. Generally, search engines interpret such queries as collections of terms, and handle them as a Boolean conjunction of these terms. However, analysis of queries in PubMed™ indicates that many such queries are meaningful phrases, rather than simple collections of terms. In this study, we examine whether or not it makes a difference, in terms of retrieval quality, if such queries are interpreted as a phrase or as a conjunction of query terms. And, if it does, what is the optimal way of searching with such queries. To address the question, we developed an automated retrieval evaluation method, based on machine learning techniques, that enables us to evaluate and compare various retrieval outcomes. We show that the class of records that contain all the search terms, but not the phrase, qualitatively differs from the class of records containing the phrase. We also show that the difference is systematic, depending on the proximity of query terms to each other within the record. Based on these results, one can establish the best retrieval order for the records. Our findings are consistent with studies in proximity searching.
Affiliation(s)
- Lana Yeganova
- Contractor, Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bldg. 38A, 8600 Rockville Pike, Bethesda, MD 20894
- Donald C Comeau
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bldg. 38A, 8600 Rockville Pike, Bethesda, MD 20894
- Won Kim
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bldg. 38A, 8600 Rockville Pike, Bethesda, MD 20894
- W John Wilbur
- Principal Investigator, Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bldg. 38A, 8600 Rockville Pike, Bethesda, MD 20894
28
Sohn S, Comeau DC, Kim W, Wilbur WJ. Abbreviation definition identification based on automatic precision estimates. BMC Bioinformatics 2008; 9:402. [PMID: 18817555 PMCID: PMC2576267 DOI: 10.1186/1471-2105-9-402] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.3] [Received: 05/29/2008] [Accepted: 09/25/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The rapid growth of biomedical literature presents challenges for automatic text processing, and one of the challenges is abbreviation identification. The presence of unrecognized abbreviations in text hinders indexing algorithms and adversely affects information retrieval and extraction. Automatic abbreviation definition identification can help resolve these issues. However, abbreviations and their definitions identified by an automatic process are of uncertain validity. Due to the size of databases such as MEDLINE, only a small fraction of abbreviation-definition pairs can be examined manually. An automatic way to estimate the accuracy of abbreviation-definition pairs extracted from text is needed. In this paper we propose an abbreviation definition identification algorithm that employs a variety of strategies to identify the most probable abbreviation definition. In addition, our algorithm produces an accuracy estimate, pseudo-precision, for each strategy without using a human-judged gold standard. The pseudo-precisions determine the order in which the algorithm applies the strategies in seeking to identify the definition of an abbreviation. RESULTS: On the Medstract corpus our algorithm produced 97% precision and 85% recall, which is higher than previously reported results. We also annotated 1250 randomly selected MEDLINE records as a gold standard. On this set we achieved 96.5% precision and 83.2% recall. This compares favourably with the well-known Schwartz and Hearst algorithm. CONCLUSION: We developed an algorithm for abbreviation identification that uses a variety of strategies to identify the most probable definition for an abbreviation and also produces an estimated accuracy of the result. This process is purely automatic.
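This abstract reports precision and recall rather than an F-measure. For readers who want to compare against F-measure figures reported elsewhere, the two can be combined with the standard balanced formula F1 = 2PR / (P + R). The sketch below applies that formula to the quoted figures. The resulting values are derived here and are not reported in the abstract, and the function name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall quoted in the abstract, as fractions.
medstract_f1 = f1_score(0.97, 0.85)       # Medstract corpus
gold_1250_f1 = f1_score(0.965, 0.832)     # 1250 annotated MEDLINE records
print(f"{medstract_f1:.4f} {gold_1250_f1:.4f}")
```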
Affiliation(s)
- Sunghwan Sohn
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
29
Abstract
OBJECTIVES: The aim of this study was to improve naïve Bayes prediction of Medical Subject Headings (MeSH) assignment to documents using optimal training sets found by an active learning inspired method. DESIGN: The authors selected 20 MeSH terms whose occurrences cover a range of frequencies. For each MeSH term, they found an optimal training set, a subset of the whole training set. An optimal training set consists of all documents including a given MeSH term (C1 class) and those documents not including a given MeSH term (C(-1) class) that are closest to the C1 class. These small sets were used to predict MeSH assignments in the MEDLINE database. MEASUREMENTS: Average precision was used to compare MeSH assignment using the naïve Bayes learner trained on the whole training set, optimal sets, and random sets. The authors compared 95% lower confidence limits of average precisions of naïve Bayes with upper bounds for average precisions of a K-nearest neighbor (KNN) classifier. RESULTS: For all 20 MeSH assignments, the optimal training sets produced nearly 200% improvement over use of the whole training sets. In 17 of those MeSH assignments, naïve Bayes using optimal training sets was statistically better than a KNN. In 15 of those, optimal training sets performed better than optimized feature selection. Overall naïve Bayes averaged 14% better than a KNN for all 20 MeSH assignments. Using these optimal sets with another classifier, C-modified least squares (CMLS), produced an additional 6% improvement over naïve Bayes. CONCLUSION: Using a smaller optimal training set greatly improved learning with naïve Bayes. The performance is superior to a KNN. The small training set can be used with other sophisticated learning methods, such as CMLS, where using the whole training set would not be feasible.
Affiliation(s)
- Sunghwan Sohn
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
30
Tanabe L, Thom LH, Matten W, Comeau DC, Wilbur WJ. SemCat: semantically categorized entities for genomics. AMIA Annu Symp Proc 2006; 2006:754-8. [PMID: 17238442 PMCID: PMC1839293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/13/2023]
Abstract
We describe the construction of a semantic database called SemCat consisting of a large number of semantically categorized names relevant to genomics. SemCat can be used to facilitate natural language processing in MEDLINE. We present suitable application areas including biomedical name classification and named entity recognition.
Affiliation(s)
- Lorraine Tanabe
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA
31
Comeau DC, Shavitt I, Jensen P, Bunker PR. An ab initio determination of the potential‐energy surfaces and rotation–vibration energy levels of methylene in the lowest triplet and singlet states and the singlet–triplet splitting. J Chem Phys 1989. [DOI: 10.1063/1.456315] [Citation(s) in RCA: 102] [Impact Index Per Article: 2.9] [Indexed: 11/14/2022] Open