1
|
Huang DL, Zeng Q, Xiong Y, Liu S, Pang C, Xia M, Fang T, Ma Y, Qiang C, Zhang Y, Zhang Y, Li H, Yuan Y. A Combined Manual Annotation and Deep-Learning Natural Language Processing Study on Accurate Entity Extraction in Hereditary Disease Related Biomedical Literature. Interdiscip Sci 2024; 16:333-344. [PMID: 38340264 PMCID: PMC11289304 DOI: 10.1007/s12539-024-00605-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Revised: 01/02/2024] [Accepted: 01/03/2024] [Indexed: 02/12/2024]
Abstract
We report a combined manual annotation and deep-learning natural language processing study to make accurate entity extraction in hereditary disease related biomedical literature. A total of 400 full articles were manually annotated based on published guidelines by experienced genetic interpreters at Beijing Genomics Institute (BGI). The performance of our manual annotations was assessed by comparing our re-annotated results with those publicly available. The overall Jaccard index was calculated to be 0.866 for the four entity types-gene, variant, disease and species. Both a BERT-based large name entity recognition (NER) model and a DistilBERT-based simplified NER model were trained, validated and tested, respectively. Due to the limited manually annotated corpus, Such NER models were fine-tuned with two phases. The F1-scores of BERT-based NER for gene, variant, disease and species are 97.28%, 93.52%, 92.54% and 95.76%, respectively, while those of DistilBERT-based NER are 95.14%, 86.26%, 91.37% and 89.92%, respectively. Most importantly, the entity type of variant has been extracted by a large language model for the first time and a comparable F1-score with the state-of-the-art variant extraction model tmVar has been achieved.
Collapse
Affiliation(s)
- Dao-Ling Huang
- BGI Research, Shenzhen, 518083, China.
- Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China.
| | - Quanlei Zeng
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yun Xiong
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Shuixia Liu
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Chaoqun Pang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Menglei Xia
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Ting Fang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yanli Ma
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Cuicui Qiang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yi Zhang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yu Zhang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Hong Li
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yuying Yuan
- Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China
| |
Collapse
|
2
|
Asadi-Pooya AA, Malekpour M, Taherifard E, Mallahzadeh A, Farjoud Kouhanjani M. Coexistence of temporal lobe epilepsy and idiopathic generalized epilepsy. Epilepsy Behav 2024; 151:109602. [PMID: 38160579 DOI: 10.1016/j.yebeh.2023.109602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/12/2023] [Revised: 12/07/2023] [Accepted: 12/21/2023] [Indexed: 01/03/2024]
Abstract
OBJECTIVE We investigated the frequency of coexistence of temporal lobe epilepsy (TLE) and idiopathic generalized epilepsy (IGE) in a retrospective database study. We also explored the underlying pathomechanisms of the coexistence of TLE and IGE based on the available information, using bioinformatics tools. METHODS The first phase of the investigation was a retrospective study. All patients with an electro-clinical diagnosis of epilepsy were studied at the outpatient epilepsy clinic at Shiraz University of Medical Sciences, Shiraz, Iran, from 2008 until 2023. In the second phase, we searched the following databases for genetic variations (epilepsy-associated genetic polymorphisms) that are associated with TLE or syndromes of IGE: DisGeNET, genome-wide association study (GWAS) Catalog, epilepsy genetic association database (epiGAD), and UniProt. We also did a separate literature search using PubMed. RESULTS In total, 3760 patients with epilepsy were registered at our clinic; four patients with definitely mixed TLE and IGE were identified; 0.1% of all epilepsies. We could identify that rs1883415 of ALDH5A1, rs137852779 of EFHC1, rs211037 of GABRG2, rs1130183 of KCNJ10, and rs1045642 of ABCB1 genes are shared between TLE and syndromes of IGE. CONCLUSION While coexistence of TLE and IGE is a rare phenomenon, this could be explained by shared genetic variations.
Collapse
Affiliation(s)
- Ali A Asadi-Pooya
- Epilepsy Research Center, Shiraz University of Medical Sciences, Shiraz, Iran; Jefferson Comprehensive Epilepsy Center, Department of Neurology, Thomas Jefferson University, Philadelphia, PA, USA.
| | - Mahdi Malekpour
- Epilepsy Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Ehsan Taherifard
- Epilepsy Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Arashk Mallahzadeh
- Epilepsy Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
| | | |
Collapse
|
3
|
Jin Q, Leaman R, Lu Z. PubMed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine 2024; 100:104988. [PMID: 38306900 PMCID: PMC10850402 DOI: 10.1016/j.ebiom.2024.104988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2023] [Revised: 01/14/2024] [Accepted: 01/15/2024] [Indexed: 02/04/2024] Open
Abstract
Biomedical research yields vast information, much of which is only accessible through the literature. Consequently, literature search is crucial for healthcare and biomedicine. Recent improvements in artificial intelligence (AI) have expanded functionality beyond keywords, but they might be unfamiliar to clinicians and researchers. In response, we present an overview of over 30 literature search tools tailored to common biomedical use cases, aiming at helping readers efficiently fulfill their information needs. We first discuss recent improvements and continued challenges of the widely used PubMed. Then, we describe AI-based literature search tools catering to five specific information needs: 1. Evidence-based medicine. 2. Precision medicine and genomics. 3. Searching by meaning, including questions. 4. Finding related articles with literature recommendation. 5. Discovering hidden associations through literature mining. Finally, we discuss the impacts of recent developments of large language models such as ChatGPT on biomedical information seeking.
Collapse
Affiliation(s)
- Qiao Jin
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA.
| |
Collapse
|
4
|
Wu Z, Feng C, Hu Y, Zhou Y, Li S, Zhang S, Hu Y, Chen Y, Chao H, Ni Q, Chen M. HALD, a human aging and longevity knowledge graph for precision gerontology and geroscience analyses. Sci Data 2023; 10:851. [PMID: 38040715 PMCID: PMC10692171 DOI: 10.1038/s41597-023-02781-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2023] [Accepted: 11/23/2023] [Indexed: 12/03/2023] Open
Abstract
Human aging is a natural and inevitable biological process that leads to an increased risk of aging-related diseases. Developing anti-aging therapies for aging-related diseases requires a comprehensive understanding of the mechanisms and effects of aging and longevity from a multi-modal and multi-faceted perspective. However, most of the relevant knowledge is scattered in the biomedical literature, the volume of which reached 36 million in PubMed. Here, we presented HALD, a text mining-based human aging and longevity dataset of the biomedical knowledge graph from all published literature related to human aging and longevity in PubMed. HALD integrated multiple state-of-the-art natural language processing (NLP) techniques to improve the accuracy and coverage of the knowledge graph for precision gerontology and geroscience analyses. Up to September 2023, HALD had contained 12,227 entities in 10 types (gene, RNA, protein, carbohydrate, lipid, peptide, pharmaceutical preparations, toxin, mutation, and disease), 115,522 relations, 1,855 aging biomarkers, and 525 longevity biomarkers from 339,918 biomedical articles in PubMed. HALD is available at https://bis.zju.edu.cn/hald .
Collapse
Affiliation(s)
- Zexu Wu
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Cong Feng
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- The First Affiliated Hospital, Zhejiang University School of Medicine; Institute of Hematology, Zhejiang University, Hangzhou, 310058, China
| | - Yanshi Hu
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Yincong Zhou
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Joint Research Centre for Engineering Biology, Zhejiang University-University of Edinburgh Institute, Zhejiang University, Haining, 314400, China
| | - Sida Li
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Shilong Zhang
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Yueming Hu
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Yuhao Chen
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Haoyu Chao
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Qingyang Ni
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Ming Chen
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China.
- The First Affiliated Hospital, Zhejiang University School of Medicine; Institute of Hematology, Zhejiang University, Hangzhou, 310058, China.
- Joint Research Centre for Engineering Biology, Zhejiang University-University of Edinburgh Institute, Zhejiang University, Haining, 314400, China.
| |
Collapse
|
5
|
Xu Q, Liu Y, Sun D, Huang X, Li F, Zhai J, Li Y, Zhou Q, Qian N, Niu B. OncoCTMiner: streamlining precision oncology trial matching via molecular profile analysis. Database (Oxford) 2023; 2023:baad077. [PMID: 37935585 PMCID: PMC10630409 DOI: 10.1093/database/baad077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2023] [Revised: 09/08/2023] [Accepted: 10/21/2023] [Indexed: 11/09/2023]
Abstract
By establishing omics sequencing of patient tumors as a crucial element in cancer treatment, the extensive implementation of precision oncology necessitates effective and prompt execution of clinical studies for approving molecular-targeted therapies. However, the substantial volume of patient sequencing data, combined with strict clinical trial criteria, increasingly complicates the process of matching patients to precision oncology studies. To streamline enrollment in these studies, we developed OncoCTMiner, an automated pre-screening platform for molecular cancer clinical trials. Through manual tagging of eligibility criteria for 2227 oncology trials, we identified key bio-concepts such as cancer types, genes, alterations, drugs, biomarkers and therapies. Utilizing this manually annotated corpus along with open-source biomedical natural language processing tools, we trained multiple named entity recognition models specifically designed for precision oncology trials. These models analyzed 460 952 clinical trials, revealing 8.15 million precision medicine concepts, 9.32 million entity-criteria-trial triplets and a comprehensive precision oncology eligibility criteria database. Most significantly, we developed a patient-trial matching system based on cancer patients' clinical and genetic profiles, which can seamlessly integrate with the omics data analysis platform. This system expedites the pre-screening process for potentially suitable precision oncology trials, offering patients swifter access to promising treatment options. Database URL https://oncoctminer.chosenmedinfo.com.
Collapse
Affiliation(s)
- Quan Xu
- Department of Bioinformatics, Beijing ChosenMed Clinical Laboratory Co. Ltd., Jinghai Industrial Park, 156 Jinghai 4th Road, Economic and Technological Development Area, Beijing 100176, China
- Research and Development Center, ChosenMed Technology (Zhejiang) Co. Ltd., Room 101, Building 8, Jincheng International Science and Technology City, No. 26 Zhenxing East Road, Linping District, Hangzhou, 311103, China
| | - Yueyue Liu
- Department of Bioinformatics, Beijing ChosenMed Clinical Laboratory Co. Ltd., Jinghai Industrial Park, 156 Jinghai 4th Road, Economic and Technological Development Area, Beijing 100176, China
| | - Dawei Sun
- Department of Bioinformatics, Beijing ChosenMed Clinical Laboratory Co. Ltd., Jinghai Industrial Park, 156 Jinghai 4th Road, Economic and Technological Development Area, Beijing 100176, China
- Research and Development Center, ChosenMed Technology (Zhejiang) Co. Ltd., Room 101, Building 8, Jincheng International Science and Technology City, No. 26 Zhenxing East Road, Linping District, Hangzhou, 311103, China
| | - Xiaoqian Huang
- Department of Bioinformatics, Beijing ChosenMed Clinical Laboratory Co. Ltd., Jinghai Industrial Park, 156 Jinghai 4th Road, Economic and Technological Development Area, Beijing 100176, China
| | - Feihong Li
- Department of Bioinformatics, Beijing ChosenMed Clinical Laboratory Co. Ltd., Jinghai Industrial Park, 156 Jinghai 4th Road, Economic and Technological Development Area, Beijing 100176, China
| | - JinCheng Zhai
- Department of Bioinformatics, Beijing ChosenMed Clinical Laboratory Co. Ltd., Jinghai Industrial Park, 156 Jinghai 4th Road, Economic and Technological Development Area, Beijing 100176, China
| | - Yang Li
- Beijing International Center for Mathematical Research, Peking University, No. 5 Yiheyuan Road Haidian District, Beijing 100871, China
- Chongqing Research Institute of Big Data, Peking University, Chongqing 401333, China
| | - Qiming Zhou
- Department of Bioinformatics, Beijing ChosenMed Clinical Laboratory Co. Ltd., Jinghai Industrial Park, 156 Jinghai 4th Road, Economic and Technological Development Area, Beijing 100176, China
- Research and Development Center, ChosenMed Technology (Zhejiang) Co. Ltd., Room 101, Building 8, Jincheng International Science and Technology City, No. 26 Zhenxing East Road, Linping District, Hangzhou, 311103, China
| | - Niansong Qian
- Department of Oncology, Senior Department of Respiratory and Critical Care Medicine, The Eighth Medical Center of Chinese PLA General Hospital, No.17 A Heishanhu Road, Haidian District, Beijing 100853, China
| | - Beifang Niu
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100190, China
| |
Collapse
|
6
|
Hussain M, Muhammad K, Khan M, Din AU. A Novel CRYBB2 Silent Variant in Autosomal Dominant Congenital Cataracts (ADCC) in Pakistani families. Pak J Med Sci 2023; 39:1399-1405. [PMID: 37680813 PMCID: PMC10480720 DOI: 10.12669/pjms.39.5.7061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2022] [Revised: 10/30/2022] [Accepted: 06/18/2023] [Indexed: 09/09/2023] Open
Abstract
Objective Congenital Cataract is a type of ophthalmic genetic disorder that appears at birth or in early childhood. Among 30 genes, CRYBB2 is one of the most common and a water-soluble protein of lens's that code for the βB2-crystallin. This study aimed to investigate the novel silent mutation in CRYBB2 of exon six in the Pakistani families of Autosomal Dominant Congenital Cataracts (ADCC). Methods It is a family-based study that presents three to five-generations of two Pakistani families. Data and blood samples from the families were collected from January to August 2019 from LRBT (Layton Rahmatullah Benevolent Trust) Hospital, Mansehra, Pakistan. We only included patients >15 years old. Before enrollment in the current study, each patient obtained a thorough optical examination. Samples were moved to the molecular lab using the collection and storage method. The phenol-chloroform technique was used to extract the DNA. The technique of Sanger sequencing was used to find any potential mutation in some of the selected families. Statistical and bioinformatics analysis were carried out. Results By using bioinformatics tools, the novel silent mutation was identified. Heterozygous silent mutation of CRYBB2 of exon 6 (c. 495G>A) was detected by the alignment of sequences. Computational prediction program did not predict the silent mutation. Conclusion This study investigated a novel important sequence variant in the beta-crystalline protein that causes autosomal dominant congenital cataract (ADCC) in Pakistani families. Thus, our study enlarges the CRYBB2 mutation spectrum and associated phenotypes to help clinical diagnosis of human genetic diseases.
Collapse
Affiliation(s)
- Maryam Hussain
- Maryam Hussain, M.Phil. Department of Biotechnology and Genetic Engineering, Hazara University Mansehra, 21120, Khyber Pakhtunkhwa, Pakistan
| | - Khushi Muhammad
- Khushi Muhammad, PhD. Associate Professor, Department of Life Science, Imperial College London, Sir Alex Fleming Building South, Kensington Campus London, SW7 2AZ, United Kingdom
| | - Muhammad Khan
- Muhammad Khan, PhD. Assistant Professor, Department of Biotechnology and Genetic Engineering, Hazara University Mansehra, 21120, Khyber Pakhtunkhwa, Pakistan
| | - Aziz Ud Din
- Aziz Ud Din, PhD. Assistant Professor, Department of Biotechnology and Genetic Engineering, Hazara University Mansehra, 21120, Khyber Pakhtunkhwa, Pakistan
| |
Collapse
|
7
|
Pu Y, Beck D, Verspoor K. Graph embedding-based link prediction for literature-based discovery in Alzheimer's Disease. J Biomed Inform 2023; 145:104464. [PMID: 37541406 DOI: 10.1016/j.jbi.2023.104464] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2023] [Revised: 07/29/2023] [Accepted: 07/30/2023] [Indexed: 08/06/2023]
Abstract
OBJECTIVE We explore the framing of literature-based discovery (LBD) as link prediction and graph embedding learning, with Alzheimer's Disease (AD) as our focus disease context. The key link prediction setting of prediction window length is specifically examined in the context of a time-sliced evaluation methodology. METHODS We propose a four-stage approach to explore literature-based discovery for Alzheimer's Disease, creating and analyzing a knowledge graph tailored to the AD context, and predicting and evaluating new knowledge based on time-sliced link prediction. The first stage is to collect an AD-specific corpus. The second stage involves constructing an AD knowledge graph with identified AD-specific concepts and relations from the corpus. In the third stage, 20 pairs of training and testing datasets are constructed with the time-slicing methodology. Finally, we infer new knowledge with graph embedding-based link prediction methods. We compare different link prediction methods in this context. The impact of limiting prediction evaluation of LBD models in the context of short-term and longer-term knowledge evolution for Alzheimer's Disease is assessed. RESULTS We constructed an AD corpus of over 16 k papers published in 1977-2021, and automatically annotated it with concepts and relations covering 11 AD-specific semantic entity types. The knowledge graph of Alzheimer's Disease derived from this resource consisted of ∼11 k nodes and ∼394 k edges, among which 34% were genotype-phenotype relationships, 57% were genotype-genotype relationships, and 9% were phenotype-phenotype relationships. A Structural Deep Network Embedding (SDNE) model consistently showed the best performance in terms of returning the most confident set of link predictions as time progresses over 20 years. A huge improvement in model performance was observed when changing the link prediction evaluation setting to consider a more distant future, reflecting the time required for knowledge accumulation. CONCLUSION Neural network graph-embedding link prediction methods show promise for the literature-based discovery context, although the prediction setting is extremely challenging, with graph densities of less than 1%. Varying prediction window length on the time-sliced evaluation methodology leads to hugely different results and interpretations of LBD studies. Our approach can be generalized to enable knowledge discovery for other diseases. AVAILABILITY Code, AD ontology, and data are available at https://github.com/READ-BioMed/readbiomed-lbd.
Collapse
Affiliation(s)
- Yiyuan Pu
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia.
| | - Daniel Beck
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia.
| | - Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia; School of Computing Technologies, RMIT University, Melbourne, Victoria, Australia.
| |
Collapse
|
8
|
Sun Z, Tao C. Named Entity Recognition and Normalization for Alzheimer's Disease Eligibility Criteria. IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS. IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS 2023; 2023:558-564. [PMID: 38283164 PMCID: PMC10815931 DOI: 10.1109/ichi57859.2023.00100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/30/2024]
Abstract
Alzheimer's Disease (AD) is a complex neurodegenerative disorder that affects millions of people worldwide. Finding effective treatments for this disease is crucial. Clinical trials play an essential role in developing and testing new treatments for AD. However, identifying eligible participants can be challenging, time-consuming, and costly. In recent years, the development of natural language processing (NLP) techniques, specifically named entity recognition (NER) and named entity normalization (NEN), have helped to automate the identification and extraction of relevant information from the eligibility criteria (EC) more efficiently, in order to facilitate semi-automatic patient recruitment and enable data FAIRness for clinical trial data. Nevertheless, most current biomedical NER models only provide annotations for a restricted set of entity types that may not be applicable to the clinical trial data. Additionally, accurately performing NEN on entities that are negated using a negative prefix currently lacks established techniques. In this paper, we introduce a pipeline designed for information extraction from AD clinical trial EC, which involves preprocessing of the EC data, clinical NER, and biomedical NEN to Unified Medical Language System (UMLS). Our NER model can identify named entities in seven pre-defined categories, while our NEN model employs a combination of exact match and partial match search strategies, as well as customized rules to accurately normalize entities with negative prefixes. To evaluate the performance of our pipeline, we measured the precision, recall, and F1 score for the NER component, and we manually reviewed the top five mapping results produced by the NEN component. Our evaluation of the pipeline's performance revealed that it can successfully normalize named entities in clinical trial ECs with optimal accuracies. The NER component achieved a overall F1 of 0.816, demonstrating its ability to accurately identify seven types of named entities in clinical text. The NEN component of the pipeline also demonstrated impressive performance, with customized rules and a combination of exact and partial match strategies leading to an accuracy of 0.940 for normalized entities.
Collapse
Affiliation(s)
- Zenan Sun
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas
| | - Cui Tao
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas
| |
Collapse
|
9
|
Nicholson DN, Alquaddoomi F, Rubinetti V, Greene CS. Changing word meanings in biomedical literature reveal pandemics and new technologies. BioData Min 2023; 16:16. [PMID: 37147665 PMCID: PMC10161184 DOI: 10.1186/s13040-023-00332-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2023] [Accepted: 04/24/2023] [Indexed: 05/07/2023] Open
Abstract
While we often think of words as having a fixed meaning that we use to describe a changing world, words are also dynamic and changing. Scientific research can also be remarkably fast-moving, with new concepts or approaches rapidly gaining mind share. We examined scientific writing, both preprint and pre-publication peer-reviewed text, to identify terms that have changed and examine their use. One particular challenge that we faced was that the shift from closed to open access publishing meant that the size of available corpora changed by over an order of magnitude in the last two decades. We developed an approach to evaluate semantic shift by accounting for both intra- and inter-year variability using multiple integrated models. This analysis revealed thousands of change points in both corpora, including for terms such as 'cas9', 'pandemic', and 'sars'. We found that the consistent change-points between pre-publication peer-reviewed and preprinted text are largely related to the COVID-19 pandemic. We also created a web app for exploration that allows users to investigate individual terms ( https://greenelab.github.io/word-lapse/ ). To our knowledge, our research is the first to examine semantic shift in biomedical preprints and pre-publication peer-reviewed text, and provides a foundation for future work to understand how terms acquire new meanings and how peer review affects this process.
Collapse
Affiliation(s)
- David N Nicholson
- Genomics and Computational Biology Program, University of Pennsylvania, Philadelpia, PA, USA
| | - Faisal Alquaddoomi
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
- Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, CO, USA
| | - Vincent Rubinetti
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
- Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, CO, USA
| | - Casey S Greene
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA.
- Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, CO, USA.
| |
Collapse
|
10
|
Malekpour M, Salarikia SR, Kashkooli M, Asadi-Pooya AA. The genetic link between systemic autoimmune disorders and temporal lobe epilepsy: A bioinformatics study. Epilepsia Open 2023. [PMID: 36929812 DOI: 10.1002/epi4.12727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2022] [Accepted: 03/11/2023] [Indexed: 03/18/2023] Open
Abstract
OBJECTIVE We aimed to explore the underlying pathomechanisms of the comorbidity between three common systemic autoimmune disorders (SADs) [i.e., insulin-dependent diabetes mellitus (IDDM), systemic lupus erythematosus (SLE), and rheumatoid arthritis (RA)] and temporal lobe epilepsy (TLE), using bioinformatics tools. We hypothesized that there are shared genetic variations among these four conditions. METHODS Different databases (DisGeNET, Harmonizome, and Enrichr) were searched to find TLE-associated genes with variants; their single nucleotide polymorphisms (SNPs) were gathered from the literature. We also did a separate literature search using PubMed with the following keywords for original articles: "TLE" or "Temporal lobe epilepsy" AND "genetic variation," "single nucleotide polymorphism," "SNP," or "genetic polymorphism." In the next step, the SNPs associated with TLE were searched in the LitVar database to find the shared gene variations with RA, SLE, and IDDM. RESULTS Ninety unique SNPs were identified to be associated with TLE. LitVar search identified two SNPs that were shared between TLE and all three SADs (i.e., IDDM, SLE, and RA). The first SNP was rs16944 on the Interleukin-1β (IL-1β) gene. The second genetic variation was ε4 variation of apolipoprotein E (APOE) gene. SIGNIFICANCE The shared genetic variations (i.e., rs16944 on the IL-1β gene and ε4 variation of the APOE gene) may explain the underlying pathomechanisms of the comorbidity between three common SADs (i.e., IDDM, SLE, and RA) and TLE. Exploring such shared genetic variations may help find targeted therapies for patients with TLE, especially those with drug-resistant seizures who also have comorbid SADs.
Collapse
Affiliation(s)
- Mahdi Malekpour
- Epilepsy Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
| | | | - Mohammad Kashkooli
- Epilepsy Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Ali A Asadi-Pooya
- Epilepsy Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
- Department of Neurology, Jefferson Comprehensive Epilepsy Center, Thomas Jefferson University, Philadelphia, Pennsylvania, USA
| |
Collapse
|
11
|
Nicholson DN, Himmelstein DS, Greene CS. Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts. BioData Min 2022; 15:26. [PMID: 36258252 PMCID: PMC9578183 DOI: 10.1186/s13040-022-00311-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2022] [Accepted: 09/17/2022] [Indexed: 02/04/2023] Open
Abstract
BACKGROUND Knowledge graphs support biomedical research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an exponentially rising publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to annotate textual data automatically. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This bottleneck makes populating a knowledge graph with multiple nodes and edge types practically infeasible. Thus, we sought to accelerate the label function creation process by evaluating how label functions can be re-used across multiple edge types. RESULTS We obtained entity-tagged abstracts and subsetted these entities to only contain compounds, genes, and disease mentions. We extracted sentences containing co-mentions of certain biomedical entities contained in a previously described knowledge graph, Hetionet v1. We trained a baseline model that used database-only label functions and then used a sampling approach to measure how well adding edge-specific or edge-mismatch label function combinations improved over our baseline. Next, we trained a discriminator model to detect sentences that indicated a biomedical relationship and then estimated the number of edge types that could be recalled and added to Hetionet v1. We found that adding edge-mismatch label functions rarely improved relationship extraction, while control edge-specific label functions did. There were two exceptions to this trend, Compound-binds-Gene and Gene-interacts-Gene, which both indicated physical relationships and showed signs of transferability. Across the scenarios tested, discriminative model performance strongly depends on generated annotations. Using the best discriminative model for each edge type, we recalled close to 30% of established edges within Hetionet v1. CONCLUSIONS Our results show that this framework can incorporate novel edges into our source knowledge graph. However, results with label function transfer were mixed. Only label functions describing very similar edge types supported improved performance when transferred. We expect that the continued development of this strategy may provide essential building blocks to populating biomedical knowledge graphs with discoveries, ensuring that these resources include cutting-edge results.
Collapse
Affiliation(s)
- David N. Nicholson
- grid.25879.310000 0004 1936 8972Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA USA
| | - Daniel S. Himmelstein
- grid.25879.310000 0004 1936 8972Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA USA
| | - Casey S. Greene
- grid.430503.10000 0001 0703 675XDepartment of Biomedical Informatics, University of Colorado School of Medicine and Center for Health Artificial Intellegence (CHAI), University of Colorado School of Medicine, Aurora, USA
| |
Collapse
|
12
|
Luo L, Lai PT, Wei CH, Arighi CN, Lu Z. BioRED: a rich biomedical relation extraction dataset. Brief Bioinform 2022; 23:6645993. [PMID: 35849818 PMCID: PMC9487702 DOI: 10.1093/bib/bbac282] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2022] [Revised: 06/02/2022] [Accepted: 06/19/2022] [Indexed: 11/13/2022] Open
Abstract
Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.
Collapse
Affiliation(s)
- Ling Luo
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | | | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| |
Collapse
|
13
|
Wei CH, Allot A, Riehle K, Milosavljevic A, Lu Z. tmVar 3.0: an improved variant concept recognition and normalization tool. Bioinformatics 2022; 38:4449-4451. [PMID: 35904569 PMCID: PMC9477515 DOI: 10.1093/bioinformatics/btac537] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2022] [Revised: 07/07/2022] [Accepted: 07/27/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Previous studies have shown that automated text-mining tools are becoming increasingly important for successfully unlocking variant information in scientific literature at large scale. Despite multiple attempts in the past, existing tools are still of limited recognition scope and precision. RESULT We propose tmVar 3.0: an improved variant recognition and normalization system. Compared to its predecessors, tmVar 3.0 recognizes a wider spectrum of variant-related entities (e.g. allele and copy number variants), and groups together different variant mentions belonging to the same genomic sequence position in an article for improved accuracy. Moreover, tmVar 3.0 provides advanced variant normalization options such as allele-specific identifiers from the ClinGen Allele Registry. tmVar 3.0 exhibits state-of-the-art performance with over 90% in F-measure for variant recognition and normalization, when evaluated on three independent benchmarking datasets. tmVar 3.0 as well as annotations for the entire PubMed and PMC datasets are freely available for download. AVAILABILITY AND IMPLEMENTATION https://github.com/ncbi/tmVar3.
Collapse
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Kevin Riehle
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | | | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| |
Collapse
|
14
|
Xu Q, Liu Y, Hu J, Duan X, Song N, Zhou J, Zhai J, Su J, Liu S, Chen F, Zheng W, Guo Z, Li H, Zhou Q, Niu B. OncoPubMiner: a platform for mining oncology publications. Brief Bioinform 2022; 23:6691792. [PMID: 36058206 DOI: 10.1093/bib/bbac383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2022] [Revised: 08/08/2022] [Accepted: 08/09/2022] [Indexed: 11/12/2022] Open
Abstract
Updated and expert-quality knowledge bases are fundamental to biomedical research. A knowledge base established with human participation and subject to multiple inspections is needed to support clinical decision making, especially in the growing field of precision oncology. The number of original publications in this field has risen dramatically with the advances in technology and the evolution of in-depth research. Consequently, the issue of how to gather and mine these articles accurately and efficiently now requires close consideration. In this study, we present OncoPubMiner (https://oncopubminer.chosenmedinfo.com), a free and powerful system that combines text mining, data structure customisation, publication search with online reading and project-centred and team-based data collection to form a one-stop 'keyword in-knowledge out' oncology publication mining platform. The platform was constructed by integrating all open-access abstracts from PubMed and full-text articles from PubMed Central, and it is updated daily. OncoPubMiner makes obtaining precision oncology knowledge from scientific articles straightforward and will assist researchers in efficiently developing structured knowledge base systems and bring us closer to achieving precision oncology goals.
Collapse
Affiliation(s)
- Quan Xu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Yueyue Liu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China.,ChosenMed Gene Technology Co. Ltd., Nanjing, China
| | - Jifang Hu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China.,Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China.,University of Chinese Academy of Sciences, Beijing 100190, China
| | - Xiaohong Duan
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China.,ChosenMed Gene Technology Co. Ltd., Nanjing, China
| | - Niuben Song
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Jiale Zhou
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Jincheng Zhai
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Junyan Su
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Siyao Liu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Fan Chen
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China.,ChosenMed Gene Technology Co. Ltd., Nanjing, China
| | - Wei Zheng
- The Department of Nephrology and Hypertension Medicine, Beijing Electric Power Hospital, Beijing 100073, China
| | - Zhongjia Guo
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Hexiang Li
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Qiming Zhou
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China.,ChosenMed Gene Technology Co. Ltd., Nanjing, China
| | - Beifang Niu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China.,Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China.,University of Chinese Academy of Sciences, Beijing 100190, China
| |
Collapse
|
15
|
Sung M, Jeong M, Choi Y, Kim D, Lee J, Kang J. BERN2: an advanced neural biomedical named entity recognition and normalization tool. Bioinformatics 2022; 38:4837-4839. [PMID: 36053172 PMCID: PMC9563680 DOI: 10.1093/bioinformatics/btac598] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2022] [Revised: 07/09/2022] [Accepted: 08/31/2022] [Indexed: 11/14/2022] Open
Abstract
In biomedical natural language processing, named entity recognition (NER) and named entity normalization (NEN) are key tasks that enable the automatic extraction of biomedical entities (e.g. diseases and drugs) from the ever-growing biomedical literature. In this article, we present BERN2 (Advanced Biomedical Entity Recognition and Normalization), a tool that improves the previous neural network-based NER tool by employing a multi-task NER model and neural network-based NEN models to achieve much faster and more accurate inference. We hope that our tool can help annotate large-scale biomedical texts for various tasks such as biomedical knowledge graph construction. Availability and implementation Web service of BERN2 is publicly available at http://bern2.korea.ac.kr. We also provide local installation of BERN2 at https://github.com/dmis-lab/BERN2. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Mujeen Sung
- Department of Computer Science and Engineering, Korea University, Seoul, 02841, Republic of Korea
| | - Minbyul Jeong
- Department of Computer Science and Engineering, Korea University, Seoul, 02841, Republic of Korea
| | - Yonghwa Choi
- Department of Computer Science and Engineering, Korea University, Seoul, 02841, Republic of Korea
| | - Donghyeon Kim
- AIRS Company, Hyundai Motor Group, Seoul, 06620, Republic of Korea
| | - Jinhyuk Lee
- Department of Computer Science and Engineering, Korea University, Seoul, 02841, Republic of Korea
| | - Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul, 02841, Republic of Korea.,AIGEN Sciences, Seoul, 04778, Republic of Korea
| |
Collapse
|
16
|
Goto A, Rodriguez-Esteban R, Scharf SH, Morris GM. Understanding the genetics of viral drug resistance by integrating clinical data and mining of the scientific literature. Sci Rep 2022; 12:14476. [PMID: 36008431 PMCID: PMC9403226 DOI: 10.1038/s41598-022-17746-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2021] [Accepted: 07/30/2022] [Indexed: 11/16/2022] Open
Abstract
Drug resistance caused by mutations is a public health threat for existing and emerging viral diseases. A wealth of evidence about these mutations and their clinically associated phenotypes is scattered across the literature, but a comprehensive perspective is usually lacking. This work aimed to produce a clinically relevant view for the case of Hepatitis B virus (HBV) mutations by combining a chronic HBV clinical study with a compendium of genetic mutations systematically gathered from the scientific literature. We enriched clinical mutation data by systematically mining 2,472,725 scientific articles from PubMed Central in order to gather information about the HBV mutational landscape. By performing this analysis, we were able to identify mutational hotspots for each HBV genotype (A-E) and gene (C, X, P, S), as well as the location of disulfide bonds associated with these mutations. Through a modelling study, we also identified a mutation position common in both the clinical data and the literature that is located at the binding pocket for a known anti-HBV drug, namely entecavir. The results of this novel approach show the potential of integrated analyses to assist in the development of new drugs for viral diseases that are more robust to resistance. Such analyses should be of particular interest due to the increasing importance of viral resistance in established and emerging viruses, such as for newly developed drugs against SARS-CoV-2.
Collapse
Affiliation(s)
- An Goto
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, UK
| | | | | | - Garrett M Morris
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, UK.
| |
Collapse
|
17
|
Lin PC, Tsai YS, Yeh YM, Shen MR. Cutting-Edge AI Technologies Meet Precision Medicine to Improve Cancer Care. Biomolecules 2022; 12:1133. [PMID: 36009026 PMCID: PMC9405970 DOI: 10.3390/biom12081133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2022] [Revised: 08/11/2022] [Accepted: 08/15/2022] [Indexed: 11/18/2022] Open
Abstract
To provide precision medicine for better cancer care, researchers must work on clinical patient data, such as electronic medical records, physiological measurements, biochemistry, computerized tomography scans, digital pathology, and the genetic landscape of cancer tissue. To interpret big biodata in cancer genomics, an operational flow based on artificial intelligence (AI) models and medical management platforms with high-performance computing must be set up for precision cancer genomics in clinical practice. To work in the fast-evolving fields of patient care, clinical diagnostics, and therapeutic services, clinicians must understand the fundamentals of the AI tool approach. Therefore, the present article covers the following four themes: (i) computational prediction of pathogenic variants of cancer susceptibility genes; (ii) AI model for mutational analysis; (iii) single-cell genomics and computational biology; (iv) text mining for identifying gene targets in cancer; and (v) the NVIDIA graphics processing units, DRAGEN field programmable gate arrays systems and AI medical cloud platforms in clinical next-generation sequencing laboratories. Based on AI medical platforms and visualization, large amounts of clinical biodata can be rapidly copied and understood using an AI pipeline. The use of innovative AI technologies can deliver more accurate and rapid cancer therapy targets.
Collapse
Affiliation(s)
- Peng-Chan Lin
- Department of Oncology, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan 704, Taiwan
- Department of Genomic Medicine, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan 704, Taiwan
| | - Yi-Shan Tsai
- Department of Medical Imaging, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan 704, Taiwan
| | - Yu-Min Yeh
- Department of Oncology, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan 704, Taiwan
| | - Meng-Ru Shen
- Institute of Clinical Medicine, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan 704, Taiwan
- Department of Obstetrics and Gynecology, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan 704, Taiwan
- Department of Pharmacology, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan 704, Taiwan
| |
Collapse
|
18
|
Wang C, Yu P, Hu L, Liang M, Mao Y, Zeng Q, Wang X, Huang K, Yan J, Xie L, Zhang F, Zhu F. Prevalence and prognosis of molecularly defined familial hypercholesterolemia in patients with acute coronary syndrome. Front Cardiovasc Med 2022; 9:921803. [PMID: 35966514 PMCID: PMC9363594 DOI: 10.3389/fcvm.2022.921803] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2022] [Accepted: 07/06/2022] [Indexed: 11/13/2022] Open
Abstract
Background Familial hypercholesterolemia (FH) can elevate serum low-density lipoprotein cholesterol (LDL-C) levels, which can promote the progression of acute coronary syndrome (ACS). However, the effect of FH on the prognosis of ACS remains unclear. Methods In this prospective cohort study, 223 patients with ACS having LDL-C ≥ 135.3 mg/dL (3.5 mmol/L) were enrolled and screened for FH using a multiple-gene FH panel. The diagnosis of FH was defined according to the ACMG/AMP criteria as carrying pathogenic or likely pathogenic variants. The clinical features of FH and the relationship of FH to the average 16.6-month risk of cardiovascular events (CVEs) were assessed. Results The prevalence of molecularly defined FH in enrolled patients was 26.9%, and coronary artery lesions were more severe in patients with FH than in those without (Gensini score 66.0 vs. 28.0, respectively; P < 0.001). After lipid lowering, patients with FH still had significantly higher LDL-C levels at their last visit (73.5 ± 25.9 mg/dL vs. 84.7 ± 37.1 mg/dL; P = 0.013) compared with those without. FH increased the incidence of CVEs in patients with ACS [hazard ratio (HR): 3.058; 95% confidence interval (CI): 1.585–5.900; log-rank P < 0.001]. Conclusion FH is associated with an increased risk of CVEs in ACS and is an independent risk factor for ACS. This study highlights the importance of genetic testing of FH-related gene mutations in patients with ACS.
Collapse
Affiliation(s)
- Cheng Wang
- Department of Cardiology, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
- Clinic Center of Human Gene Research, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
| | - Puliang Yu
- Wuhan University of Science and Technology, Wuhan, China
| | - Lizhi Hu
- Clinic Center of Human Gene Research, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
| | - Minglu Liang
- Clinic Center of Human Gene Research, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
| | - Yi Mao
- Department of Cardiology, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
| | - Qiutang Zeng
- Department of Cardiology, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
| | - Xiang Wang
- Department of Cardiology, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
| | - Kai Huang
- Department of Cardiology, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
- Clinic Center of Human Gene Research, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
| | - Jin Yan
- Department of Clinical Laboratory, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
| | - Li Xie
- Clinical Research Institute, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Fengxiao Zhang
- Department of Cardiology, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
- Clinic Center of Human Gene Research, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
- *Correspondence: Fengxiao Zhang,
| | - Feng Zhu
- Department of Cardiology, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
- Clinic Center of Human Gene Research, Tongji Medical College, Union Hospital, Huazhong University of Science and Technology, Wuhan, China
- Feng Zhu
| |
Collapse
|
19
|
Garda S, Lenihan-Geels F, Proft S, Hochmuth S, Schülke M, Seelow D, Leser U. RegEl corpus: identifying DNA regulatory elements in the scientific literature. Database (Oxford) 2022; 2022:6618549. [PMID: 35758881 PMCID: PMC9235371 DOI: 10.1093/database/baac043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2022] [Revised: 05/25/2022] [Accepted: 06/02/2022] [Indexed: 11/17/2022]
Abstract
High-throughput technologies led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)automatically accessed. Their development, however, is dependent on the availability of annotated corpora. Therefore, we introduce RegEl (Regulatory Elements), the first freely available corpus annotated with regulatory DNA elements comprising 305 PubMed abstracts for a total of 2690 sentences. We focus on enhancers, promoters and transcription factor binding sites. Three annotators worked in two stages, achieving an overall 0.73 F1 inter-annotator agreement and 0.46 for regulatory elements. Depending on the entity type, IE baselines reach F1-scores of 0.48–0.91 for entity detection and 0.71–0.88 for entity normalization. Next, we apply our entity detection models to the entire PubMed collection and extract co-occurrences of genes or diseases with regulatory elements. This generates large collections of regulatory elements associated with 137 870 unique genes and 7420 diseases, which we make openly available. Database URL: https://zenodo.org/record/6418451#.YqcLHvexVqg
Collapse
Affiliation(s)
- Samuele Garda
- Humboldt-Universitält zu Berlin Computer Science, , Rudower Chaussee 25, 12489, Berlin, Germany
| | - Freyda Lenihan-Geels
- Charité-Universitätsmedizin Berlin Klinik für Pädiatrie m.S. Neurologie, , Augustenburger Platz 1, 13353, Berlin, Germany
| | - Sebastian Proft
- Berlin Institute of Health at Charité-Universitätsmedizin Berlin Bioinformatics and Translational Genetics, , Anna-Louisa-Karsch-Straße 2, 10178, Berlin, Germany
- Charité-Universitätsmedizin Berlin Institut für Medizinische Genetik und Humangenetik, , Augustenburger Platz 1, 13353, Berlin, Germany
| | - Stefanie Hochmuth
- Charité-Universitätsmedizin Berlin Klinik für Pädiatrie m.S. Neurologie, , Augustenburger Platz 1, 13353, Berlin, Germany
| | - Markus Schülke
- Charité-Universitätsmedizin Berlin Klinik für Pädiatrie m.S. Neurologie, , Augustenburger Platz 1, 13353, Berlin, Germany
| | - Dominik Seelow
- Berlin Institute of Health at Charité-Universitätsmedizin Berlin Bioinformatics and Translational Genetics, , Anna-Louisa-Karsch-Straße 2, 10178, Berlin, Germany
| | - Ulf Leser
- Humboldt-Universitält zu Berlin Computer Science, , Rudower Chaussee 25, 12489, Berlin, Germany
| |
Collapse
|
20
|
Wilhelm K, Edick MJ, Berry SA, Hartnett M, Brower A. Using Long-Term Follow-Up Data to Classify Genetic Variants in Newborn Screened Conditions. Front Genet 2022; 13:859837. [PMID: 35692825 PMCID: PMC9178101 DOI: 10.3389/fgene.2022.859837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2022] [Accepted: 04/28/2022] [Indexed: 11/16/2022] Open
Abstract
With the rapid increase in publicly available sequencing data, healthcare professionals are tasked with understanding how genetic variation informs diagnosis and affects patient health outcomes. Understanding the impact of a genetic variant in disease could be used to predict susceptibility/protection and to help build a personalized medicine profile. In the United States, over 3.8 million newborns are screened for several rare genetic diseases each year, and the follow-up testing of screen-positive newborns often involves sequencing and the identification of variants. This presents the opportunity to use longitudinal health information from these newborns to inform the impact of variants identified in the course of diagnosis. To test this, we performed secondary analysis of a 10-year natural history study of individuals diagnosed with metabolic disorders included in newborn screening (NBS). We found 564 genetic variants with accompanying phenotypic data and identified that 161 of the 564 variants (29%) were not included in ClinVar. We were able to classify 139 of the 161 variants (86%) as pathogenic or likely pathogenic. This work demonstrates that secondary analysis of longitudinal data collected as part of NBS finds unreported genetic variants and the accompanying clinical information can inform the relationship between genotype and phenotype.
Collapse
Affiliation(s)
- Kevin Wilhelm
- Newborn Screening Translational Research Network, American College of Medical Genetics and Genomics, Bethesda, MD, United States
- Graduate Program in Genetics and Genomics, Graduate School of Biological Sciences, Baylor College of Medicine, Houston, TX, United States
| | | | - Susan A. Berry
- Department of Pediatrics, Division of Genetics and Metabolism, University of Minnesota, Minneapolis, MN, United States
| | - Michael Hartnett
- Newborn Screening Translational Research Network, American College of Medical Genetics and Genomics, Bethesda, MD, United States
| | - Amy Brower
- Newborn Screening Translational Research Network, American College of Medical Genetics and Genomics, Bethesda, MD, United States
| |
Collapse
|
21
|
Li PH, Chen TF, Yu JY, Shih SH, Su CH, Lin YH, Tsai HK, Juan HF, Chen CY, Huang JH. pubmedKB: an interactive web server for exploring biomedical entity relations in the biomedical literature. Nucleic Acids Res 2022; 50:W616-W622. [PMID: 35536289 PMCID: PMC9252824 DOI: 10.1093/nar/gkac310] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2022] [Revised: 04/06/2022] [Accepted: 04/18/2022] [Indexed: 11/15/2022] Open
Abstract
With the proliferation of genomic sequence data for biomedical research, the exploration of human genetic information by domain experts requires a comprehensive interrogation of large numbers of scientific publications in PubMed. However, a query in PubMed essentially provides search results sorted only by the date of publication. A search engine for retrieving and interpreting complex relations between biomedical concepts in scientific publications remains lacking. Here, we present pubmedKB, a web server designed to extract and visualize semantic relationships between four biomedical entity types: variants, genes, diseases, and chemicals. pubmedKB uses state-of-the-art natural language processing techniques to extract semantic relations from the large number of PubMed abstracts. Currently, over 2 million semantic relations between biomedical entity pairs are extracted from over 33 million PubMed abstracts in pubmedKB. pubmedKB has a user-friendly interface with an interactive semantic graph, enabling the user to easily query entities and explore entity relations. Supporting sentences with the highlighted snippets allow to easily navigate the publications. Combined with a new explorative approach to literature mining and an interactive interface for researchers, pubmedKB thus enables rapid, intelligent searching of the large biomedical literature to provide useful knowledge and insights. pubmedKB is available at https://www.pubmedkb.cc/.
Collapse
Affiliation(s)
| | | | | | | | | | | | - Huai-Kuang Tsai
- Taiwan AI Labs, Taipei 10351, Taiwan.,Institute of Information Science, Academia Sinica, Taipei, 11529, Taiwan
| | - Hsueh-Fen Juan
- Taiwan AI Labs, Taipei 10351, Taiwan.,Department of Life Science, National Taiwan University, Taipei 10617, Taiwan.,Center for Computational and Systems Biology, National Taiwan University, Taipei 10617, Taiwan
| | - Chien-Yu Chen
- Taiwan AI Labs, Taipei 10351, Taiwan.,Center for Computational and Systems Biology, National Taiwan University, Taipei 10617, Taiwan.,Department of Biomechatronics Engineering, National Taiwan University, Taipei, 10617, Taiwan
| | | |
Collapse
|
22
|
Wang X, Zhu X, Peng L, Zhao Y. Identification of lncRNA Biomarkers and LINC01198 Promotes Progression of Chronic Rhinosinusitis with Nasal Polyps through Sponge miR-6776-5p. BIOMED RESEARCH INTERNATIONAL 2022; 2022:9469207. [PMID: 35572732 PMCID: PMC9106458 DOI: 10.1155/2022/9469207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/17/2021] [Revised: 01/06/2022] [Accepted: 01/18/2022] [Indexed: 11/17/2022]
Abstract
Background Chronic sinusitis (CRS) was a chronic inflammation that originated in the nasal mucosa and affected the health of most people around the world. Chronic rhinosinusitis with nasal polyps (CRSwNP) was one kind of chronic sinusitis. Emerging research had suggested that long noncoding RNAs (lncRNAs) played vital parts in inflammatories and inflammation development. Methods We acquired GEO data to analyze the differential expression between the miRNA, immune genes, TF, and lncRNA data in CRSWNP and the corresponding control tissues. Bioinformatic analysis by coexpression of endogenous RNA network and competitive way enrichment, analysis, and forecasting functions of these noncoding RNA. The different pathway expressions in CRSwNP patients were confirmed using GSVA to analyze the differentially expressed immune genes and TF data sets in CRSwNP patients. The differential immune gene and transcription factor data set in CRSwNP perform functional notes and protein-protein interaction (PPI) network structure. We predicted the potential genes and RNAs related to CRSWNP by constructing a ceRNA network. In addition, we also used 19 hub immune genes to predict the potential drugs of CRSWNP. lncRNA biomarkers in CRSwNP were identified by lncRNAs LASSO regression. The CIBERSORT algorithm was used to contrast the divergence in immune infiltrations between CRSwNP and usual inferior turbinate organizations in 22 immunocyte subgroups. Results We identified a total of 48 miRNAs, 304 lncRNAs, 92 TFs, and 525 immune genes as CRSwNP-specific RNAs. GO and KEGG pathways both analyzed differentially expressed immune genes and transcription factor data sets. We predicted the potential genes GNG7, TUSC8, LINC01198, and has-miR-6776-5p by constructing ceRNA and PPI networks. At the same time, we found that the above genes were involved in two important pathways: chemokine signal path and PI3K/AKT signal path. In addition, we predicted 5 small molecule drugs to treat CRSwNP by analyzing 19 central immune genes, namely, danazol, ikarugamycin, semustine, cefamandole, and molindone. Finally, we identified 5 biomarkers in CRSwNP, namely, LINC01198, LINC01094, LINC01798, LINC01829, and LINC01320. Conclusions We had identified CRSwNP-related miRNAs, lncRNAs, TFs, and immune genes, which may be making use of latent therapeutic target for CRSwNP. At the same time, we identified 5 lncRNA biomarkers in CRSwNP. The results of this study showed that LINC01198 promoted the progression of CRSwNPs through spongy miR-6776-5p. Our studies provide a new way for further analyses of the pathogenesis of CRSwNP.
Collapse
Affiliation(s)
- Xueping Wang
- Department of Otolaryngology Head and Neck Surgery, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, Henan, China 410000
| | - Xiaoyuan Zhu
- Department of Otolaryngology Head and Neck Surgery, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, Henan, China 410000
| | - Li Peng
- Department of Obstetrics and Gynecology, The First People's Hospital of Nanyang, Nanyang, Henan, China 473000
| | - Yulin Zhao
- Department of Otolaryngology Head and Neck Surgery, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, Henan, China 410000
| |
Collapse
|
23
|
Zhu X, Gu Y, Xiao Z. HerbKG: Constructing a Herbal-Molecular Medicine Knowledge Graph Using a Two-Stage Framework Based on Deep Transfer Learning. Front Genet 2022; 13:799349. [PMID: 35571049 PMCID: PMC9091197 DOI: 10.3389/fgene.2022.799349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2021] [Accepted: 04/05/2022] [Indexed: 11/13/2022] Open
Abstract
Recent advances have witnessed a growth of herbalism studies adopting a modern scientific approach in molecular medicine, offering valuable domain knowledge that can potentially boost the development of herbalism with evidence-supported efficacy and safety. However, these domain-specific scientific findings have not been systematically organized, affecting the efficiency of knowledge discovery and usage. Existing knowledge graphs in herbalism mainly focus on diagnosis and treatment with an absence of knowledge connection with molecular medicine. To fill this gap, we present HerbKG, a knowledge graph that bridges herbal and molecular medicine. The core bio-entities of HerbKG include herbs, chemicals extracted from the herbs, genes that are affected by the chemicals, and diseases treated by herbs due to the functions of genes. We have developed a learning framework to automate the process of HerbKG construction. The resulting HerbKG, after analyzing over 500K PubMed abstracts, is populated with 53K relations, providing extensive herbal-molecular domain knowledge in support of downstream applications. The code and an interactive tool are available at https://github.com/FeiYee/HerbKG.
Collapse
Affiliation(s)
- Xian Zhu
- School of Information Management, Nanjing University, Nanjing, China
- School of Health Economics and Management, Nanjing University of Chinese Medicine, Nanjing, China
| | - Yueming Gu
- School of Computing and Information Systems, Faculty of Engineering and Information Technology, University of Melbourne, Parkville, VIC, Australia
| | - Zhifeng Xiao
- School of Engineering, Penn State Erie, The Behrend College, Erie, PA, United States
| |
Collapse
|
24
|
Pasche E, Mottaz A, Caucheteur D, Gobeill J, Michel PA, Ruch P. Variomes: a high recall search engine to support the curation of genomic variants. Bioinformatics 2022; 38:2595-2601. [PMID: 35274687 PMCID: PMC9048643 DOI: 10.1093/bioinformatics/btac146] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2021] [Revised: 02/07/2022] [Accepted: 03/10/2022] [Indexed: 12/02/2022] Open
Abstract
Motivation Identification and interpretation of clinically actionable variants is a critical bottleneck. Searching for evidence in the literature is mandatory according to ASCO/AMP/CAP practice guidelines; however, it is both labor-intensive and error-prone. We developed a system to perform triage of publications relevant to support an evidence-based decision. The system is also able to prioritize variants. Our system searches within pre-annotated collections such as MEDLINE and PubMed Central. Results We assess the search effectiveness of the system using three different experimental settings: literature triage; variant prioritization and comparison of Variomes with LitVar. Almost two-thirds of the publications returned in the top-5 are relevant for clinical decision-support. Our approach enabled identifying 81.8% of clinically actionable variants in the top-3. Variomes retrieves on average +21.3% more articles than LitVar and returns the same number of results or more results than LitVar for 90% of the queries when tested on a set of 803 queries; thus, establishing a new baseline for searching the literature about variants. Availability and implementation Variomes is publicly available at https://candy.hesge.ch/Variomes. Source code is freely available at https://github.com/variomes/sibtm-variomes. SynVar is publicly available at https://goldorak.hesge.ch/synvar. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Emilie Pasche
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland.,BiTeM Group, Information Sciences, 1227 Carouge, Switzerland HES-SO/HEG
| | - Anaïs Mottaz
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland.,BiTeM Group, Information Sciences, 1227 Carouge, Switzerland HES-SO/HEG
| | - Déborah Caucheteur
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland.,BiTeM Group, Information Sciences, 1227 Carouge, Switzerland HES-SO/HEG
| | - Julien Gobeill
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland.,BiTeM Group, Information Sciences, 1227 Carouge, Switzerland HES-SO/HEG
| | - Pierre-André Michel
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland.,BiTeM Group, Information Sciences, 1227 Carouge, Switzerland HES-SO/HEG
| | - Patrick Ruch
- SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland.,BiTeM Group, Information Sciences, 1227 Carouge, Switzerland HES-SO/HEG
| |
Collapse
|
25
|
Kaushik V, Plazzer JP, Winship I, Macrae F. Genetic variant interpretation: a primer for clinicians. Intern Med J 2021; 51:1401-1406. [PMID: 34541770 DOI: 10.1111/imj.15485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2020] [Revised: 02/08/2021] [Accepted: 05/31/2021] [Indexed: 11/28/2022]
Abstract
Clinicians beyond specialist genetic services are now able to order tests to interrogate the genetic basis of disease. Behind every genetic report lies a significant body of work that draws on decades of collaboration between clinicians, researchers and database curators. Understanding these advances in genetic variant interpretation may allow practising clinicians to develop a more nuanced appreciation of the role genetic variant interpretation can play in the diagnosis and management of heritable disorders. In this article, we consider genetic variant interpretation with reference to efforts to better understand variation in the mismatch repair genes and their relation to Lynch syndrome - the most common cause of hereditary colon cancer.
Collapse
Affiliation(s)
- Varun Kaushik
- The Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Melbourne, Victoria, Australia.,Department of Colorectal Medicine and Genetics, The Royal Melbourne Hospital, Melbourne, Victoria, Australia
| | - John-Paul Plazzer
- Department of Colorectal Medicine and Genetics, The Royal Melbourne Hospital, Melbourne, Victoria, Australia
| | - Ingrid Winship
- Department of Genomic Medicine, The Royal Melbourne Hospital, Melbourne, Victoria, Australia.,Department of Medicine, The University of Melbourne, The Royal Melbourne Hospital, Melbourne, Victoria, Australia
| | - Finlay Macrae
- Department of Colorectal Medicine and Genetics, The Royal Melbourne Hospital, Melbourne, Victoria, Australia.,Department of Medicine, The University of Melbourne, The Royal Melbourne Hospital, Melbourne, Victoria, Australia
| |
Collapse
|
26
|
Becker TE, Jakobsson E. ResidueFinder: extracting individual residue mentions from protein literature. J Biomed Semantics 2021; 12:14. [PMID: 34289903 PMCID: PMC8293528 DOI: 10.1186/s13326-021-00243-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2019] [Accepted: 05/07/2021] [Indexed: 11/10/2022] Open
Abstract
Background The revolution in molecular biology has shown how protein function and structure are based on specific sequences of amino acids. Thus, an important feature in many papers is the mention of the significance of individual amino acids in the context of the entire sequence of the protein. MutationFinder is a widely used program for finding mentions of specific mutations in texts. We report on augmenting the positive attributes of MutationFinder with a more inclusive regular expression list to create ResidueFinder, which finds mentions of native amino acids as well as mutations. We also consider parameter options for both ResidueFinder and MutationFinder to explore trade-offs between precision, recall, and computational efficiency. We test our methods and software in full text as well as abstracts. Results We find there is much more variety of formats for mentioning residues in the entire text of papers than in abstracts alone. Failure to take these multiple formats into account results in many false negatives in the program. Since MutationFinder, like several other programs, was primarily tested on abstracts, we found it necessary to build an expanded regular expression list to achieve acceptable recall in full text searches. We also discovered a number of artifacts arising from PDF to text conversion, which we wrote elements in the regular expression library to address. Taking into account those factors resulted in high recall on randomly selected primary research articles. We also developed a streamlined regular expression (called “cut”) which enables a several hundredfold speedup in both MutationFinder and ResidueFinder with only a modest compromise of recall. All regular expressions were tested using expanded F-measure statistics, i.e., we compute Fβ for various values of where the larger the value of β the more recall is weighted, the smaller the value of β the more precision is weighted. Conclusions ResidueFinder is a simple, effective, and efficient program for finding individual residue mentions in primary literature starting with text files, implemented in Python, and available in SourceForge.net. The most computationally efficient versions of ResidueFinder could enable creation and maintenance of a database of residue mentions encompassing all articles in PubMed. Supplementary Information The online version contains supplementary material available at 10.1186/s13326-021-00243-3.
Collapse
Affiliation(s)
- Ton E Becker
- Department of Molecular and Integrative Physiology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA
| | - Eric Jakobsson
- Department of Molecular and Integrative Physiology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA. .,Department of Biochemistry, Program in Biophysics and Computational Biology, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA.
| |
Collapse
|
27
|
Islamaj R, Wei CH, Cissel D, Miliaras N, Printseva O, Rodionov O, Sekiya K, Ward J, Lu Z. NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition. J Biomed Inform 2021; 118:103779. [PMID: 33839304 PMCID: PMC11037554 DOI: 10.1016/j.jbi.2021.103779] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2021] [Revised: 03/14/2021] [Accepted: 04/05/2021] [Indexed: 10/21/2022]
Abstract
The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. While current methods for tagging gene entities have been developed for biomedical literature, their performance on species other than human is substantially lower due to the lack of annotation data. We therefore present the NLM-Gene corpus, a high-quality manually annotated corpus for genes developed at the US National Library of Medicine (NLM), covering ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per document, and a broader representation of different species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.) when compared to previous gene annotation corpora. NLM-Gene consists of 550 PubMed abstracts from 156 biomedical journals, doubly annotated by six experienced NLM indexers, randomly paired for each document to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. This gold-standard corpus can serve as a benchmark to develop & test new gene text mining algorithms. Using this new resource, we have developed a new gene finding algorithm based on deep learning which improved both on precision and recall from existing tools. The NLM-Gene annotated corpus is freely available at ftp://ftp.ncbi.nlm.nih.gov/pub/lu/NLMGene. We have also applied this tool to the entire PubMed/PMC with their results freely accessible through our web-based tool PubTator (www.ncbi.nlm.nih.gov/research/pubtator).
Collapse
Affiliation(s)
- Rezarta Islamaj
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Chih-Hsuan Wei
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - David Cissel
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Nicholas Miliaras
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Olga Printseva
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Oleg Rodionov
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Keiko Sekiya
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Janice Ward
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Zhiyong Lu
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
| |
Collapse
|
28
|
Lee K, Wei CH, Lu Z. Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. Brief Bioinform 2021; 22:bbaa142. [PMID: 32770181 PMCID: PMC8138883 DOI: 10.1093/bib/bbaa142] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2020] [Revised: 06/07/2020] [Accepted: 06/25/2020] [Indexed: 12/28/2022] Open
Abstract
MOTIVATION To obtain key information for personalized medicine and cancer research, clinicians and researchers in the biomedical field are in great need of searching genomic variant information from the biomedical literature now than ever before. Due to the various written forms of genomic variants, however, it is difficult to locate the right information from the literature when using a general literature search system. To address the difficulty of locating genomic variant information from the literature, researchers have suggested various solutions based on automated literature-mining techniques. There is, however, no study for summarizing and comparing existing tools for genomic variant literature mining in terms of how to search easily for information in the literature on genomic variants. RESULTS In this article, we systematically compared currently available genomic variant recognition and normalization tools as well as the literature search engines that adopted these literature-mining techniques. First, we explain the problems that are caused by the use of non-standard formats of genomic variants in the PubMed literature by considering examples from the literature and show the prevalence of the problem. Second, we review literature-mining tools that address the problem by recognizing and normalizing the various forms of genomic variants in the literature and systematically compare them. Third, we present and compare existing literature search engines that are designed for a genomic variant search by using the literature-mining techniques. We expect this work to be helpful for researchers who seek information about genomic variants from the literature, developers who integrate genomic variant information from the literature and beyond.
Collapse
Affiliation(s)
- Kyubum Lee
- National Center for Biotechnology Information
| | | | - Zhiyong Lu
- National Center for Biotechnology Information
| |
Collapse
|
29
|
Kaushik V, Plazzer J, Macrae F. Evaluation of literature searching tools for curation of mismatch repair gene variants in hereditary colon cancer. ADVANCED GENETICS (HOBOKEN, N.J.) 2021; 2:e10039. [PMID: 36618447 PMCID: PMC9744508 DOI: 10.1002/ggn2.10039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/28/2020] [Revised: 01/12/2021] [Accepted: 01/14/2021] [Indexed: 01/11/2023]
Abstract
Pathogenic constitutional genomic variants in the mismatch repair (MMR) genes are the drivers of Lynch syndrome; optimal variant interpretation is required for the management of suspected and confirmed cases. The International Society for Hereditary Gastrointestinal Tumours (InSiGHT) provides expert classifications for MMR variants for the US National Human Genome Research Institute's (NHGRI) ClinGen initiative and interprets variants with discordant classifications and those of uncertain significance (VUSs). Given the onerous nature of extracting information related to variants, literature searching tools which harness artificial intelligence may aid in retrieving information to allow optimum variant classification. In this study, we described the nature of discordance in a sample of 80 variants from a list of variants requiring updating by InSiGHT for ClinGen by comparing their existing InSiGHT classifications with the various submissions for each variant on the US National Centre for Biotechnology Information's (NCBI) ClinVar database. To identify the potential value of a literature searching tool in extracting information related to classification, all variants were searched for using a traditional method (Google Scholar) and literature searching tool (Mastermind) independently. Descriptive statistics were used to compare: the number of articles before and after screening for relevance and the number of relevant articles unique to either method. Relevance was defined as containing the variant in question as well as data informing variant interpretation. A total of 916 articles were returned by both methods and Mastermind averaged four relevant articles per search compared to Google Scholar's three. Of relevant Mastermind articles, 193/308 (62.7%) were unique to it, compared to 87/202, (43.0%) for Google Scholar. For 24 variants, either or both methods found no information. All 6/80 (20%) variants with pathogenic or likely pathogenic InSiGHT classifications have newer VUS assertions on ClinVar. Our study demonstrated that for a sample of variants with varying discordant interpretations, Mastermind was able to return on average, a more relevant and unique literature search. Google Scholar was able to retrieve information that Mastermind did not, which supports a conclusion that Mastermind could play a complementary role in literature searching for classification. This work will aid InSiGHT in its role of classifying MMR variants.
Collapse
Affiliation(s)
- Varun Kaushik
- Melbourne Medical SchoolThe University of MelbourneParkvilleVictoriaAustralia,Department of Colorectal Medicine and Genetics, The Royal Melbourne HospitalParkvilleVictoriaAustralia
| | - John‐Paul Plazzer
- Department of Colorectal Medicine and Genetics, The Royal Melbourne HospitalParkvilleVictoriaAustralia
| | - Finlay Macrae
- Department of Colorectal Medicine and Genetics, The Royal Melbourne HospitalParkvilleVictoriaAustralia,Department of Medicine, The Royal Melbourne HospitalThe University of MelbourneParkvilleVictoriaAustralia
| |
Collapse
|
30
|
Martin RL. Gene-Centric Database Reveals Environmental and Lifestyle Relationships for Potential Risk Modification and Prevention. Lifestyle Genom 2021; 14:30-36. [PMID: 33461193 DOI: 10.1159/000512690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2020] [Accepted: 10/29/2020] [Indexed: 11/19/2022] Open
Abstract
The database at Nutrigenetics.net has been under development since 2007 to facilitate the identification and classification of PubMed articles relevant to human genetics. A controlled vocabulary (i.e., standardized terminology) is used to index these records, with links back to PubMed for every article title. This enables the display of indexes (alphabetical subtopic listings) for any given topic, or for any given combination of topics, including for genes and specific genetic variants. Stepwise use of such indexes (first for one topic, then for combinations of topics) can reveal relationships that are otherwise easily overlooked. These relationships include environmental and lifestyle variables with potential relevance to risk modification (both beneficial and detrimental), and to prevention, or at least to the potential delay of symptom onset for health conditions like Alzheimer disease among many others. Thirty-four specific genetic variants have each been mentioned in at least ≥1,000 PubMed titles/abstracts, and these numbers are steadily increasing. The benefits of indexing with standardized terminology are illustrated for genetic variants like MTHFR 677C-T and its various synonyms (e.g., rs1801133 or Ala222Val). Such use of a controlled vocabulary is also helpful for numerous health conditions, and for potential risk modifiers (i.e., potential risk/effect modifiers).
Collapse
Affiliation(s)
- Ron L Martin
- Nutrigenetics Unlimited, Inc., Fullerton, California, USA,
| |
Collapse
|
31
|
Dingerdissen HM, Bastian F, Vijay-Shanker K, Robinson-Rechavi M, Bell A, Gogate N, Gupta S, Holmes E, Kahsay R, Keeney J, Kincaid H, King CH, Liu D, Crichton DJ, Mazumder R. OncoMX: A Knowledgebase for Exploring Cancer Biomarkers in the Context of Related Cancer and Healthy Data. JCO Clin Cancer Inform 2020; 4:210-220. [PMID: 32142370 PMCID: PMC7101249 DOI: 10.1200/cci.19.00117] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022] Open
Abstract
PURPOSE The purpose of OncoMX1 knowledgebase development was to integrate cancer biomarker and relevant data types into a meta-portal, enabling the research of cancer biomarkers side by side with other pertinent multidimensional data types. METHODS Cancer mutation, cancer differential expression, cancer expression specificity, healthy gene expression from human and mouse, literature mining for cancer mutation and cancer expression, and biomarker data were integrated, unified by relevant biomedical ontologies, and subjected to rule-based automated quality control before ingestion into the database. RESULTS OncoMX provides integrated data encompassing more than 1,000 unique biomarker entries (939 from the Early Detection Research Network [EDRN] and 96 from the US Food and Drug Administration) mapped to 20,576 genes that have either mutation or differential expression in cancer. Sentences reporting mutation or differential expression in cancer were extracted from more than 40,000 publications, and healthy gene expression data with samples mapped to organs are available for both human genes and their mouse orthologs. CONCLUSION OncoMX has prioritized user feedback as a means of guiding development priorities. By mapping to and integrating data from several cancer genomics resources, it is hoped that OncoMX will foster a dynamic engagement between bioinformaticians and cancer biomarker researchers. This engagement should culminate in a community resource that substantially improves the ability and efficiency of exploring cancer biomarker data and related multidimensional data.
Collapse
Affiliation(s)
| | - Frederic Bastian
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.,Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
| | | | - Marc Robinson-Rechavi
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.,Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
| | - Amanda Bell
- The George Washington University, Washington DC
| | | | | | - Evan Holmes
- The George Washington University, Washington DC
| | | | | | | | | | - David Liu
- NASA Jet Propulsion Laboratory, Pasadena, CA
| | | | | |
Collapse
|
32
|
Rehmat N, Farooq H, Kumar S, Ul Hussain S, Naveed H. Predicting the pathogenicity of protein coding mutations using Natural Language Processing. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2020; 2020:5842-5846. [PMID: 33019302 DOI: 10.1109/embc44109.2020.9175781] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
DNA-Sequencing of tumor cells has revealed thousands of genetic mutations. However, cancer is caused by only some of them. Identifying mutations that contribute to tumor growth from neutral ones is extremely challenging and is currently carried out manually. This manual annotation is very cumbersome and expensive in terms of time and money. In this study, we introduce a novel method "NLP-SNPPred" to read scientific literature and learn the implicit features that cause certain variations to be pathogenic. Precisely, our method ingests the bio-medical literature and produces its vector representation via exploiting state of the art NLP methods like sent2vec, word2vec and tf-idf. These representations are then fed to machine learning predictors to identify the pathogenic versus neutral variations. Our best model (NLPSNPPred) trained on OncoKB and evaluated on several publicly available benchmark datasets, outperformed state of the art function prediction methods. Our results show that NLP can be used effectively in predicting functional impact of protein coding variations with minimal complementary biological features. Moreover, encoding biological knowledge into the right representations, combined with machine learning methods can help in automating manual efforts. A free to use web-server is available at http://www.nlp-snppred.cbrlab.org.
Collapse
|
33
|
Sarmah P, Bharali R, Khatonier R, Khan A. Polymorphism in Toll interacting protein (TOLLIP) gene and its association with Visceral Leishmaniasis. GENE REPORTS 2020. [DOI: 10.1016/j.genrep.2020.100705] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
34
|
Saberian N, Shafi A, Peyvandipour A, Draghici S. MAGPEL: an autoMated pipeline for inferring vAriant-driven Gene PanEls from the full-length biomedical literature. Sci Rep 2020; 10:12365. [PMID: 32703994 PMCID: PMC7378213 DOI: 10.1038/s41598-020-68649-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2019] [Accepted: 06/17/2020] [Indexed: 11/09/2022] Open
Abstract
In spite of the efforts in developing and maintaining accurate variant databases, a large number of disease-associated variants are still hidden in the biomedical literature. Curation of the biomedical literature in an effort to extract this information is a challenging task due to: (i) the complexity of natural language processing, (ii) inconsistent use of standard recommendations for variant description, and (iii) the lack of clarity and consistency in describing the variant-genotype-phenotype associations in the biomedical literature. In this article, we employ text mining and word cloud analysis techniques to address these challenges. The proposed framework extracts the variant-gene-disease associations from the full-length biomedical literature and designs an evidence-based variant-driven gene panel for a given condition. We validate the identified genes by showing their diagnostic abilities to predict the patients' clinical outcome on several independent validation cohorts. As representative examples, we present our results for acute myeloid leukemia (AML), breast cancer and prostate cancer. We compare these panels with other variant-driven gene panels obtained from Clinvar, Mastermind and others from literature, as well as with a panel identified with a classical differentially expressed genes (DEGs) approach. The results show that the panels obtained by the proposed framework yield better results than the other gene panels currently available in the literature.
Collapse
Affiliation(s)
- Nafiseh Saberian
- Department of Computer Science, Wayne State University, Detroit, MI, USA
| | - Adib Shafi
- Department of Computer Science, Wayne State University, Detroit, MI, USA
| | - Azam Peyvandipour
- Department of Computer Science, Wayne State University, Detroit, MI, USA
| | - Sorin Draghici
- Department of Computer Science, Wayne State University, Detroit, MI, USA.
- Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI, USA.
| |
Collapse
|
35
|
Alag S. Unique insights from ClinicalTrials.gov by mining protein mutations and RSids in addition to applying the Human Phenotype Ontology. PLoS One 2020; 15:e0233438. [PMID: 32459809 PMCID: PMC7252633 DOI: 10.1371/journal.pone.0233438] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2019] [Accepted: 05/05/2020] [Indexed: 01/31/2023] Open
Abstract
Researchers and clinicians face a significant challenge in keeping up-to-date with the rapid rate of new associations between genetic mutations and diseases. To remedy this problem, this research mined the ClinicalTrials.gov corpus to extract relevant biological insights, produce unique reports to summarize findings, and make the meta-data available via APIs. An automated text-analysis pipeline performed the following features: parsing the ClinicalTrials.gov files, extracting and analyzing mutations from the corpus, mapping clinical trials to Human Phenotype Ontology (HPO), and finding associations between clinical trials and HPO nodes. Unique reports were created for each mutation (SNPs and protein mutations) mentioned in the corpus, as well as for each clinical trial that references a mutation. These reports, which have been run over multiple time points, along with APIs to access meta-data, are freely available at http://snpminertrials.com. Additionally, HPO was used to normalize disease terms and associate clinical trials with relevant genes. The creation of the pipeline and reports, the association of clinical trials with HPO terms, and the insights, public repository, and APIs produced are all novel in this work. The freely-available resources present relevant biological information and novel insights between biomedical entities in a robust and accessible manner, mitigating the challenge of being informed about new associations between mutations, genes, and diseases.
Collapse
Affiliation(s)
- Shray Alag
- The Harker School, San Jose, CA, United States of America
- * E-mail:
| |
Collapse
|
36
|
Abstract
The UniProt Knowledgebase is a collection of sequences and annotations for over 120 million proteins across all branches of life. Detailed annotations extracted from the literature by expert curators have been collected for over half a million of these proteins. These annotations are supplemented by annotations provided by rule based automated systems, and those imported from other resources. In this article we describe significant updates that we have made over the last 2 years to the resource. We have greatly expanded the number of Reference Proteomes that we provide and in particular we have focussed on improving the number of viral Reference Proteomes. The UniProt website has been augmented with new data visualizations for the subcellular localization of proteins as well as their structure and interactions. UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/.
Collapse
Affiliation(s)
- The UniProt Consortium
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, CH-1211 Geneva 4, Switzerland
- Protein Information Resource, Georgetown University Medical Center, 3300 Whitehaven Street NW, Suite 1200, Washington, DC 20007, USA
- Protein Information Resource, University of Delaware, 15 Innovation Way, Suite 205, Newark DE 19711, USA
- To whom correspondence should be addressed. Tel: +44 1223 494 100; Fax: +44 1223 494 468;
| |
Collapse
|
37
|
Wei CH, Allot A, Leaman R, Lu Z. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res 2020; 47:W587-W593. [PMID: 31114887 DOI: 10.1093/nar/gkz389] [Citation(s) in RCA: 188] [Impact Index Per Article: 47.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2019] [Revised: 04/08/2019] [Accepted: 04/30/2019] [Indexed: 11/12/2022] Open
Abstract
PubTator Central (https://www.ncbi.nlm.nih.gov/research/pubtator/) is a web service for viewing and retrieving bioconcept annotations in full text biomedical articles. PubTator Central (PTC) provides automated annotations from state-of-the-art text mining systems for genes/proteins, genetic variants, diseases, chemicals, species and cell lines, all available for immediate download. PTC annotates PubMed (29 million abstracts) and the PMC Text Mining subset (3 million full text articles). The new PTC web interface allows users to build full text document collections and visualize concept annotations in each document. Annotations are downloadable in multiple formats (XML, JSON and tab delimited) via the online interface, a RESTful web service and bulk FTP. Improved concept identification systems and a new disambiguation module based on deep learning increase annotation accuracy, and the new server-side architecture is significantly faster. PTC is synchronized with PubMed and PubMed Central, with new articles added daily. The original PubTator service has served annotated abstracts for ∼300 million requests, enabling third-party research in use cases such as biocuration support, gene prioritization, genetic disease analysis, and literature-based knowledge discovery. We demonstrate the full text results in PTC significantly increase biomedical concept coverage and anticipate this expansion will both enhance existing downstream applications and enable new use cases.
Collapse
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| |
Collapse
|
38
|
Huang KY, Lee TY, Kao HJ, Ma CT, Lee CC, Lin TH, Chang WC, Huang HD. dbPTM in 2019: exploring disease association and cross-talk of post-translational modifications. Nucleic Acids Res 2020; 47:D298-D308. [PMID: 30418626 PMCID: PMC6323979 DOI: 10.1093/nar/gky1074] [Citation(s) in RCA: 138] [Impact Index Per Article: 34.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2018] [Accepted: 10/19/2018] [Indexed: 12/25/2022] Open
Abstract
The dbPTM (http://dbPTM.mbc.nctu.edu.tw/) has been maintained for over 10 years with the aim to provide functional and structural analyses for post-translational modifications (PTMs). In this update, dbPTM not only integrates more experimentally validated PTMs from available databases and through manual curation of literature but also provides PTM-disease associations based on non-synonymous single nucleotide polymorphisms (nsSNPs). The high-throughput deep sequencing technology has led to a surge in the data generated through analysis of association between SNPs and diseases, both in terms of growth amount and scope. This update thus integrated disease-associated nsSNPs from dbSNP based on genome-wide association studies. The PTM substrate sites located at a specified distance in terms of the amino acids encoded from nsSNPs were deemed to have an association with the involved diseases. In recent years, increasing evidence for crosstalk between PTMs has been reported. Although mass spectrometry-based proteomics has substantially improved our knowledge about substrate site specificity of single PTMs, the fact that the crosstalk of combinatorial PTMs may act in concert with the regulation of protein function and activity is neglected. Because of the relatively limited information about concurrent frequency and functional relevance of PTM crosstalk, in this update, the PTM sites neighboring other PTM sites in a specified window length were subjected to motif discovery and functional enrichment analysis. This update highlights the current challenges in PTM crosstalk investigation and breaks the bottleneck of how proteomics may contribute to understanding PTM codes, revealing the next level of data complexity and proteomic limitation in prospective PTM research.
Collapse
Affiliation(s)
- Kai-Yao Huang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen 518172, China.,School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China.,School of Life and Health Science, The Chinese University of Hong Kong, Shenzhen 518172, China
| | - Tzong-Yi Lee
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen 518172, China.,School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China.,School of Life and Health Science, The Chinese University of Hong Kong, Shenzhen 518172, China
| | - Hui-Ju Kao
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan 32003, Taiwan
| | - Chen-Tse Ma
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan 32003, Taiwan
| | - Chao-Chun Lee
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan 32003, Taiwan
| | - Tsai-Hsuan Lin
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan 32003, Taiwan
| | - Wen-Chi Chang
- Institute of Tropical Plant Sciences, College of Biosciences and Biotechnology, National Cheng Kung University, Tainan 70101, Taiwan
| | - Hsien-Da Huang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen 518172, China.,School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China.,School of Life and Health Science, The Chinese University of Hong Kong, Shenzhen 518172, China
| |
Collapse
|
39
|
Chen Q, Du J, Kim S, Wilbur WJ, Lu Z. Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records. BMC Med Inform Decis Mak 2020; 20:73. [PMID: 32349758 PMCID: PMC7191680 DOI: 10.1186/s12911-020-1044-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
Background Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts on the development of related datasets and models in the general domain, both datasets and models are limited in biomedical and clinical domains. The BioCreative/OHNLP2018 organizers have made the first attempt to annotate 1068 sentence pairs from clinical notes and have called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge. Methods We developed models using traditional machine learning and deep learning approaches. For the post challenge, we focused on two models: the Random Forest and the Encoder Network. We applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network accordingly. Results The official results demonstrated our best submission was the ensemble of eight models. It achieved a Person correlation coefficient of 0.8328 – the highest performance among 13 submissions from 4 teams. For the post challenge, the performance of both Random Forest and the Encoder Network was improved; in particular, the correlation of the Encoder Network was improved by ~ 13%. During the challenge task, no end-to-end deep learning models had better performance than machine learning models that take manually-crafted features. In contrast, with the sentence embeddings pre-trained on biomedical corpora, the Encoder Network now achieves a correlation of ~ 0.84, which is higher than the original best model. The ensembled model taking the improved versions of the Random Forest and Encoder Network as inputs further increased performance to 0.8528. Conclusions Deep learning models with sentence embeddings pre-trained on biomedical corpora achieve the highest performance on the test set. Through error analysis, we find that end-to-end deep learning models and traditional machine learning models with manually-crafted features complement each other by finding different types of sentences. We suggest a combination of these models can better find similar sentences in practice.
Collapse
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
| | - Jingcheng Du
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA.,School of Biomedical Informatics, UTHealth, Houston, USA
| | - Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA.
| |
Collapse
|
40
|
Chen Q, Lee K, Yan S, Kim S, Wei CH, Lu Z. BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale. PLoS Comput Biol 2020; 16:e1007617. [PMID: 32324731 PMCID: PMC7237030 DOI: 10.1371/journal.pcbi.1007617] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2019] [Revised: 05/19/2020] [Accepted: 12/19/2019] [Indexed: 12/14/2022] Open
Abstract
A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. The capturing of the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings—which involve the learning of vector representations of concepts using machine learning models—have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is of the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we respectively performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which is, by far, the most comprehensive to our best knowledge. The intrinsic evaluation results demonstrate that BioConceptVec consistently has, by a large margin, better performance than existing concept embeddings in identifying similar and related concepts. More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec. Capturing the semantics of related biological concepts, such as genes and mutations, is of significant importance to many research tasks in computational biology such as protein-protein interaction detection, gene-drug association prediction, and biomedical literature-based discovery. Here, we propose to leverage state-of-the-art text mining tools and machine learning models to learn the semantics via vector representations (aka. embeddings) of over 400,000 biological concepts mentioned in the entire PubMed abstracts. Our learned embeddings, namely BioConceptVec, can capture related concepts based on their surrounding contextual information in the literature, which is beyond exact term match or co-occurrence-based methods. BioConceptVec has been thoroughly evaluated in multiple bioinformatics tasks consisting of over 25 million instances from nine different biological datasets. The evaluation results demonstrate that BioConceptVec has better performance than existing methods in all tasks. Finally, BioConceptVec is made freely available to the research community and general public.
Collapse
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Kyubum Lee
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Shankai Yan
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Sun Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| |
Collapse
|
41
|
Mutation profiles of classic myeloproliferative neoplasms detected by a customized next-generation sequencing-based 50-gene panel. JOURNAL OF BIO-X RESEARCH 2020. [DOI: 10.1097/jbr.0000000000000061] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022] Open
|
42
|
Nie A, Pineda AL, Wright MW, Wand H, Wulf B, Costa H, Patel R, Bustamante CD, Zou J. LitGen: Genetic Literature Recommendation Guided by Human Explanations. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2020; 25:67-78. [PMID: 31797587 PMCID: PMC7478937] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
As genetic sequencing costs decrease, the lack of clinical interpretation of variants has become the bottleneck in using genetics data. A major rate limiting step in clinical interpretation is the manual curation of evidence in the genetic literature by highly trained biocurators. What makes curation particularly time-consuming is that the curator needs to identify papers that study variant pathogenicity using different types of approaches and evidences-e.g. biochemical assays or case control analysis. In collaboration with the Clinical Genomic Resource (ClinGen)-the flagship NIH program for clinical curation-we propose the first machine learning system, LitGen, that can retrieve papers for a particular variant and filter them by specific evidence types used by curators to assess for pathogenicity. LitGen uses semi-supervised deep learning to predict the type of evi+dence provided by each paper. It is trained on papers annotated by ClinGen curators and systematically evaluated on new test data collected by ClinGen. LitGen further leverages rich human explanations and unlabeled data to gain 7.9%-12.6% relative performance improvement over models learned only on the annotated papers. It is a useful framework to improve clinical variant curation.
Collapse
Affiliation(s)
- Allen Nie
- Department of Biomedical Data Science, Stanford University School of Medicine,Department of Computer Science, Stanford University
| | - Arturo L. Pineda
- Department of Biomedical Data Science, Stanford University School of Medicine
| | - Matt W. Wright
- Department of Biomedical Data Science, Stanford University School of Medicine,Department of Pathology, Stanford University School of Medicine
| | - Hannah Wand
- Department of Biomedical Data Science, Stanford University School of Medicine,Department of Pathology, Stanford University School of Medicine,Department of Cardiology, Stanford Healthcare
| | - Bryan Wulf
- Department of Biomedical Data Science, Stanford University School of Medicine
| | - Helio Costa
- Department of Biomedical Data Science, Stanford University School of Medicine
| | - Ronak Patel
- Department of Molecular and Human Genetics, Baylor College of Medicine
| | - Carlos D. Bustamante
- Department of Biomedical Data Science, Stanford University School of Medicine,Chan-Zuckerberg Biohub
| | - James Zou
- Department of Biomedical Data Science, Stanford University School of Medicine,Department of Computer Science, Stanford University,Chan-Zuckerberg Biohub
| |
Collapse
|
43
|
PGxMine: Text mining for curation of PharmGKB. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2020; 25:611-622. [PMID: 31797632 PMCID: PMC6917032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
Precision medicine tailors treatment to individuals personal data including differences in their genome. The Pharmacogenomics Knowledgebase (PharmGKB) provides highly curated information on the effect of genetic variation on drug response and side effects for a wide range of drugs. PharmGKB's scientific curators triage, review and annotate a large number of papers each year but the task is challenging. We present the PGxMine resource, a text-mined resource of pharmacogenomic associations from all accessible published literature to assist in the curation of PharmGKB. We developed a supervised machine learning pipeline to extract associations between a variant (DNA and protein changes, star alleles and dbSNP identifiers) and a chemical. PGxMine covers 452 chemicals and 2,426 variants and contains 19,930 mentions of pharmacogenomic associations across 7,170 papers. An evaluation by PharmGKB curators found that 57 of the top 100 associations not found in PharmGKB led to 83 curatable papers and a further 24 associations would likely lead to curatable papers through citations. The results can be viewed at https://pgxmine.pharmgkb.org/ and code can be downloaded at https://github.com/jakelever/pgxmine.
Collapse
|
44
|
Jiang Y, Wu C, Zhang Y, Zhang S, Yu S, Lei P, Lu Q, Xi Y, Wang H, Song Z. GTX.Digest.VCF: an online NGS data interpretation system based on intelligent gene ranking and large-scale text mining. BMC Med Genomics 2019; 12:193. [PMID: 31856831 PMCID: PMC6923899 DOI: 10.1186/s12920-019-0637-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2019] [Accepted: 11/26/2019] [Indexed: 02/07/2023] Open
Abstract
Background An important task in the interpretation of sequencing data is to highlight pathogenic genes (or detrimental variants) in the field of Mendelian diseases. It is still challenging despite the recent rapid development of genomics and bioinformatics. A typical interpretation workflow includes annotation, filtration, manual inspection and literature review. Those steps are time-consuming and error-prone in the absence of systematic support. Therefore, we developed GTX.Digest.VCF, an online DNA sequencing interpretation system, which prioritizes genes and variants for novel disease-gene relation discovery and integrates text mining results to provide literature evidence for the discovery. Its phenotype-driven ranking and biological data mining approach significantly speed up the whole interpretation process. Results The GTX.Digest.VCF system is freely available as a web portal at http://vcf.gtxlab.com for academic research. Evaluation on the DDD project dataset demonstrates an accuracy of 77% (235 out of 305 cases) for top-50 genes and an accuracy of 41.6% (127 out of 305 cases) for top-5 genes. Conclusions GTX.Digest.VCF provides an intelligent web portal for genomics data interpretation via the integration of bioinformatics tools, distributed parallel computing, biomedical text mining. It can facilitate the application of genomic analytics in clinical research and practices.
Collapse
Affiliation(s)
| | - Chengkun Wu
- State Key Laboratory of High-Performance Computing, College of Computer, National University of Defense Technology, Changsha, 410073, China
| | - Yanghui Zhang
- NHC key laboratory of birth defects research, prevention and treatment (Hunan Provincial Maternal and Child Health Care Hospital), NO.53 Xiangchun Road, Changsha, 410008, Hunan, China
| | - Shaowei Zhang
- Genetalks Biotech. Co., Ltd., Changsha, 410000, China
| | - Shuojun Yu
- Genetalks Biotech. Co., Ltd., Changsha, 410000, China
| | - Peng Lei
- Genetalks Biotech. Co., Ltd., Changsha, 410000, China
| | - Qin Lu
- Genetalks Biotech. Co., Ltd., Changsha, 410000, China
| | - Yanwei Xi
- Cytogenetics and Human Molecular Genetics Laboratories, Royal University Hospital, Saskatoon, SK, Canada
| | - Hua Wang
- NHC key laboratory of birth defects research, prevention and treatment (Hunan Provincial Maternal and Child Health Care Hospital), NO.53 Xiangchun Road, Changsha, 410008, Hunan, China. .,Hunan Provincial Maternal and Child Health Care Hospital, Changsha, 410073, China.
| | - Zhuo Song
- Genetalks Biotech. Co., Ltd., Changsha, 410000, China.
| |
Collapse
|
45
|
Abstract
Next generation sequencing (NGS) represents several powerful platforms that have revolutionized RNA and DNA analysis. The parallel sequencing of millions of DNA molecules can provide mechanistic insights into toxicology and provide new avenues for biomarker discovery with growing relevance for risk assessment. The evolution of NGS technologies has improved over the last decade with increased sensitivity and accuracy to foster new biomarker assays from tissue, blood and other biofluids. NGS sequencing technologies can identify transcriptional changes and genomic targets with base pair precision in response to chemical exposure. Further, there are several exciting movements within the toxicology community that incorporate NGS platforms into new strategies for more rapid toxicological characterizations. These include the Tox21 in vitro high throughput transcriptomic screening program, development of organotypic spheroids, alternative animal models, mining archival tissues, liquid biopsy and epigenomics. This review will describe NGS-based technologies, demonstrate how they can be used as tools for target discovery in tissue and blood, and suggest how they might be applied for risk assessment.
Collapse
Affiliation(s)
- B Alex Merrick
- Molecular and Genomic Toxicology Group, Biomolecular Screening Branch, Division National Toxicology Program, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, Ph: 919-541-1531,
| |
Collapse
|
46
|
Li L, Feng J, Zhang D, Yong J, Wang Y, Yao J, Huang R. Differential expression of miR-4492 and IL-10 is involved in chronic rhinosinusitis with nasal polyps. Exp Ther Med 2019; 18:3968-3976. [PMID: 31611936 PMCID: PMC6781800 DOI: 10.3892/etm.2019.8022] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2018] [Accepted: 07/05/2019] [Indexed: 12/22/2022] Open
Abstract
Chronic rhinosinusitis (CRS) is one of the most common types of respiratory disease and affects a large proportion of the population worldwide. The clinical differences between CRS with nasal polyps (CRSwNP) and CRS without nasal polyps (CRSsNP) facilitate studies on the development of polyps. In the present study, next-generation sequencing was performed to identify differentially expressed microRNAs (miRNAs/miRs) in CRSwNP vs. CRSsNP tissues and subsequently validated the expression of selected genes using reverse transcription-quantitative polymerase chain reaction analysis. In total, 6 miRNAs were identified to be downregulated in the CRSwNP samples compared with those in the CRSsNP samples. The predicted targets of these miRNAs were determined to be enriched in a number of signaling pathways, including the ErbB, Ras, cyclic adenosine monophosphate and Janus kinase (Jak)/signal transducer and activator of transcription (STAT) pathways. The expression of miR-4492 and that if its targets predicted by a bioinformatics analysis, tumor necrosis factor α (TNF-α) and interleukin (IL)-10, was validated in 96 clinical specimens. miR-4492 was downregulated and IL-10 was upregulated in CRSwNP vs. CRSsNP tissues, and an inverse correlation between miR-4492 and IL-10 was determined in CRS tissues; however no difference was identified in the expression of TNF-α between the different groups. The present results indicate that the miR-4492/IL-10 interaction in the Jak/STAT signaling pathway may be one of the key mechanisms in CRSwNP.
Collapse
Affiliation(s)
- Linge Li
- Department of Otorhinolaryngology, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, Xinjiang 830054, P.R. China
| | - Juan Feng
- Department of Otorhinolaryngology, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, Xinjiang 830054, P.R. China
| | - Dinghao Zhang
- Department of Otorhinolaryngology, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, Xinjiang 830054, P.R. China
| | - Jun Yong
- Department of Otorhinolaryngology, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, Xinjiang 830054, P.R. China
| | - Yan Wang
- Department of Otorhinolaryngology, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, Xinjiang 830054, P.R. China
| | - Jianfeng Yao
- Reproductive Medicine Center, Quanzhou Maternal and Child Health Hospital, Fujian Medical University, Quanzhou, Fujian 362000, P.R. China
| | - Rongfu Huang
- Clinical Laboratory, The Second Affiliated Hospital, Fujian Medical University, Quanzhou, Fujian 362000, P.R. China
| |
Collapse
|
47
|
Kwon D, Kim S, Wei CH, Leaman R, Lu Z. ezTag: tagging biomedical concepts via interactive learning. Nucleic Acids Res 2019; 46:W523-W529. [PMID: 29788413 PMCID: PMC6030907 DOI: 10.1093/nar/gky428] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2018] [Accepted: 05/07/2018] [Indexed: 12/22/2022] Open
Abstract
Recently, advanced text-mining techniques have been shown to speed up manual data curation by providing human annotators with automated pre-annotations generated by rules or machine learning models. Due to the limited training data available, however, current annotation systems primarily focus only on common concept types such as genes or diseases. To support annotating a wide variety of biological concepts with or without pre-existing training data, we developed ezTag, a web-based annotation tool that allows curators to perform annotation and provide training data with humans in the loop. ezTag supports both abstracts in PubMed and full-text articles in PubMed Central. It also provides lexicon-based concept tagging as well as the state-of-the-art pre-trained taggers such as TaggerOne, GNormPlus and tmVar. ezTag is freely available at http://eztag.bioqrator.org.
Collapse
Affiliation(s)
- Dongseop Kwon
- School of Software Convergence, Myongji University, Seoul 03674, South Korea
| | - Sun Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| |
Collapse
|
48
|
Allot A, Peng Y, Wei CH, Lee K, Phan L, Lu Z. LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res 2019; 46:W530-W536. [PMID: 29762787 PMCID: PMC6030971 DOI: 10.1093/nar/gky355] [Citation(s) in RCA: 74] [Impact Index Per Article: 14.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2018] [Accepted: 05/08/2018] [Indexed: 01/10/2023] Open
Abstract
The identification and interpretation of genomic variants play a key role in the diagnosis of genetic diseases and related research. These tasks increasingly rely on accessing relevant manually curated information from domain databases (e.g. SwissProt or ClinVar). However, due to the sheer volume of medical literature and high cost of expert curation, curated variant information in existing databases are often incomplete and out-of-date. In addition, the same genetic variant can be mentioned in publications with various names (e.g. ‘A146T’ versus ‘c.436G>A’ versus ‘rs121913527’). A search in PubMed using only one name usually cannot retrieve all relevant articles for the variant of interest. Hence, to help scientists, healthcare professionals, and database curators find the most up-to-date published variant research, we have developed LitVar for the search and retrieval of standardized variant information. In addition, LitVar uses advanced text mining techniques to compute and extract relationships between variants and other associated entities such as diseases and chemicals/drugs. LitVar is publicly available at https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/LitVar.
Collapse
Affiliation(s)
- Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Yifan Peng
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Kyubum Lee
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Lon Phan
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| |
Collapse
|
49
|
Association of imputed prostate cancer transcriptome with disease risk reveals novel mechanisms. Nat Commun 2019; 10:3107. [PMID: 31308362 PMCID: PMC6629701 DOI: 10.1038/s41467-019-10808-7] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2019] [Accepted: 06/04/2019] [Indexed: 12/16/2022] Open
Abstract
Here we train cis-regulatory models of prostate tissue gene expression and impute expression transcriptome-wide for 233,955 European ancestry men (14,616 prostate cancer (PrCa) cases, 219,339 controls) from two large cohorts. Among 12,014 genes evaluated in the UK Biobank, we identify 38 associated with PrCa, many replicating in the Kaiser Permanente RPGEH. We report the association of elevated TMPRSS2 expression with increased PrCa risk (independent of a previously-reported risk variant) and with increased tumoral expression of the TMPRSS2:ERG fusion-oncogene in The Cancer Genome Atlas, suggesting a novel germline-somatic interaction mechanism. Three novel genes, HOXA4, KLK1, and TIMM23, additionally replicate in the RPGEH cohort. Furthermore, 4 genes, MSMB, NCOA4, PCAT1, and PPP1R14A, are associated with PrCa in a trans-ethnic meta-analysis (N = 9117). Many genes exhibit evidence for allele-specific transcriptional activation by PrCa master-regulators (including androgen receptor) in Position Weight Matrix, Chip-Seq, and Hi-C experimental data, suggesting common regulatory mechanisms for the associated genes.
Collapse
|
50
|
Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Wei CH, Comeau DC, Antunes R, Matos S, Chen Q, Elangovan A, Panyam NC, Verspoor K, Liu H, Wang Y, Liu Z, Altinel B, Hüsünbeyi ZM, Özgür A, Fergadis A, Wang CK, Dai HJ, Tran T, Kavuluru R, Luo L, Steppi A, Zhang J, Qu J, Lu Z. Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine. Database (Oxford) 2019; 2019:5303240. [PMID: 30689846 PMCID: PMC6348314 DOI: 10.1093/database/bay147] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2018] [Accepted: 12/19/2018] [Indexed: 12/16/2022]
Abstract
The Precision Medicine Initiative is a multicenter effort aiming at formulating personalized treatments leveraging on individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of BioCreative VI Precision Medicine Track was to extract this particular type of information and was organized in two tasks: (i) document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations and (ii) relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods. Given the level of participation and the individual team results we find the precision medicine track to be successful in engaging the text-mining research community. In the meantime, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPI affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.
Collapse
Affiliation(s)
- Rezarta Islamaj Dogan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | | | - Chih-Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Rui Antunes
- Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
| | - Sérgio Matos
- Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
| | - Qingyu Chen
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Aparna Elangovan
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Nagesh C Panyam
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
| | - Hongfang Liu
- Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
| | - Yanshan Wang
- Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
| | - Zhuang Liu
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Berna Altinel
- Department of Computer Engineering, Marmara University, Istanbul, Turkey
| | | | | | - Aris Fergadis
- School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, Athens, Greece
| | - Chen-Kai Wang
- Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan
| | - Hong-Jie Dai
- Department of Electrical Engineering, National Kaousiung University of Science and Technology, Kaohsiung, Taiwan
| | - Tung Tran
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
| | - Ramakanth Kavuluru
- Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
| | - Ling Luo
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Albert Steppi
- Department of Statistics, Florida State University, Florida, USA
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Florida, USA
| | - Jinchan Qu
- Department of Statistics, Florida State University, Florida, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| |
Collapse
|