1
Garda S, Weber-Genzel L, Martin R, Leser U. BELB: a biomedical entity linking benchmark. Bioinformatics 2023; 39:btad698. [PMID: 37975879 PMCID: PMC10681865 DOI: 10.1093/bioinformatics/btad698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Revised: 10/30/2023] [Accepted: 11/16/2023] [Indexed: 11/19/2023] Open
Abstract
MOTIVATION Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad coverage KB UMLS, leaving their performance to more specialized ones, e.g. genes or variants, understudied. RESULTS We therefore developed BELB, a biomedical entity linking benchmark, providing access in a unified format to 11 corpora linked to 7 KBs and spanning six entity types: gene, disease, chemical, species, cell line, and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora offering a standardized testbed for reproducible experiments. Using BELB, we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture showing that neural approaches fail to perform consistently across entity types, highlighting the need of further studies towards entity-agnostic models. AVAILABILITY AND IMPLEMENTATION The source code of BELB is available at: https://github.com/sg-wbi/belb. The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belb-exp.
Affiliation(s)
- Samuele Garda
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Leon Weber-Genzel
- Center for Information and Language Processing, Ludwig-Maximilians-Universität München, München 80539, Germany
- Robert Martin
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Ulf Leser
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
2
Mostafa T, Abdel-Hamid I, Taymour M, Ali O. Genetic variants in varicocele-related male infertility: a systematic review and future directions. Hum Fertil 2023; 26:632-648. [PMID: 34587863 DOI: 10.1080/14647273.2021.1983214] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/27/2020] [Accepted: 05/12/2021] [Indexed: 02/08/2023]
Abstract
Genetic association studies (GAS) may have the capability to probe the genetic susceptibility alleles in many disorders. This systematic review aimed to assess whether an association exists between gene(s)/allelic variant(s) and varicocele-related male infertility (VRMI). This review included 19 GAS that investigated 26 genes in 1,826 men with varicocele compared to 2,070 healthy men, and 263 infertile men without varicocele. These studies focussed on candidate genes and relevant variants, with glutathione S-transferase gene being the most frequently studied (n = 5) followed by the nitric oxide synthase 3 (NOS3) gene (n = 3) and the phosphoprotein tyrosine phosphatase 1 gene (n = 2). In one study the genes for NAD(P)H quinone oxidoreductase 1, sperm protamine, human 8-oxoguanine DNA glycosylase 1, methylenetetrahydrofolate reductase, polymerase gamma, heat shock protein 90, mitochondrial DNA, superoxide dismutase 2, transition nuclear protein 1, and transition nuclear protein 2 were assessed. There is no clear indication that any of these polymorphisms are robustly associated with VRMI. However, three studies established that the polymorphic genotype (GT + TT) for rs1799983 polymorphism of the NOS3 gene is more frequent in varicocele patients. Further endeavours such as standardising reporting, exploring complementary designs, and the use of GWAS technology are justified to help replicate these early findings.
Affiliation(s)
- Taymour Mostafa
- Andrology, Sexology & STIs Department, Faculty of Medicine, Cairo University, Cairo, Egypt
- Ibrahim Abdel-Hamid
- Division of Andrology, Faculty of Medicine, Mansoura University, Mansoura, Egypt
- Mai Taymour
- Dermatology & Andrology specialist, Cairo, Egypt
- Omar Ali
- Faculty of Medicine and Surgery, 6th October University, Giza, Egypt
3
Becker TE, Jakobsson E. ResidueFinder: extracting individual residue mentions from protein literature. J Biomed Semantics 2021; 12:14. [PMID: 34289903 PMCID: PMC8293528 DOI: 10.1186/s13326-021-00243-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2019] [Accepted: 05/07/2021] [Indexed: 11/10/2022] Open
Abstract
Background The revolution in molecular biology has shown how protein function and structure are based on specific sequences of amino acids. Thus, an important feature in many papers is the mention of the significance of individual amino acids in the context of the entire sequence of the protein. MutationFinder is a widely used program for finding mentions of specific mutations in texts. We report on augmenting the positive attributes of MutationFinder with a more inclusive regular expression list to create ResidueFinder, which finds mentions of native amino acids as well as mutations. We also consider parameter options for both ResidueFinder and MutationFinder to explore trade-offs between precision, recall, and computational efficiency. We test our methods and software in full text as well as abstracts. Results We find there is much more variety of formats for mentioning residues in the entire text of papers than in abstracts alone. Failure to take these multiple formats into account results in many false negatives in the program. Since MutationFinder, like several other programs, was primarily tested on abstracts, we found it necessary to build an expanded regular expression list to achieve acceptable recall in full text searches. We also discovered a number of artifacts arising from PDF to text conversion, which we addressed by adding elements to the regular expression library. Taking into account those factors resulted in high recall on randomly selected primary research articles. We also developed a streamlined regular expression (called “cut”) which enables a several hundredfold speedup in both MutationFinder and ResidueFinder with only a modest compromise of recall. All regular expressions were tested using expanded F-measure statistics, i.e., we compute Fβ for various values of β, where the larger the value of β the more recall is weighted, and the smaller the value of β the more precision is weighted.
Conclusions ResidueFinder is a simple, effective, and efficient program for finding individual residue mentions in primary literature starting with text files, implemented in Python, and available in SourceForge.net. The most computationally efficient versions of ResidueFinder could enable creation and maintenance of a database of residue mentions encompassing all articles in PubMed. Supplementary Information The online version contains supplementary material available at 10.1186/s13326-021-00243-3.
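The Fβ statistic mentioned in the abstract above can be illustrated with a minimal Python sketch. It uses the standard definition Fβ = (1 + β²)·P·R / (β²·P + R); the function name `f_beta` and the precision and recall values are assumptions chosen for illustration, not figures or code from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (standard F-beta)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    if precision == 0.0 and recall == 0.0:
        # Avoid division by zero when nothing was retrieved correctly.
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical scores for illustration only, not results from the paper.
precision, recall = 0.9, 0.6
for beta in (0.5, 1.0, 2.0):
    # Larger beta pulls the score toward recall, smaller beta toward precision.
    print(f"beta={beta}: F={f_beta(precision, recall, beta):.3f}")
```

Because recall is the lower of the two values in this example, the score falls as β grows, which matches the weighting described in the abstract.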
Affiliation(s)
- Ton E Becker
- Department of Molecular and Integrative Physiology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA
- Eric Jakobsson
- Department of Molecular and Integrative Physiology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA; Department of Biochemistry, Program in Biophysics and Computational Biology, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA
4
Lee K, Wei CH, Lu Z. Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. Brief Bioinform 2021; 22:bbaa142. [PMID: 32770181 PMCID: PMC8138883 DOI: 10.1093/bib/bbaa142] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/01/2020] [Revised: 06/07/2020] [Accepted: 06/25/2020] [Indexed: 12/28/2022] Open
Abstract
MOTIVATION To obtain key information for personalized medicine and cancer research, clinicians and researchers in the biomedical field are in great need of searching genomic variant information from the biomedical literature, now more than ever before. Due to the various written forms of genomic variants, however, it is difficult to locate the right information from the literature when using a general literature search system. To address the difficulty of locating genomic variant information from the literature, researchers have suggested various solutions based on automated literature-mining techniques. There is, however, no study summarizing and comparing existing tools for genomic variant literature mining in terms of how to search easily for information in the literature on genomic variants. RESULTS In this article, we systematically compared currently available genomic variant recognition and normalization tools as well as the literature search engines that adopted these literature-mining techniques. First, we explain the problems that are caused by the use of non-standard formats of genomic variants in the PubMed literature by considering examples from the literature and show the prevalence of the problem. Second, we review literature-mining tools that address the problem by recognizing and normalizing the various forms of genomic variants in the literature and systematically compare them. Third, we present and compare existing literature search engines that are designed for a genomic variant search by using the literature-mining techniques. We expect this work to be helpful for researchers who seek information about genomic variants from the literature and for developers who integrate genomic variant information from the literature and beyond.
Affiliation(s)
- Kyubum Lee
- National Center for Biotechnology Information
- Zhiyong Lu
- National Center for Biotechnology Information
5
Rehmat N, Farooq H, Kumar S, Ul Hussain S, Naveed H. Predicting the pathogenicity of protein coding mutations using Natural Language Processing. Annu Int Conf IEEE Eng Med Biol Soc 2020; 2020:5842-5846. [PMID: 33019302 DOI: 10.1109/embc44109.2020.9175781] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
DNA-Sequencing of tumor cells has revealed thousands of genetic mutations. However, cancer is caused by only some of them. Identifying mutations that contribute to tumor growth from neutral ones is extremely challenging and is currently carried out manually. This manual annotation is very cumbersome and expensive in terms of time and money. In this study, we introduce a novel method "NLP-SNPPred" to read scientific literature and learn the implicit features that cause certain variations to be pathogenic. Precisely, our method ingests the bio-medical literature and produces its vector representation via exploiting state of the art NLP methods like sent2vec, word2vec and tf-idf. These representations are then fed to machine learning predictors to identify the pathogenic versus neutral variations. Our best model (NLPSNPPred) trained on OncoKB and evaluated on several publicly available benchmark datasets, outperformed state of the art function prediction methods. Our results show that NLP can be used effectively in predicting functional impact of protein coding variations with minimal complementary biological features. Moreover, encoding biological knowledge into the right representations, combined with machine learning methods can help in automating manual efforts. A free to use web-server is available at http://www.nlp-snppred.cbrlab.org.
6
Huang MS, Lai PT, Lin PY, You YT, Tsai RTH, Hsu WL. Biomedical named entity recognition and linking datasets: survey and our recent development. Brief Bioinform 2020; 21:2219-2238. [DOI: 10.1093/bib/bbaa054] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 01/21/2020] [Revised: 02/29/2020] [Accepted: 03/31/2020] [Indexed: 11/14/2022] Open
Abstract
Natural language processing (NLP) is widely applied in biological domains to retrieve information from publications. Systems to address numerous applications exist, such as biomedical named entity recognition (BNER), named entity normalization (NEN) and protein–protein interaction extraction (PPIE). High-quality datasets can assist the development of robust and reliable systems; however, due to the endless applications and evolving techniques, the annotations of benchmark datasets may become outdated and inappropriate. In this study, we first review commonly used BNER datasets and their potential annotation problems such as inconsistency and low portability. Then, we introduce a revised version of the JNLPBA dataset that solves potential problems in the original and use state-of-the-art named entity recognition systems to evaluate its portability to different kinds of biomedical literature, including protein–protein interaction and biology events. Lastly, we introduce an ensembled biomedical entity dataset (EBED) by extending the revised JNLPBA dataset with PubMed Central full-text paragraphs, figure captions and patent abstracts. This EBED is a multi-task dataset that covers annotations including gene, disease and chemical entities. In total, it contains 85000 entity mentions, 25000 entity mentions with database identifiers and 5000 attribute tags. To demonstrate the usage of the EBED, we review the BNER track from the AI CUP Biomedical Paper Analysis challenge. Availability: The revised JNLPBA dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/Revised_JNLPBA.zip. The EBED dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/AICUP_EBED_dataset.rar. Contact: Email: thtsai@g.ncu.edu.tw, Tel. 886-3-4227151 ext. 35203, Fax: 886-3-422-2681; Email: hsu@iis.sinica.edu.tw, Tel. 886-2-2788-3799 ext. 2211, Fax: 886-2-2782-4814. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
Affiliation(s)
- Ming-Siang Huang
- Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Po-Ting Lai
- Institute of Biomedical Informatics, National Yang Ming University, Taipei, Taiwan
- Pei-Yen Lin
- Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan
- Yu-Ting You
- Intelligent Agent Systems Laboratory, Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Richard Tzong-Han Tsai
- Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Wen-Lian Hsu
- Intelligent Agent Systems Laboratory, Institute of Information Science, Academia Sinica, Taipei, Taiwan
7
Xu J, Kim S, Song M, Jeong M, Kim D, Kang J, Rousseau JF, Li X, Xu W, Torvik VI, Bu Y, Chen C, Ebeid IA, Li D, Ding Y. Building a PubMed knowledge graph. Sci Data 2020; 7:205. [PMID: 32591513 PMCID: PMC7320186 DOI: 10.1038/s41597-020-0543-2] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 12/11/2019] [Accepted: 05/26/2020] [Indexed: 01/08/2023] Open
Abstract
PubMed® is an essential resource for the medical domain, but useful concepts are either difficult to extract or are ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID®, and identifying fine-grained affiliation data from MapAffil. Through the integration of these credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51%), with the author name disambiguation (AND) achieving an F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities.
Measurement(s): textual entity • author information textual entity • funding source declaration textual entity • abstract • Biologic Entity Classification
Technology Type(s): machine learning • computational modeling technique
Machine-accessible metadata file describing the reported data: 10.6084/m9.figshare.12452597
Affiliation(s)
- Jian Xu
- School of Information Management, Sun Yat-sen University, Guangzhou, China
- Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Min Song
- Department of Library and Information Science, Yonsei University, Seoul, South Korea
- Minbyul Jeong
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Donghyeon Kim
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Xin Li
- School of Information, University of Texas at Austin, Austin, TX, USA
- Weijia Xu
- Texas Advanced Computing Center, Austin, TX, USA
- Vetle I Torvik
- School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, USA
- Yi Bu
- Department of Information Management, Peking University, Beijing, China
- Chongyan Chen
- School of Information, University of Texas at Austin, Austin, TX, USA
- Islam Akef Ebeid
- School of Information, University of Texas at Austin, Austin, TX, USA
- Daifeng Li
- School of Information Management, Sun Yat-sen University, Guangzhou, China
- Ying Ding
- Dell Medical School, University of Texas at Austin, Austin, TX, USA; School of Information, University of Texas at Austin, Austin, TX, USA
8
Yi C, Huang C, Wang H, Wang C, Dong L, Gu X, Feng X, Chen B. Association study between CYP24A1 gene polymorphisms and cancer risk. Pathol Res Pract 2019; 216:152735. [PMID: 31740231 DOI: 10.1016/j.prp.2019.152735] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 08/17/2019] [Revised: 10/23/2019] [Accepted: 11/10/2019] [Indexed: 02/06/2023]
Abstract
CYP24A1, an essential gene in the regulation of vitamin D, has been reported to play an important role in enhancing immune activity and inhibiting tumorigenesis. Previous studies proposed that rs2585428, rs4809960, rs6022999 and rs6068816 in the CYP24A1 gene might be greatly associated with cancer risk. To validate the findings, we here investigated the associations of these four polymorphisms and colorectal cancer (CRC) risk in a central Chinese population (426 colon cancer patients, 361 rectal cancer patients and 800 healthy controls). The genotyping was conducted by polymerase chain reaction-restriction fragment length polymorphism (PCR-RFLP) and confirmed by sequencing. Our results revealed that rs4809960 and rs6022999 were strongly associated with the CRC risk, especially with the colon cancer risk. Moreover, the analysis of haplotypes consisting of rs2585428(G > A), rs4809960(T > C), rs6022999(A > G) and rs6068816(C > T) indicated that haplotype ATGC significantly decreased the CRC risk, especially the colon cancer risk. Haplotype GCAT significantly increased the CRC risk, especially the rectal cancer risk. However, haplotype ACAC was only found to be associated with increased risk of CRC. To improve the statistical strength, an updated meta-analysis was further performed. The results showed that rs2585428 was associated with cancer risk in Caucasian population, rs4809960 was associated with breast cancer risk in Caucasian population, and rs6022999 was associated with cancer risk in Asian population. Collectively, rs4809960 and rs6022999 may be genetic biomarkers for prediction of colon cancer risk in Chinese population; rs2585428 and rs6022999 may be linked to cancer susceptibility in Caucasian population and in Asian population, respectively.
Affiliation(s)
- Can Yi
- Department of Biological Science and Technology, School of Chemistry, Chemical Engineering and Life Sciences, Wuhan University of Technology, Wuhan, China
- Chao Huang
- Department of Biological Science and Technology, School of Chemistry, Chemical Engineering and Life Sciences, Wuhan University of Technology, Wuhan, China
- Huan Wang
- Department of Biological Science and Technology, School of Chemistry, Chemical Engineering and Life Sciences, Wuhan University of Technology, Wuhan, China
- Chen Wang
- Department of Biological Science and Technology, School of Chemistry, Chemical Engineering and Life Sciences, Wuhan University of Technology, Wuhan, China
- Lijuan Dong
- Department of Biological Science and Technology, School of Chemistry, Chemical Engineering and Life Sciences, Wuhan University of Technology, Wuhan, China
- Xiuli Gu
- Center of Reproductive Medicine, Tongji Medical College, Huazhong University of Science and Technology, China; Department of Reproductive Genetics, Wuhan Tongji Reproductive Medicine Hospital, Wuhan, China
- Xianhong Feng
- Clinical Laboratory, Wuhan Xinzhou District People's Hospital, Wuhan, China
- Bifeng Chen
- Department of Biological Science and Technology, School of Chemistry, Chemical Engineering and Life Sciences, Wuhan University of Technology, Wuhan, China
9
Galea D, Laponogov I, Veselkov K. Exploiting and assessing multi-source data for supervised biomedical named entity recognition. Bioinformatics 2019. [PMID: 29538614 PMCID: PMC6041968 DOI: 10.1093/bioinformatics/bty152] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 12/13/2022] Open
Abstract
Motivation Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent on annotated training corpora. These approaches have been shown to perform well when trained and tested on the same source. However, in such a scenario, the performance and evaluation of these models may be optimistic, as such models may not necessarily generalize to independent corpora, resulting in potential non-optimal entity recognition for large-scale tagging of widely diverse articles in databases such as PubMed. Results Here we aggregated published corpora for the recognition of biomolecular entities (such as genes, RNA, proteins, variants, drugs and metabolites), identified entity class overlap and performed a leave-corpus-out cross-validation strategy to test the efficiency of existing models. We demonstrate that accuracies of models trained on individual corpora decrease substantially for recognition of the same biomolecular entity classes in independent corpora. This behavior is possibly due to limited generalizability of entity-class-related features captured by individual corpora (model 'overtraining') which we investigated further at the orthographic level, as well as potential annotation standard differences. We show that the combined use of multi-source training corpora results in overall more generalizable models for named entity recognition, while achieving comparable individual performance. By performing learning-curve-based power analysis we further identified that performance is often not limited by the quantity of the annotated data. Availability and implementation Compiled primary and secondary sources of the aggregated corpora are available on: https://github.com/dterg/biomedical_corpora/wiki and https://bitbucket.org/iAnalytica/bioner.
Supplementary information Supplementary data are available at Bioinformatics online.
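The leave-corpus-out idea named in the abstract above can be sketched in a few lines of Python: each corpus is held out once for testing while all remaining corpora are pooled for training. The function name `leave_corpus_out`, the corpus names and the placeholder documents are hypothetical; this is an illustration of the general splitting scheme, not the authors' code.

```python
from typing import Dict, Iterator, List, Tuple


def leave_corpus_out(
    corpora: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, training_docs, test_docs) once per corpus."""
    for held_out in corpora:
        # Pool every document that does not come from the held-out corpus.
        train = [
            doc
            for name, docs in corpora.items()
            if name != held_out
            for doc in docs
        ]
        yield held_out, train, list(corpora[held_out])


# Hypothetical corpora with placeholder documents, for illustration only.
corpora = {
    "corpus_a": ["a1", "a2"],
    "corpus_b": ["b1"],
    "corpus_c": ["c1", "c2"],
}
splits = list(leave_corpus_out(corpora))
for name, train, test in splits:
    print(f"held out {name}: {len(train)} training docs, {len(test)} test docs")
```

A model would be trained on `train` and scored on `test` in each round, so every score reflects performance on a corpus the model never saw.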
Affiliation(s)
- Dieter Galea
- Computational and Systems Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK
- Ivan Laponogov
- Computational and Systems Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK
- Kirill Veselkov
- Computational and Systems Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK
10
Tawfik NS, Spruit MR. The SNPcurator: literature mining of enriched SNP-disease associations. Database (Oxford) 2018; 2018:4925332. [PMID: 29688369 PMCID: PMC5844215 DOI: 10.1093/database/bay020] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 02/18/2017] [Accepted: 02/05/2018] [Indexed: 01/08/2023]
Abstract
The uniqueness of each human genetic structure motivated the shift from the current practice of medicine to a more tailored one. This personalized medicine revolution would not be possible today without the genetics data collected from genome-wide association studies (GWASs) that investigate the relation between different phenotypic traits and single-nucleotide polymorphisms (SNPs). The huge increase in the literature publication space imposes a challenge on the conventional manual curation process which is becoming more and more expensive. This research aims at automatically extracting SNP associations of any given disease and its reported statistical significance (P-value) and odds ratio as well as cohort information such as size and ethnicity. Our evaluation illustrates that SNPcurator was able to replicate a large number of SNP-disease associations that were also reported in the NHGRI-EBI Catalog of published GWASs. SNPcurator was also tested by eight external genetics experts, who queried the system to examine diseases of their choice, and was found to be efficient and satisfactory. We conclude that the text-mining-based system has a great potential for helping researchers and scientists, especially in their preliminary genetics research. SNPcurator is publicly available at http://snpcurator.science.uu.nl/. Database URL: http://snpcurator.science.uu.nl/
Affiliation(s)
- Noha S Tawfik
- Computer Engineering Department, College of Engineering, Arab Academy for Science, Technology, and Maritime Transport (AAST), Abukir, 1029 Alexandria, Egypt; Department of Information and Computing Sciences, Utrecht University, Princetonplein 5, 3584 CC Utrecht, The Netherlands
- Marco R Spruit
- Department of Information and Computing Sciences, Utrecht University, Princetonplein 5, 3584 CC Utrecht, The Netherlands
11
Wei CH, Phan L, Feltz J, Maiti R, Hefferon T, Lu Z. tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine. Bioinformatics 2018; 34:80-87. [PMID: 28968638 DOI: 10.1093/bioinformatics/btx541] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 05/17/2017] [Accepted: 08/31/2017] [Indexed: 11/12/2022] Open
Abstract
Motivation Despite significant efforts in expert curation, clinical relevance about most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the variant biological function/disease impact is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques but are of limited use because their mutation extraction results are not standardized or integrated with curated data. Results We propose an automatic method to extract and normalize variant mentions to unique identifiers (dbSNP RSIDs). Our method, in benchmarking results, demonstrates a high F-measure of ∼90% and compared favorably to the state of the art. Next, we applied our approach to the entire PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes which are presumed to be deleterious and not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating/prioritizing variants in genomic research. Availability and implementation The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/. Contact zhiyong.lu@nih.gov.
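As a rough illustration of the simplest case of the task described above, spotting explicit dbSNP RS identifiers in text, the following Python sketch uses a single regular expression. It is not tmVar's method, which the abstract does not detail and which also handles variant mentions that are not written as rs numbers; the pattern, the function name `find_rsids` and the sample sentence are assumptions made for this example.

```python
import re
from typing import List

# Assumed pattern: "rs" followed by digits, matched as a whole token.
RSID_PATTERN = re.compile(r"\brs\d+\b", re.IGNORECASE)


def find_rsids(text: str) -> List[str]:
    """Return unique rs identifiers in order of first appearance, lowercased."""
    seen = {}
    for match in RSID_PATTERN.finditer(text):
        # dict keys keep insertion order, so duplicates are dropped in place.
        seen.setdefault(match.group(0).lower(), None)
    return list(seen)


# Hypothetical sentence for illustration only.
sample = (
    "Variants rs2585428, rs4809960 and RS6022999 were genotyped; "
    "rs4809960 was significant."
)
rsids = find_rsids(sample)
print(rsids)
```

Lowercasing the matches is a design choice made here so that `RS6022999` and `rs6022999` count as the same identifier.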
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Lon Phan
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Juliana Feltz
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Rama Maiti
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Tim Hefferon
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
12
|
Xie T, Akbar S, Stathopoulou MG, Oster T, Masson C, Yen FT, Visvikis-Siest S. Epistatic interaction of apolipoprotein E and lipolysis-stimulated lipoprotein receptor genetic variants is associated with Alzheimer's disease. Neurobiol Aging 2018; 69:292.e1-292.e5. [PMID: 29858039 DOI: 10.1016/j.neurobiolaging.2018.04.013] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 11/13/2017] [Revised: 04/25/2018] [Accepted: 04/25/2018] [Indexed: 01/19/2023]
Abstract
The ε4 allele of the apolipoprotein E (APOE) gene common polymorphism is the strongest genetic risk factor for Alzheimer's disease (AD). Human APOE gene is located on chromosome 19q13.1, a region linked to AD that also includes the LSR gene, which encodes the lipolysis-stimulated lipoprotein receptor (LSR). As an APOE receptor, LSR is involved in the regulation of lipid homeostasis in both periphery and brain. This study aimed to determine the potential interactions between 2 LSR genetic variants, rs34259399 and rs916147, and the APOE common polymorphism in 142 AD subjects (mean age: 73.16 ± 8.50 years) and 63 controls (mean age: 70.41 ± 8.49 years). A significant epistatic interaction was observed between APOE and both LSR variants, rs34259399 (beta = -0.95; p = 2 × 10⁻⁵) and rs916147 (beta = -0.83; p = 6.8 × 10⁻³). Interestingly, the interaction of LSR polymorphisms with APOE non-ε4 alleles increased AD risk. This indicates the existence of complex molecular interactions between these 2 neighboring genes involved in the pathogenesis of AD, which merits further investigation.
Affiliation(s)
- Ting Xie
  - UMR INSERM U1122; Université de Lorraine, Inserm, IGE-PCV, Nancy, France
- Samina Akbar
  - UMR INSERM U1122; Université de Lorraine, Inserm, IGE-PCV, Nancy, France
- Thierry Oster
  - EA3998 INRA USC 0340 UR AFPA, Université de Lorraine, 2 ave de la Forêt de Haye, Vandœuvre-lès-Nancy, France
- Christine Masson
  - UMR INSERM U1122; Université de Lorraine, Inserm, IGE-PCV, Nancy, France
- Frances T Yen
  - EA3998 INRA USC 0340 UR AFPA, Université de Lorraine, 2 ave de la Forêt de Haye, Vandœuvre-lès-Nancy, France
- Sophie Visvikis-Siest
  - UMR INSERM U1122; Université de Lorraine, Inserm, IGE-PCV, Nancy, France; Department of Internal Medicine and Geriatrics, CHU Nancy-Brabois, Nancy, France
|
13
|
Kohailan M, Alanazi M, Rouabhia M, Al Amri A, Parine NR, Semlali A. Two SNPs in the promoter region of Toll-like receptor 4 gene are not associated with smoking in Saudi Arabia. Onco Targets Ther 2017; 10:745-752. [PMID: 28223830 PMCID: PMC5308598 DOI: 10.2147/ott.s111971] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 01/24/2023] Open
Abstract
Defects in the innate immune system, particularly in Toll-like receptors (TLRs), have been reported in several cigarette smoke-promoted diseases. The aim of this study was to examine the impact of tobacco smoke on allelic frequencies of TLR4 single-nucleotide polymorphisms (SNPs) and to compare the genotypic distribution of these SNPs in a Saudi Arabian population with that in previously studied populations. DNA was extracted from 303 saliva samples collected from smokers and nonsmokers. Two transitional SNPs in the promoter region of TLR4 were selected, rs2770150 (T/C) and rs10759931 (G/A). Genotype frequencies were determined using quantitative polymerase chain reaction. Our results showed a slight effect of smoking on the distribution of rs2770150 and rs10759931. However, the differences were not significant. Thus, we conclude that the SNPs selected for this study were independent of smoking and may not be related to smoking-induced diseases.
Affiliation(s)
- Muhammad Kohailan
  - Department of Biochemistry, College of Science, King Saud University, Riyadh, Kingdom of Saudi Arabia
- Mohammad Alanazi
  - Department of Biochemistry, College of Science, King Saud University, Riyadh, Kingdom of Saudi Arabia
- Mahmoud Rouabhia
  - Groupe de Recherche en Écologie Buccale, Département de Stomatologie, Faculté de Médecine Dentaire, Université Laval, Québec, QC, Canada
- Abdullah Al Amri
  - Department of Biochemistry, College of Science, King Saud University, Riyadh, Kingdom of Saudi Arabia
- Narasimha Reddy Parine
  - Department of Biochemistry, College of Science, King Saud University, Riyadh, Kingdom of Saudi Arabia
- Abdelhabib Semlali
  - Department of Biochemistry, College of Science, King Saud University, Riyadh, Kingdom of Saudi Arabia
|
14
|
Computational Analysis of Damaging Single-Nucleotide Polymorphisms and Their Structural and Functional Impact on the Insulin Receptor. BioMed Research International 2016; 2016:2023803. [PMID: 27840822 PMCID: PMC5093252 DOI: 10.1155/2016/2023803] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 07/17/2016] [Accepted: 09/14/2016] [Indexed: 12/31/2022]
Abstract
Single-nucleotide polymorphisms (SNPs) associated with complex disorders can create, destroy, or modify protein coding sites. Single amino acid substitutions in the insulin receptor (INSR) are the most common forms of genetic variations that account for various diseases like Donohue syndrome or Leprechaunism, Rabson-Mendenhall syndrome, and type A insulin resistance. We analyzed the deleterious nonsynonymous SNPs (nsSNPs) in the INSR gene based on different computational methods. Analysis of INSR was initiated with PROVEAN followed by PolyPhen and I-Mutant servers to investigate the effects of 57 nsSNPs retrieved from the database of SNP (dbSNP). A total of 18 mutations that were found to exert damaging effects on the INSR protein structure and function were chosen for further analysis. Among these mutations, our computational analysis suggested that 13 nsSNPs decreased protein stability and might have resulted in loss of function. Therefore, the probability of their involvement in disease predisposition increases. In the absence of adequate prior reports on the possible deleterious effects of nsSNPs, we have systematically analyzed and characterized the functional variants in the coding region that can alter the expression and function of the INSR gene. In silico characterization of nsSNPs affecting INSR gene function can aid in better understanding of genetic differences in disease susceptibility.
|
15
|
Verspoor KM, Heo GE, Kang KY, Song M. Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts. BMC Med Inform Decis Mak 2016; 16 Suppl 1:68. [PMID: 27454860 PMCID: PMC4959367 DOI: 10.1186/s12911-016-0294-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems. METHODS In this paper, we focus on assessing performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for extraction of relations important for extraction of genetic variant information from the literature. We test the application of the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for information extraction of entities and relations in text, on the relation extraction task using this corpus. RESULTS For the relations which are attested at least 100 times in the Variome corpus, we realise a performance ranging from 0.78-0.84 Precision-weighted F-score, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora, although the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations. CONCLUSIONS This work represents the first attempt to apply relation extraction methods to the Variome corpus. 
The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.
Affiliation(s)
- Karin M Verspoor
  - Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Go Eun Heo
  - Department of Library and Information Science, Yonsei University, Seoul, Korea
- Keun Young Kang
  - Department of Library and Information Science, Yonsei University, Seoul, Korea
- Min Song
  - Department of Library and Information Science, Yonsei University, Seoul, Korea
|
16
|
Yu W, Li Y, Wang Z, Liu L, Liu J, Ding F, Zhang X, Cheng Z, Chen P, Dou J. Transcriptomic changes in human renal proximal tubular cells revealed under hypoxic conditions by RNA sequencing. Int J Mol Med 2016; 38:894-902. [PMID: 27432315 DOI: 10.3892/ijmm.2016.2677] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 11/24/2015] [Accepted: 07/07/2016] [Indexed: 11/05/2022] Open
Abstract
Chronic hypoxia often occurs among patients with chronic kidney disease (CKD). Renal proximal tubular cells may be the primary target of a hypoxic insult. However, the underlying transcriptional mechanisms remain undefined. In this study, we revealed the global changes in gene expression in HK‑2 human renal proximal tubular cells under hypoxic and normoxic conditions. We analyzed the transcriptome of HK‑2 cells exposed to hypoxia for 24 h using RNA sequencing. A total of 279 differentially expressed genes were examined, as these genes could potentially explain the differences in HK‑2 cells between hypoxic and normoxic conditions. Moreover, 17 genes were validated by qPCR, and the results were highly concordant with the RNA sequencing results. Gene Ontology and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed to better understand the functions of these differentially expressed genes. The upregulated genes appeared to be significantly enriched in the pathway of extracellular matrix (ECM)-receptor interaction, and in particular, the pathway of renal cell carcinoma was upregulated under hypoxic conditions. The downregulated genes were enriched in the signaling pathway related to antigen processing and presentation; however, the pathway of glutathione metabolism was downregulated. Our analysis revealed numerous novel transcripts and alternative splicing events. Simultaneously, we also identified a large number of single nucleotide polymorphisms, which will be a rich resource for future marker development. On the whole, our data indicate that transcriptome analysis provides valuable information for a more in-depth understanding of the molecular mechanisms in CKD and renal cell carcinoma.
Affiliation(s)
- Wenmin Yu
  - Medical School of Southeast University, Nanjing, Jiangsu 210009, P.R. China
- Yiping Li
  - Medical School of Southeast University, Nanjing, Jiangsu 210009, P.R. China
- Zhi Wang
  - Medical School of Southeast University, Nanjing, Jiangsu 210009, P.R. China
- Lei Liu
  - Medical School of Southeast University, Nanjing, Jiangsu 210009, P.R. China
- Jing Liu
  - Medical School of Southeast University, Nanjing, Jiangsu 210009, P.R. China
- Fengan Ding
  - Medical School of Southeast University, Nanjing, Jiangsu 210009, P.R. China
- Xiaoyi Zhang
  - Medical School of Southeast University, Nanjing, Jiangsu 210009, P.R. China
- Zhengyuan Cheng
  - Medical School of Southeast University, Nanjing, Jiangsu 210009, P.R. China
- Pingsheng Chen
  - Medical School of Southeast University, Nanjing, Jiangsu 210009, P.R. China
- Jun Dou
  - Medical School of Southeast University, Nanjing, Jiangsu 210009, P.R. China
|
17
|
Thomas P, Rocktäschel T, Hakenberg J, Lichtblau Y, Leser U. SETH detects and normalizes genetic variants in text. Bioinformatics 2016; 32:2883-5. [PMID: 27256315 DOI: 10.1093/bioinformatics/btw234] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 12/28/2015] [Accepted: 04/18/2016] [Indexed: 11/14/2022] Open
Abstract
Descriptions of genetic variations and their effect are widely spread across the biomedical literature. However, finding all mentions of a specific variation, or all mentions of variations in a specific gene, is difficult to achieve due to the many ways such variations are described. Here, we describe SETH, a tool for the recognition of variations from text and their subsequent normalization to dbSNP or UniProt. SETH achieves high precision and recall on several evaluation corpora of PubMed abstracts. It is freely available and encompasses stand-alone scripts for isolated application and evaluation as well as thorough documentation for integration into other applications. AVAILABILITY AND IMPLEMENTATION SETH is released under the Apache 2.0 license and can be downloaded from http://rockt.github.io/SETH/ CONTACT: thomas@informatik.hu-berlin.de or leser@informatik.hu-berlin.de.
Affiliation(s)
- Philippe Thomas
  - Language Technology Lab, DFKI Berlin, Germany; Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität Zu Berlin, Unter Den Linden 6, Berlin 10099, Germany
- Jörg Hakenberg
  - Illumina, Inc, 451 El Camino Real, Santa Clara, CA 95050, USA
- Yvonne Lichtblau
  - Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität Zu Berlin, Unter Den Linden 6, Berlin 10099, Germany
- Ulf Leser
  - Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität Zu Berlin, Unter Den Linden 6, Berlin 10099, Germany
|
18
|
Mahmood ASMA, Wu TJ, Mazumder R, Vijay-Shanker K. DiMeX: A Text Mining System for Mutation-Disease Association Extraction. PLoS One 2016; 11:e0152725. [PMID: 27073839 PMCID: PMC4830514 DOI: 10.1371/journal.pone.0152725] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Received: 04/17/2015] [Accepted: 03/19/2016] [Indexed: 11/22/2022] Open
Abstract
The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation to disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has also been evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied to a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases.
Affiliation(s)
- A. S. M. Ashique Mahmood
  - Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
- Tsung-Jung Wu
  - Department of Biochemistry and Molecular Medicine, George Washington University, Washington, District of Columbia, United States of America
- Raja Mazumder
  - Department of Biochemistry and Molecular Medicine, George Washington University, Washington, District of Columbia, United States of America
  - McCormick Genomic and Proteomic Center, George Washington University, Washington, District of Columbia, United States of America
- K. Vijay-Shanker
  - Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
|
19
|
Su J, Su J, Shang X, Wan Q, Chen X, Rao Y. SNP detection of TLR8 gene, association study with susceptibility/resistance to GCRV and regulation on mRNA expression in grass carp, Ctenopharyngodon idella. Fish & Shellfish Immunology 2015; 43:1-12. [PMID: 25514376 DOI: 10.1016/j.fsi.2014.12.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 09/12/2014] [Revised: 11/17/2014] [Accepted: 12/06/2014] [Indexed: 05/10/2023]
Abstract
Toll-like receptor 8 (TLR8), a prototypical intracellular member of the TLR family, is generally linked closely to antiviral innate immunity through recognition of viral nucleic acid. In this study, the 5'-flanking region of Ctenopharyngodon idella TLR8 (CiTLR8), 671bp in length, was amplified, and eight SNPs, comprising one SNP in the intron, three SNPs in the coding region (CDS) and four SNPs in the 3'-untranslated region (UTR), were identified and characterized. Of these, 4062 A/T was significantly associated with susceptibility/resistance to GCRV in both genotype and allele (P < 0.05), while 4168 C/T was highly significantly associated with it (P < 0.01) according to the case (susceptibility)-control (resistance) analysis. Following the verification experiment, further analyses of mRNA expression, linkage disequilibrium (LD), haplotype and microRNA (miRNA) target site indicated that 4062 A/T and 4168 C/T in the 3'-UTR might affect miRNA regulation, while the antiviral effects of 4062 A/T might rely on its interaction with other SNPs. Additionally, the high density of SNPs in the 3'-UTR might reflect the specific biological functions of the 3'-UTR. Also, the 747 A/G mutation in the intron, which changes the potential transcription factor-binding sites (TFBS) nearby, might affect the expression of CiTLR8 transcriptionally or post-transcriptionally. Moreover, as predicted, the A/G transition of the only non-synonymous SNP (3846 A/G) in the CDS, causing a threonine/alanine variation, could shorten the length of the α-helix and ultimately affect the integrity of the Toll-IL-1 receptor (TIR) domain. The functional mechanism of 3846 A/G might also involve threonine phosphorylation signaling. This study may broaden the knowledge of TLR polymorphisms, lay the foundation for further functional research of CiTLR8 and provide potential markers as well as a theoretical basis for resistance molecular breeding of grass carp against GCRV.
Affiliation(s)
- Juanjuan Su
  - College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China
- Jianguo Su
  - College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China
- Xueying Shang
  - College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China
- Quanyuan Wan
  - College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China
- Xiaohui Chen
  - College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China
- Youliang Rao
  - College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China
|
20
|
Macintyre G, Jimeno Yepes A, Ong CS, Verspoor K. Associating disease-related genetic variants in intergenic regions to the genes they impact. PeerJ 2014; 2:e639. [PMID: 25374782 PMCID: PMC4217187 DOI: 10.7717/peerj.639] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 05/23/2014] [Accepted: 10/07/2014] [Indexed: 11/20/2022] Open
Abstract
We present a method to assist in interpretation of the functional impact of intergenic disease-associated SNPs that is not limited to search strategies proximal to the SNP. The method builds on two sources of external knowledge: the growing understanding of three-dimensional spatial relationships in the genome, and the substantial repository of information about relationships among genetic variants, genes, and diseases captured in the published biomedical literature. We integrate chromatin conformation capture data (HiC) with literature support to rank putative target genes of intergenic disease-associated SNPs. We demonstrate that this hybrid method outperforms a genomic distance baseline on a small test set of expression quantitative trait loci, as well as either method individually. In addition, we show the potential for this method to uncover relationships between intergenic SNPs and target genes across chromosomes. With more extensive chromatin conformation capture data becoming readily available, this method provides a way forward towards functional interpretation of SNPs in the context of the three dimensional structure of the genome in the nucleus.
Affiliation(s)
- Geoff Macintyre
  - Department of Computing and Information Systems, The University of Melbourne, VIC, Australia
  - Centre for Neural Engineering, The University of Melbourne, VIC, Australia
- Antonio Jimeno Yepes
  - Department of Computing and Information Systems, The University of Melbourne, VIC, Australia
- Cheng Soon Ong
  - Department of Electrical and Electronic Engineering, The University of Melbourne, VIC, Australia
  - Machine Learning Group, NICTA Canberra Research Laboratory, Australia
  - Research School of Computer Science, Australian National University, Australia
- Karin Verspoor
  - Department of Computing and Information Systems, The University of Melbourne, VIC, Australia
  - Health and Biomedical Informatics Centre, The University of Melbourne, VIC, Australia
|
21
|
Beyan T, Aydın Son Y. Incorporation of personal single nucleotide polymorphism (SNP) data into a national level electronic health record for disease risk assessment, part 1: an overview of requirements. JMIR Med Inform 2014; 2:e15. [PMID: 25599712 PMCID: PMC4288081 DOI: 10.2196/medinform.3169] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 12/08/2013] [Revised: 05/25/2014] [Accepted: 07/02/2014] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND Personalized medicine approaches provide opportunities for predictive and preventive medicine. Using genomic, clinical, environmental, and behavioral data, tracking and management of individual wellness is possible. A prolific way to carry this personalized approach into routine practices can be accomplished by integrating clinical interpretations of genomic variations into electronic medical records (EMRs)/electronic health records (EHRs). Today, various central EHR infrastructures have been constituted in many countries of the world including Turkey. OBJECTIVE The objective of this study was to concentrate on incorporating the personal single nucleotide polymorphism (SNP) data into the National Health Information System of Turkey (NHIS-T) for disease risk assessment, and evaluate the performance of various predictive models for prostate cancer cases. We present our work as a miniseries containing three parts: (1) an overview of requirements, (2) the incorporation of SNP into the NHIS-T, and (3) an evaluation of SNP incorporated NHIS-T for prostate cancer. METHODS For the first article of this miniseries, the scientific literature is reviewed and the requirements of SNP data integration into EMRs/EHRs are extracted and presented. RESULTS In the literature, basic requirements of genomic-enabled EMRs/EHRs are listed as incorporating genotype data and its clinical interpretation into EMRs/EHRs, developing accurate and accessible clinicogenomic interpretation resources (knowledge bases), interpreting and reinterpreting of variant data, and immersing of clinicogenomic information into the medical decision processes. In this section, we have analyzed these requirements under the subtitles of terminology standards, interoperability standards, clinicogenomic knowledge bases, defining clinical significance, and clinicogenomic decision support. 
CONCLUSIONS In order to integrate structured genotype and phenotype data into any system, there is a need to determine data components, terminology standards, and identifiers of clinicogenomic information. Also, we need to determine interoperability standards to share information between different information systems of stakeholders, and develop decision support capability to interpret genomic variations based on the knowledge bases via different assessment approaches.
Affiliation(s)
- Timur Beyan
  - Informatics Institute, Department of Health Informatics, Middle East Technical University, Ankara, Turkey
|
22
|
Abstract
Collections of documents annotated with semantic entities and relationships are crucial resources to support development and evaluation of text mining solutions for the biomedical domain. Here I present an overview of 36 corpora and show an analysis of the semantic annotations they contain. Annotations for entity types were classified into six semantic groups and an overview of the semantic entities which can be found in each corpus is shown. Results show that while some semantic entities, such as genes, proteins and chemicals, are consistently annotated in many collections, corpora available for diseases, variations and mutations are still few, in spite of their importance in the biological domain.
Affiliation(s)
- Mariana Neves
  - Hasso-Plattner-Institut, Potsdam Universität, Potsdam, Germany
|
23
|
The Relationship between ALA16VAL Single Gene Polymorphism and Renal Cell Carcinoma. Adv Urol 2014; 2014:932481. [PMID: 24587799 PMCID: PMC3920972 DOI: 10.1155/2014/932481] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 06/29/2013] [Accepted: 12/02/2013] [Indexed: 12/12/2022] Open
Abstract
Objectives. The aim of this study was to investigate the association of RCC and Ala16Val polymorphism in Turkish patients with RCC. Materials and Methods. A total of 41 patients with RCC who underwent radical or partial nephrectomy in our clinic and 50 healthy volunteers living in the same geographic area were included in this study. DNA samples from serum of RCC patients and controls were genotyped for MnSOD polymorphism analysis. Genotype ratios and allele frequencies were compared between the two groups and odds ratios with 95% confidence intervals were calculated statistically. A P value of <0.05 was considered statistically significant. Results. There was a significant difference in the MnSOD genotype distributions between the RCC patients and the controls in terms of Ala/Ala+Ala/Val and Val/Val genotypes (P = 0.039). The Ala/Ala+Ala/Val genotypes were found significantly suspicious for RCC with an OR of 2.64 (95% CI = 1.06–6.69, P = 0.039). In addition, Ala allele was found significantly suspicious for RCC with an OR of 2.26 (95% CI = 1.24–4.12, P = 0.009). Conclusion. Our study indicated that MnSOD Ala16Val polymorphism may be one of the many genetic factors for renal cancer susceptibility in Turkish patients.
|
24
|
Jimeno Yepes A, Verspoor K. Mutation extraction tools can be combined for robust recognition of genetic variants in the literature. F1000Res 2014; 3:18. [PMID: 25285203 PMCID: PMC4176422 DOI: 10.12688/f1000research.3-18.v2] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Accepted: 05/27/2014] [Indexed: 11/20/2022] Open
Abstract
As the cost of genomic sequencing continues to fall, the amount of data being collected and studied for the purpose of understanding the genetic basis of disease is increasing dramatically. Much of the source information relevant to such efforts is available only from unstructured sources such as the scientific literature, and significant resources are expended in manually curating and structuring the information in the literature. As such, there have been a number of systems developed to target automatic extraction of mutations and other genetic variation from the literature using text mining tools. We have performed a broad survey of the existing publicly available tools for extraction of genetic variants from the scientific literature. We consider not just one tool but a number of different tools, individually and in combination, and apply the tools in two scenarios. First, they are compared in an intrinsic evaluation context, where the tools are tested for their ability to identify specific mentions of genetic variants in a corpus of manually annotated papers, the Variome corpus. Second, they are compared in an extrinsic evaluation context based on our previous study of text mining support for curation of the COSMIC and InSiGHT databases. Our results demonstrate that no single tool covers the full range of genetic variants mentioned in the literature. Rather, several tools have complementary coverage and can be used together effectively. In the intrinsic evaluation on the Variome corpus, the combined performance is above 0.95 in F-measure, while in the extrinsic evaluation the combined recall performance is above 0.71 for COSMIC and above 0.62 for InSiGHT, a substantial improvement over the performance of any individual tool. Based on the analysis of these results, we suggest several directions for the improvement of text mining tools for genetic variant extraction from the literature.
Collapse
Affiliation(s)
- Antonio Jimeno Yepes
- National ICT Australia, Victoria Research Laboratory, Melbourne, Australia ; Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Karin Verspoor
- National ICT Australia, Victoria Research Laboratory, Melbourne, Australia; Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
|
25
|
Vívenes M, Castro de Guerra D, Rodríguez-Larralde Á, Arocha-Piñango CL, Guerrero B. Activity and levels of factor XIII in a Venezuelan admixed population: association with rs5985 (Val35Leu) and STR F13A01 polymorphisms. Thromb Res 2012; 130:729-34. [DOI: 10.1016/j.thromres.2012.07.027] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/15/2012] [Revised: 07/19/2012] [Accepted: 07/31/2012] [Indexed: 11/16/2022]
|
26
|
Thomas P, Starlinger J, Vowinkel A, Arzt S, Leser U. GeneView: a comprehensive semantic search engine for PubMed. Nucleic Acids Res 2012; 40:W585-91. [PMID: 22693219 PMCID: PMC3394277 DOI: 10.1093/nar/gks563] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Indexed: 02/05/2023] Open
Abstract
Research results are primarily published in scientific literature and curation efforts cannot keep up with the rapid growth of published literature. The plethora of knowledge remains hidden in large text repositories like MEDLINE. Consequently, life scientists have to spend a great amount of time searching for specific information. The enormous ambiguity among most names of biomedical objects such as genes, chemicals and diseases often produces too large and unspecific search results. We present GeneView, a semantic search engine for biomedical knowledge. GeneView is built upon a comprehensively annotated version of PubMed abstracts and openly available PubMed Central full texts. This semi-structured representation of biomedical texts enables a number of features extending classical search engines. For instance, users may search for entities using unique database identifiers or they may rank documents by the number of specific mentions they contain. Annotation is performed by a multitude of state-of-the-art text-mining tools for recognizing mentions from 10 entity classes and for identifying protein-protein interactions. GeneView currently contains annotations for >194 million entities from 10 classes for ∼21 million citations with 271,000 full text bodies. GeneView can be searched at http://bc3.informatik.hu-berlin.de/.
Affiliation(s)
- Philippe Thomas
- Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany
|