1. Suzuki T, Bono H. A systematic exploration of unexploited genes for oxidative stress in Parkinson's disease. NPJ Parkinsons Dis 2024; 10:160. [PMID: 39154038 PMCID: PMC11330442 DOI: 10.1038/s41531-024-00776-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2024] [Accepted: 08/05/2024] [Indexed: 08/19/2024] Open
Abstract
Human disease-associated gene data are accessible through databases, including the Open Targets Platform, DisGeNET, miRTex, RNADisease, and PubChem. However, missing data entries in such databases are anticipated because of curational errors, biases, and text-mining failures. Additionally, the extensive research on human diseases has led to challenges in registering comprehensive data. The lack of essential data in databases hinders knowledge sharing and should be addressed. Therefore, we propose an analysis pipeline to explore missing entries of unexploited genes in the human disease-associated gene databases. Using this pipeline for genes in Parkinson's disease with oxidative stress revealed two unexploited genes: nuclear protein 1 (NUPR1) and ubiquitin-like with PHD and ring finger domains 2 (UHRF2). This methodology enhances the identification of underrepresented disease-associated genes, facilitating easier access to potential human disease-related functional genes. This study aims to identify unexploited genes for further research and does not include independent experimental validation.
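The core idea of looking for missing database entries can be illustrated with a set difference. The sketch below is only an illustration under stated assumptions, not the authors' pipeline: the function name `unexploited_genes`, the database labels `db_one` and `db_two`, and the placeholder gene symbols are all hypothetical, and the toy data are made up.

```python
from typing import Dict, Iterable, List


def unexploited_genes(candidates: Iterable[str],
                      databases: Dict[str, Iterable[str]]) -> List[str]:
    """Return candidate gene symbols that appear in none of the databases.

    Symbols are compared case-insensitively (upper-cased) and the result
    is sorted so that the output is deterministic.
    """
    known = set()
    for genes in databases.values():
        known |= {g.upper() for g in genes}
    return sorted({c.upper() for c in candidates} - known)


# Toy, made-up inputs: placeholder symbols, hypothetical database labels.
toy_candidates = ["GENE_A", "gene_b", "GENE_C"]
toy_databases = {
    "db_one": ["GENE_A"],
    "db_two": ["GENE_D"],
}
missing = unexploited_genes(toy_candidates, toy_databases)
print(missing)
```

Candidates returned by such a comparison would still need the manual review and follow-up that the abstract describes, since absence from a database does not by itself establish a disease association.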
Affiliation(s)
- Takayuki Suzuki
- Graduate School of Integrated Sciences for Life, Hiroshima University, 3-10-23 Kagamiyama, Higashi-Hiroshima, Hiroshima, 739-0046, Japan
- Hidemasa Bono
- Graduate School of Integrated Sciences for Life, Hiroshima University, 3-10-23 Kagamiyama, Higashi-Hiroshima, Hiroshima, 739-0046, Japan.
- Genome Editing Innovation Center, Hiroshima University, 3-10-23 Kagamiyama, Higashi-Hiroshima, Hiroshima, 739-0046, Japan.
- Database Center for Life Science (DBCLS), Joint Support-Center for Data Science Research, Research Organization of Information and Systems (ROIS), 178-4-4 Wakashiba, Kashiwa, Chiba, 277-0871, Japan.
2. Madan S, Kühnel L, Fröhlich H, Hofmann-Apitius M, Fluck J. Dataset of miRNA-disease relations extracted from textual data using transformer-based neural networks. Database (Oxford) 2024; 2024:baae066. [PMID: 39104284 PMCID: PMC11300841 DOI: 10.1093/database/baae066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2024] [Revised: 06/23/2024] [Accepted: 07/10/2024] [Indexed: 08/07/2024]
Abstract
MicroRNAs (miRNAs) play important roles in post-transcriptional processes and regulate major cellular functions. The abnormal regulation of expression of miRNAs has been linked to numerous human diseases such as respiratory diseases, cancer, and neurodegenerative diseases. Latest miRNA-disease associations are predominantly found in unstructured biomedical literature. Retrieving these associations manually can be cumbersome and time-consuming due to the continuously expanding number of publications. We propose a deep learning-based text mining approach that extracts normalized miRNA-disease associations from biomedical literature. To train the deep learning models, we build a new training corpus that is extended by distant supervision utilizing multiple external databases. A quantitative evaluation shows that the workflow achieves an area under receiver operator characteristic curve of 98% on a holdout test set for the detection of miRNA-disease associations. We demonstrate the applicability of the approach by extracting new miRNA-disease associations from biomedical literature (PubMed and PubMed Central). We have shown through quantitative analysis and evaluation on three different neurodegenerative diseases that our approach can effectively extract miRNA-disease associations not yet available in public databases. Database URL: https://zenodo.org/records/10523046.
Affiliation(s)
- Sumit Madan
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53757 Sankt Augustin, Germany
- Lisa Kühnel
- Knowledge Management, German National Library of Medicine (ZB MED)—Information Centre for Life Sciences, Friedrich-Hirzebruch-Allee 4, Bonn 53115, Germany
- Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Faculty of Technology, Bielefeld University, Postfach 10 01 31, Bielefeld, Nordrhein-Westfalen 33501, Germany
- Holger Fröhlich
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53757 Sankt Augustin, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Friedrich-Hirzebruch-Allee 6, Bonn 53113, Germany
- Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53757 Sankt Augustin, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Friedrich-Hirzebruch-Allee 6, Bonn 53113, Germany
- Juliane Fluck
- Knowledge Management, German National Library of Medicine (ZB MED)—Information Centre for Life Sciences, Friedrich-Hirzebruch-Allee 4, Bonn 53115, Germany
- Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Faculty of Technology, Bielefeld University, Postfach 10 01 31, Bielefeld, Nordrhein-Westfalen 33501, Germany
- Information management, Institute of Geodesy and Geoinformation, University of Bonn, Katzenburgweg 1a, Bonn 53115, Germany
3. Bhasuran B, Manoharan S, Iyyappan OR, Murugesan G, Prabahar A, Raja K. Large Language Models and Genomics for Summarizing the Role of microRNA in Regulating mRNA Expression. Biomedicines 2024; 12:1535. [PMID: 39062108 PMCID: PMC11274411 DOI: 10.3390/biomedicines12071535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2024] [Revised: 06/30/2024] [Accepted: 07/03/2024] [Indexed: 07/28/2024] Open
Abstract
microRNA (miRNA)-messenger RNA (mRNA or gene) interactions are pivotal in various biological processes, including the regulation of gene expression, cellular differentiation, proliferation, apoptosis, and development, as well as the maintenance of cellular homeostasis and pathogenesis of numerous diseases, such as cancer, cardiovascular diseases, neurological disorders, and metabolic conditions. Understanding the mechanisms of miRNA-mRNA interactions can provide insights into disease mechanisms and potential therapeutic targets. However, extracting these interactions efficiently from a huge collection of published articles in PubMed is challenging. In the current study, we annotated a miRNA-mRNA Interaction Corpus (MMIC) and used it for evaluating the performance of a variety of machine learning (ML) models, deep learning-based transformer (DLT) models, and large language models (LLMs) in extracting the miRNA-mRNA interactions mentioned in PubMed. We used the genomics approaches for validating the extracted miRNA-mRNA interactions. Among the ML, DLT, and LLM models, PubMedBERT showed the highest precision, recall, and F-score, with all equal to 0.783. Among the LLM models, the performance of Llama-2 is better when compared to others. Llama 2 achieved 0.56 precision, 0.86 recall, and 0.68 F-score in a zero-shot experiment and 0.56 precision, 0.87 recall, and 0.68 F-score in a three-shot experiment. Our study shows that Llama 2 achieves better recall than ML and DLT models and leaves space for further improvement in terms of precision and F-score.
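The reported F-scores can be checked against the reported precision and recall, assuming "F-score" here means the usual F1 (the harmonic mean of precision and recall). The short sketch below makes that assumption explicit; the function name `f_score` is illustrative and not taken from the paper.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision/recall values as quoted in the abstract above.
zero_shot = f_score(0.56, 0.86)    # Llama 2, zero-shot
three_shot = f_score(0.56, 0.87)   # Llama 2, three-shot
pubmedbert = f_score(0.783, 0.783) # PubMedBERT, precision == recall

print(round(zero_shot, 2), round(three_shot, 2), round(pubmedbert, 3))
```

Under the F1 assumption, both Llama 2 settings round to 0.68 and the PubMedBERT value equals 0.783, which agrees with the figures quoted in the abstract. When precision equals recall, F1 equals that shared value.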
Affiliation(s)
- Balu Bhasuran
- School of Information, Florida State University, Tallahassee, FL 32306, USA;
- Sharanya Manoharan
- Department of Bioinformatics, Stella Maris College, Chennai 600086, Tamil Nadu, India;
- Oviya Ramalakshmi Iyyappan
- Department of Computer Science and Engineering, Amrita School of Computing, Amrita Vishwa Vidyapeetham, Chennai 641112, Tamil Nadu, India;
- Gurusamy Murugesan
- Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Green Fields, Guntur District, Vaddeswaram 522302, Andhra Pradesh, India;
- Archana Prabahar
- Center for Gene Regulation in Health and Disease, Department of Biological, Geological, and Environmental Sciences (BGES), Cleveland State University, Cleveland, OH 44115, USA;
- Kalpana Raja
- Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, USA
4. Yang TH, Yu YH, Wu SH, Chang FY, Tsai HC, Yang YC. DMLS: an automated pipeline to extract the Drosophila modular transcription regulators and targets from massive literature articles. Database (Oxford) 2024; 2024:0. [PMID: 38900628 PMCID: PMC11188685 DOI: 10.1093/database/baae049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 05/10/2024] [Accepted: 05/31/2024] [Indexed: 06/22/2024]
Abstract
Transcription regulation in multicellular species is mediated by modular transcription factor (TF) binding site combinations termed cis-regulatory modules (CRMs). Such CRM-mediated transcription regulation determines the gene expression patterns during development. Biologists frequently investigate CRM transcription regulation on gene expressions. However, the knowledge of the target genes and regulatory TFs participating in the CRMs under study is mostly fragmentary throughout the literature. Researchers need to afford tremendous human resources to fully surf through the articles deposited in biomedical literature databases in order to obtain the information. Although several novel text-mining systems are now available for literature triaging, these tools do not specifically focus on CRM-related literature prescreening, failing to correctly extract the information of the CRM target genes and regulatory TFs from the literature. For this reason, we constructed a supportive auto-literature prescreener called Drosophila Modular transcription-regulation Literature Screener (DMLS) that achieves the following: (i) prescreens articles describing experiments on modular transcription regulation, (ii) identifies the described target genes and TFs of the CRMs under study for each modular transcription-regulation-describing article and (iii) features an automated and extendable pipeline to perform the task. We demonstrated that the final performance of DMLS in extracting the described target gene and regulatory TF lists of CRMs under study for given articles achieved test macro area under the ROC curve (auROC) = 89.7% and area under the precision-recall curve (auPRC) = 77.6%, outperforming the intuitive gene name-occurrence-counting method by at least 19.9% in auROC and 30.5% in auPRC. The web service and the command line versions of DMLS are available at https://cobis.bme.ncku.edu.tw/DMLS/ and https://github.com/cobisLab/DMLS/, respectively. 
Database Tool URL: https://cobis.bme.ncku.edu.tw/DMLS/.
Affiliation(s)
- Tzu-Hsien Yang
- Department of Biomedical Engineering, National Cheng Kung University, No.1, University Road, Tainan 701, Taiwan
- Medical Device Innovation Center, National Cheng Kung University, No.1, University Road, Tainan 701, Taiwan
- Yu-Huai Yu
- Department of Biomedical Engineering, National Cheng Kung University, No.1, University Road, Tainan 701, Taiwan
- Sheng-Hang Wu
- Department of Information Management, National Central University, No. 300, Zhongda RD., Zhongli District, Taoyuan 320, Taiwan
- Fang-Yuan Chang
- Department of Information Management, National Central University, No. 300, Zhongda RD., Zhongli District, Taoyuan 320, Taiwan
- Hsiu-Chun Tsai
- Department of Information Management, National Central University, No. 300, Zhongda RD., Zhongli District, Taoyuan 320, Taiwan
- Ya-Chiao Yang
- Institute of Information Management, National Yang Ming Chiao Tung University, No. 1001, Daxue Rd., East Dist., Hsinchu 300093, Taiwan
5. Literature Mining of Disease Associated Noncoding RNA in the Omics Era. Molecules 2022; 27:molecules27154710. [PMID: 35897884 PMCID: PMC9331993 DOI: 10.3390/molecules27154710] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/16/2022] [Revised: 07/20/2022] [Accepted: 07/22/2022] [Indexed: 02/01/2023] Open
Abstract
Noncoding RNAs (ncRNAs) are transcripts without protein-coding potential that play fundamental regulatory roles in diverse cellular processes and diseases. The application of deep sequencing experiments in ncRNA research has generated massive omics datasets, which require rapid examination, interpretation and validation based on existing knowledge resources. Thus, text-mining methods have been increasingly adapted for automatic extraction of relations between an ncRNA and its target or a disease condition from biomedical literature. These bioinformatics tools can also assist in more complex research, such as database curation of candidate ncRNAs and hypothesis generation with respect to pathophysiological mechanisms. In this concise review, we first introduced basic concepts and workflow of literature mining systems. Then, we compared available bioinformatics tools tailored for ncRNA studies, including the tasks, applicability, and limitations. Their powerful utilities and flexibility are demonstrated by examples in a variety of diseases, such as Alzheimer’s disease, atherosclerosis and cancers. Finally, we outlined several challenges from the viewpoints of both system developers and end users. We concluded that the application of text-mining techniques will boost disease-associated ncRNA discoveries in the biomedical literature and enable integrative biology in the current omics era.
6. Begum Y. Regulatory role of microRNAs (miRNAs) in the recent development of abiotic stress tolerance of plants. Gene 2022; 821:146283. [PMID: 35143944 DOI: 10.1016/j.gene.2022.146283] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 10/22/2021] [Revised: 01/12/2022] [Accepted: 02/03/2022] [Indexed: 12/21/2022]
Abstract
MicroRNAs (miRNAs) are a distinct group of single-stranded, non-coding, tiny regulatory RNAs approximately 20-24 nucleotides in length. miRNAs negatively influence gene expression at the post-transcriptional level and have evolved considerably in the development of abiotic stress tolerance in a number of model plants and economically important crop species. The present review aims to deliver information on miRNA-mediated regulation of the expression of major genes or transcription factors (TFs), as well as genetic and regulatory pathways. It also covers the adaptive mechanisms involved in plant abiotic stress responses, the prediction and validation of targets, and the computational tools and databases available for plant miRNAs, with a specific focus on their exploration for engineering abiotic stress tolerance in plants. The review considers the regulatory function of miRNAs in plant growth, development, and abiotic stresses; the use of high-throughput sequencing (HTS) technologies to generate large-scale libraries of small RNAs (sRNAs) for conventional screening of known and novel abiotic stress-responsive miRNAs adds complexity to regulatory networks in plants. The discoveries of miRNA-mediated tolerance to multiple abiotic stresses, including salinity, drought, cold, heat stress, nutritional deficiency, UV radiation, oxidative stress, hypoxia, and heavy metal toxicity, are highlighted and discussed in this review.
Affiliation(s)
- Yasmin Begum
- Department of Biophysics, Molecular Biology and Bioinformatics, University of Calcutta, 92, APC Road, Kolkata 700009, West Bengal, India; Center of Excellence in Systems Biology and Biomedical Engineering (TEQIP Phase-III), University of Calcutta, JD-2, Sector III, Salt Lake, Kolkata 700106, West Bengal, India.
7. Park Y, West RA, Pathmendra P, Favier B, Stoeger T, Capes-Davis A, Cabanac G, Labbé C, Byrne JA. Identification of human gene research articles with wrongly identified nucleotide sequences. Life Sci Alliance 2022; 5:e202101203. [PMID: 35022248 PMCID: PMC8807875 DOI: 10.26508/lsa.202101203] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 08/20/2021] [Revised: 12/27/2021] [Accepted: 12/28/2021] [Indexed: 01/01/2023] Open
Abstract
Nucleotide sequence reagents underpin molecular techniques that have been applied across hundreds of thousands of publications. We have previously reported wrongly identified nucleotide sequence reagents in human research publications and described a semi-automated screening tool Seek & Blastn to fact-check their claimed status. We applied Seek & Blastn to screen >11,700 publications across five literature corpora, including all original publications in Gene from 2007 to 2018 and all original open-access publications in Oncology Reports from 2014 to 2018. After manually checking Seek & Blastn outputs for >3,400 human research articles, we identified 712 articles across 78 journals that described at least one wrongly identified nucleotide sequence. Verifying the claimed identities of >13,700 sequences highlighted 1,535 wrongly identified sequences, most of which were claimed targeting reagents for the analysis of 365 human protein-coding genes and 120 non-coding RNAs. The 712 problematic articles have received >17,000 citations, including citations by human clinical trials. Given our estimate that approximately one-quarter of problematic articles may misinform the future development of human therapies, urgent measures are required to address unreliable gene research articles.
Affiliation(s)
- Yasunori Park
- Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
- Rachael A West
- Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
- Children's Cancer Research Unit, Kids Research, The Children's Hospital at Westmead, Westmead, Australia
- Bertrand Favier
- Université Grenoble Alpes, Translationnelle et Innovation en Médecine et Complexité, Grenoble, France
- Thomas Stoeger
- Successful Clinical Response in Pneumonia Therapy Systems Biology Center, Northwestern University, Evanston, IL, USA
- Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL, USA
- Center for Genetic Medicine, Northwestern University School of Medicine, Chicago, IL, USA
- Amanda Capes-Davis
- Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
- CellBank Australia, Children's Medical Research Institute, Westmead, Australia
- Guillaume Cabanac
- Computer Science Department, Institut de Recherche en Informatique de Toulouse, Unité Mixte de Recherche 5505 Centre National de la Recherche Scientifique (CNRS), University of Toulouse, Toulouse, France
- Cyril Labbé
- Université Grenoble Alpes, CNRS, Grenoble INP, Laboratoire d'Informatique de Grenoble, Grenoble, France
- Jennifer A Byrne
- Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
- New South Wales Health Statewide Biobank, New South Wales Health Pathology, Camperdown, Australia
8. Friedrich J, Hammes HP, Krenning G. miRetrieve-an R package and web application for miRNA text mining. NAR Genom Bioinform 2021; 3:lqab117. [PMID: 34988440 PMCID: PMC8696973 DOI: 10.1093/nargab/lqab117] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/10/2021] [Revised: 11/01/2021] [Accepted: 12/03/2021] [Indexed: 12/30/2022] Open
Abstract
microRNAs (miRNAs) regulate gene expression and thereby influence biological processes in health and disease. As a consequence, miRNAs are intensely studied and literature on miRNAs has been constantly growing. While this growing body of literature reflects the interest in miRNAs, it generates a challenge to maintain an overview, and the comparison of miRNAs that may function across diverse disease fields is complex due to this large number of relevant publications. To address these challenges, we designed miRetrieve, an R package and web application that provides an overview on miRNAs. By text mining, miRetrieve can characterize and compare miRNAs within specific disease fields and across disease areas. This overview provides focus and facilitates the generation of new hypotheses. Here, we explain how miRetrieve works and how it is used. Furthermore, we demonstrate its applicability in an exemplary case study and discuss its advantages and disadvantages.
Affiliation(s)
- Julian Friedrich
- Cardiovascular Regenerative Medicine (CAVAREM), Department of Pathology and Medical Biology, University Medical Center Groningen, University of Groningen, Hanzeplein 1 (EA11), 9713 GZ Groningen, The Netherlands
- 5th Medical Department, Section of Endocrinology, Medical Faculty Mannheim, University of Heidelberg, 68167 Mannheim, Germany
- Hans-Peter Hammes
- 5th Medical Department, Section of Endocrinology, Medical Faculty Mannheim, University of Heidelberg, 68167 Mannheim, Germany
- European Center of Angioscience, Medical Faculty Mannheim, University of Heidelberg, 68167 Mannheim, Germany
- Guido Krenning
- Cardiovascular Regenerative Medicine (CAVAREM), Department of Pathology and Medical Biology, University Medical Center Groningen, University of Groningen, Hanzeplein 1 (EA11), 9713 GZ Groningen, The Netherlands
9. Bauer C, Herwig R, Lienhard M, Prasse P, Scheffer T, Schuchhardt J. Large-scale literature mining to assess the relation between anti-cancer drugs and cancer types. J Transl Med 2021; 19:274. [PMID: 34174885 PMCID: PMC8236166 DOI: 10.1186/s12967-021-02941-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2021] [Accepted: 06/13/2021] [Indexed: 12/09/2022] Open
Abstract
Background: There is a huge body of scientific literature describing the relation between tumor types and anti-cancer drugs. The vast amount of scientific literature makes it impossible for researchers and physicians to extract all relevant information manually. Methods: In order to cope with the large amount of literature we applied an automated text mining approach to assess the relations between the 30 most frequent cancer types and 270 anti-cancer drugs. We applied two different approaches, a classical text mining based on named entity recognition and an AI-based approach employing word embeddings. The consistency of literature mining results was validated with 3 independent methods: first, using data from FDA approvals, second, using experimentally measured IC-50 cell line data and third, using clinical patient survival data. Results: We demonstrated that the automated text mining was able to successfully assess the relation between cancer types and anti-cancer drugs. All validation methods showed a good correspondence between the results from literature mining and independent confirmatory approaches. The relations between the most frequent cancer types and the drugs employed for their treatment were visualized in a large heatmap. All results are accessible in an interactive web-based knowledge base using the following link: https://knowledgebase.microdiscovery.de/heatmap. Conclusions: Our approach is able to assess the relations between compounds and cancer types in an automated manner. Both cancer types and compounds could be grouped into different clusters. Researchers can use the interactive knowledge base to inspect the presented results and follow their own research questions, for example the identification of novel indication areas for known drugs. Supplementary Information: The online version contains supplementary material available at 10.1186/s12967-021-02941-z.
Affiliation(s)
- Chris Bauer
- MicroDiscovery GmbH, Marienburger Straße 1, 10405, Berlin, Germany.
- Ralf Herwig
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 63, 14195, Berlin, Germany
- Matthias Lienhard
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 63, 14195, Berlin, Germany
- Paul Prasse
- Department of Informatics, University of Potsdam, August-Bebel-Str. 89, 14482, Potsdam, Germany
- Tobias Scheffer
- Department of Informatics, University of Potsdam, August-Bebel-Str. 89, 14482, Potsdam, Germany
10. Roychowdhury D, Gupta S, Qin X, Arighi CN, Vijay-Shanker K. emiRIT: a text-mining-based resource for microRNA information. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2021; 2021:6287648. [PMID: 34048547 PMCID: PMC8163238 DOI: 10.1093/database/baab031] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/23/2020] [Revised: 03/15/2021] [Accepted: 05/04/2021] [Indexed: 01/18/2023]
Abstract
microRNAs (miRNAs) are essential gene regulators, and their dysregulation often leads to diseases. Easy access to miRNA information is crucial for interpreting generated experimental data, connecting facts across publications and developing new hypotheses built on previous knowledge. Here, we present extracting miRNA Information from Text (emiRIT), a text-mining-based resource, which presents miRNA information mined from the literature through a user-friendly interface. We collected 149,233 miRNA-PubMed ID pairs from Medline between January 1997 and May 2020. emiRIT currently contains ‘miRNA-gene regulation’ (69,152 relations), ‘miRNA-disease (cancer)’ (12,300 relations), ‘miRNA-biological process and pathways’ (23,390 relations) and circulatory ‘miRNAs in extracellular locations’ (3782 relations). Biological entities and their relation to miRNAs were extracted from Medline abstracts using publicly available and in-house developed text-mining tools, and the entities were normalized to facilitate querying and integration. We built a database and an interface to store and access the integrated data, respectively. We provide an up-to-date and user-friendly resource to facilitate access to comprehensive miRNA information from the literature on a large scale, enabling users to navigate through different roles of miRNA and examine them in a context specific to their information needs. To assess our resource’s information coverage, we have conducted two case studies focusing on the target and differential expression information of miRNAs in the context of cancer and a third case study to assess the usage of emiRIT in the curation of miRNA information. Database URL: https://research.bioinformatics.udel.edu/emirit/
Affiliation(s)
- Debarati Roychowdhury
- Department of Computer and Information Sciences, University of Delaware, 101 Smith Hall, 18 Amstel Ave, Newark, DE 19716, USA
- Samir Gupta
- Department of Computer and Information Sciences, University of Delaware, 101 Smith Hall, 18 Amstel Ave, Newark, DE 19716, USA
- Xihan Qin
- Department of Computer and Information Sciences, Center of Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Room 205, Newark, DE 19711, USA
- Cecilia N Arighi
- Department of Computer and Information Sciences, Center of Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Room 205, Newark, DE 19711, USA
- K Vijay-Shanker
- Department of Computer and Information Sciences, University of Delaware, 101 Smith Hall, 18 Amstel Ave, Newark, DE 19716, USA
11. Perera N, Dehmer M, Emmert-Streib F. Named Entity Recognition and Relation Detection for Biomedical Information Extraction. Front Cell Dev Biol 2020; 8:673. [PMID: 32984300 PMCID: PMC7485218 DOI: 10.3389/fcell.2020.00673] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 12/12/2019] [Accepted: 07/02/2020] [Indexed: 12/29/2022] Open
Abstract
The number of scientific publications in the literature is steadily growing, containing our knowledge in the biomedical, health, and clinical sciences. Since there is currently no automatic archiving of the obtained results, much of this information remains buried in textual details not readily available for further usage or analysis. For this reason, natural language processing (NLP) and text mining methods are used for information extraction from such publications. In this paper, we review practices for Named Entity Recognition (NER) and Relation Detection (RD), allowing, e.g., to identify interactions between proteins and drugs or genes and diseases. This information can be integrated into networks to summarize large-scale details on a particular biomedical or clinical problem, which is then amenable for easy data management and further analysis. Furthermore, we survey novel deep learning methods that have recently been introduced for such tasks.
Affiliation(s)
- Nadeesha Perera
- Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Matthias Dehmer
- Department of Mechatronics and Biomedical Computer Science, University for Health Sciences, Medical Informatics and Technology (UMIT), Hall in Tirol, Austria
- College of Artificial Intelligence, Nankai University, Tianjin, China
- Frank Emmert-Streib
- Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Faculty of Medicine and Health Technology, Institute of Biosciences and Medical Technology, Tampere University, Tampere, Finland
12. Nicholson DN, Greene CS. Constructing knowledge graphs and their biomedical applications. Comput Struct Biotechnol J 2020; 18:1414-1428. [PMID: 32637040 PMCID: PMC7327409 DOI: 10.1016/j.csbj.2020.05.017] [Citation(s) in RCA: 76] [Impact Index Per Article: 19.0] [Received: 03/12/2020] [Revised: 05/22/2020] [Accepted: 05/23/2020] [Indexed: 12/31/2022] Open
Abstract
Knowledge graphs can support many biomedical applications. These graphs represent biomedical concepts and relationships in the form of nodes and edges. In this review, we discuss how these graphs are constructed and applied with a particular focus on how machine learning approaches are changing these processes. Biomedical knowledge graphs have often been constructed by integrating databases that were populated by experts via manual curation, but we are now seeing a more robust use of automated systems. A number of techniques are used to represent knowledge graphs, but often machine learning methods are used to construct a low-dimensional representation that can support many different applications. This representation is designed to preserve a knowledge graph's local and/or global structure. Additional machine learning methods can be applied to this representation to make predictions within genomic, pharmaceutical, and clinical domains. We frame our discussion first around knowledge graph construction and then around unifying representational learning techniques and unifying applications. Advances in machine learning for biomedicine are creating new opportunities across many domains, and we note potential avenues for future work with knowledge graphs that appear particularly promising.
Affiliation(s)
- David N. Nicholson
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, United States
- Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Childhood Cancer Data Lab, Alex’s Lemonade Stand Foundation, United States
|
13
|
Wu C, Tong L, Wu C, Chen D, Chen J, Li Q, Jia F, Huang Z. Two miRNA prognostic signatures of head and neck squamous cell carcinoma: A bioinformatic analysis based on the TCGA dataset. Cancer Med 2020; 9:2631-2642. [PMID: 32064753 PMCID: PMC7163094 DOI: 10.1002/cam4.2915] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/10/2019] [Revised: 12/28/2019] [Accepted: 01/27/2020] [Indexed: 02/06/2023] Open
Abstract
MicroRNAs (miRNAs) are dysregulated in a variety of malignant tumors and can act as both oncogenic and tumor-suppressing factors. In the present study, we analyzed the miRNA expression profiles and clinical information of 481 patients with head and neck squamous cell carcinoma (HNSCC) from the TCGA dataset to identify prognostic miRNA signatures. A total of 114 significantly differentially expressed miRNAs (SDEMs) were identified, consisting of 60 upregulated and 54 downregulated miRNAs. Kaplan-Meier survival analysis identified the prognostic value of 2 miRNAs (miR-4652-5p and miR-99a-3p). Univariate and multivariate Cox regression analyses indicated that the 2 miRNAs were significant prognostic factors for HNSCC. Furthermore, 4 online target-prediction tools were used to identify the target genes, and enrichment analysis was performed on the target genes with DAVID. The results showed that the target genes were associated with calcium, as well as cell proliferation, circadian entrainment, EGFR, PI3K-Akt-mTOR, and P53 signaling pathways. Finally, a PPI network was constructed using the STRING database and Cytoscape. Eight hub genes were identified using the CytoHubba and MCODE apps: CBL, SKP1, H2AFX, HGF, POLR2F, UBE2I, VAMP2, and GNAI2. In summary, we identified 2 miRNA signatures, 8 hub genes, and significant signaling pathways for estimating the prognosis of HNSCC. More comprehensive basic and clinical studies are needed to further explore the molecular mechanisms of HNSCC occurrence and development.
Affiliation(s)
- Chaoying Wu
- Department of Otolaryngology Head and Neck Surgery, The First Affiliated Hospital of Jinzhou Medical University, Jinzhou, Liaoning, China
- Lingxia Tong
- Department of Ultrasound, Jilin Cancer Hospital, Changchun, Jilin, China
- Chaoqun Wu
- Department of General Medicine, Ningbo Medical Center Lihuili Hospital, Ningbo, Zhejiang, China
- Dong Chen
- Department of Otolaryngology Head and Neck Surgery, The First Affiliated Hospital of Jinzhou Medical University, Jinzhou, Liaoning, China
- Jianguo Chen
- Department of Otolaryngology Head and Neck Surgery, The First Affiliated Hospital of Jinzhou Medical University, Jinzhou, Liaoning, China
- Qianyun Li
- Department of Otolaryngology Head and Neck Surgery, The First Affiliated Hospital of Jinzhou Medical University, Jinzhou, Liaoning, China
- Fang Jia
- Department of Otolaryngology Head and Neck Surgery, The First Affiliated Hospital of Jinzhou Medical University, Jinzhou, Liaoning, China
- Zirui Huang
- Jinzhou Medical University, Jinzhou, Liaoning, China
|
14
|
Lever J, Jones MR, Danos AM, Krysiak K, Bonakdar M, Grewal JK, Culibrk L, Griffith OL, Griffith M, Jones SJM. Text-mining clinically relevant cancer biomarkers for curation into the CIViC database. Genome Med 2019; 11:78. [PMID: 31796060 PMCID: PMC6891984 DOI: 10.1186/s13073-019-0686-y] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 06/17/2019] [Accepted: 11/07/2019] [Indexed: 01/06/2023] Open
Abstract
BACKGROUND: Precision oncology involves analysis of individual cancer samples to understand the genes and pathways involved in the development and progression of a cancer. To improve patient care, knowledge of diagnostic, prognostic, predisposing, and drug response markers is essential. Several knowledgebases have been created by different groups to collate evidence for these associations. These include the open-access Clinical Interpretation of Variants in Cancer (CIViC) knowledgebase. These databases rely on time-consuming manual curation from skilled experts who read and interpret the relevant biomedical literature. METHODS: To aid in this curation and provide the greatest coverage for these databases, particularly CIViC, we propose the use of text mining approaches to extract these clinically relevant biomarkers from all available published literature. To this end, a group of cancer genomics experts annotated sentences that discussed biomarkers with their clinical associations and achieved good inter-annotator agreement. We then used a supervised learning approach to construct the CIViCmine knowledgebase. RESULTS: We extracted 121,589 relevant sentences from PubMed abstracts and PubMed Central Open Access full-text papers. CIViCmine contains over 87,412 biomarkers associated with 8035 genes, 337 drugs, and 572 cancer types, representing 25,818 abstracts and 39,795 full-text publications. CONCLUSIONS: Through integration with CIViC, we provide a prioritized list of curatable clinically relevant cancer biomarkers as well as a resource that is valuable to other knowledgebases and precision cancer analysts in general. All data is publicly available and distributed with a Creative Commons Zero license. The CIViCmine knowledgebase is available at http://bionlp.bcgsc.ca/civicmine/.
Affiliation(s)
- Jake Lever
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- University of British Columbia, Vancouver, BC, Canada
- Martin R Jones
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Arpad M Danos
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Kilannin Krysiak
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO, USA
- Melika Bonakdar
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Jasleen K Grewal
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- University of British Columbia, Vancouver, BC, Canada
- Luka Culibrk
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- University of British Columbia, Vancouver, BC, Canada
- Obi L Griffith
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO, USA
- Division of Oncology, Department of Medicine, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
- Malachi Griffith
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Siteman Cancer Center, Washington University School of Medicine, St. Louis, MO, USA
- Division of Oncology, Department of Medicine, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
- Steven J M Jones
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- University of British Columbia, Vancouver, BC, Canada
- Simon Fraser University, Burnaby, BC, Canada
|
15
|
Giorgi JM, Bader GD. Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics 2019; 34:4087-4094. [PMID: 29868832 PMCID: PMC6247938 DOI: 10.1093/bioinformatics/bty449] [Citation(s) in RCA: 61] [Impact Index Per Article: 12.2] [Received: 02/12/2018] [Accepted: 05/29/2018] [Indexed: 01/08/2023] Open
Abstract
Motivation: The explosive increase of biomedical literature has made information extraction an increasingly important tool for biomedical research. A fundamental task is the recognition of biomedical named entities in text (BNER) such as genes/proteins, diseases and species. Recently, a domain-independent method based on deep learning and statistical word embeddings, called long short-term memory network-conditional random field (LSTM-CRF), has been shown to outperform state-of-the-art entity-specific BNER tools. However, this method is dependent on gold-standard corpora (GSCs) consisting of hand-labeled entities, which tend to be small but highly reliable. An alternative to GSCs is silver-standard corpora (SSCs), which are generated by harmonizing the annotations made by several automatic annotation systems. SSCs typically contain more noise than GSCs but have the advantage of containing many more training examples. Ideally, these corpora could be combined to achieve the benefits of both, which is an opportunity for transfer learning. In this work, we analyze to what extent transfer learning improves upon state-of-the-art results for BNER. Results: We demonstrate that transferring a deep neural network (DNN) trained on a large, noisy SSC to a smaller, but more reliable GSC significantly improves upon state-of-the-art results for BNER. Compared to a state-of-the-art baseline evaluated on 23 GSCs covering four different entity classes, transfer learning results in an average reduction in error of approximately 11%. We found transfer learning to be especially beneficial for target datasets with a small number of labels (approximately 6000 or fewer). Availability and implementation: Source code for the LSTM-CRF is available at https://github.com/Franck-Dernoncourt/NeuroNER/ and links to the corpora are available at https://github.com/BaderLab/Transfer-Learning-BNER-Bioinformatics-2018/. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- John M Giorgi
- Department of Computer Science, University of Toronto, Toronto, Canada
- The Donnelly Centre, University of Toronto, Toronto, Canada
- Gary D Bader
- Department of Computer Science, University of Toronto, Toronto, Canada
- The Donnelly Centre, University of Toronto, Toronto, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
|
16
|
Computational Resources for Prediction and Analysis of Functional miRNA and Their Targetome. Methods Mol Biol 2019; 1912:215-250. [PMID: 30635896 DOI: 10.1007/978-1-4939-8982-9_9] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Indexed: 12/14/2022]
Abstract
MicroRNAs are evolutionarily conserved, endogenously produced, noncoding RNAs (ncRNAs) of approximately 19-24 nucleotides (nts) in length that are known to silence genes carrying a complementary target sequence. Their deregulated expression has been reported in various disease conditions and thus has therapeutic implications. In the last decade, various computational resources have been published in this field. In this chapter, we review bioinformatics resources, i.e., miRNA-centered databases, algorithms, and tools to predict miRNA targets. The first section lists more than 75 databases, which mainly cover information regarding miRNA registries, targets, disease associations, differential expression, interactions with other noncoding RNAs, and all-in-one resources. In the algorithms section, we have compiled about 140 algorithms from eight subcategories, viz. for the prediction of precursor (pre-) and mature miRNAs. These algorithms are built on various sequence-, structure-, and thermodynamics-based features incorporated into different machine learning techniques (MLTs). In addition, computational identification of miRNAs from high-throughput next generation sequencing (NGS) data and their variants, viz. isomiRs, differential expression, miR-SNPs, and functional annotation, are discussed. Prediction and analysis of miRNAs and their associated targets are also evaluated in the miR-targets section, providing knowledge regarding novel miRNA targets and complex host-pathogen interactions. In conclusion, we provide a comprehensive review of in silico resources published in miRNA research to help the scientific community stay up to date and choose the appropriate tool according to their needs.
|
17
|
Dai HJ, Wang CK, Chang NW, Huang MS, Jonnagaddala J, Wang FD, Hsu WL. Statistical principle-based approach for recognizing and normalizing microRNAs described in scientific literature. Database: The Journal of Biological Databases and Curation 2019; 2019:5365313. [PMID: 30809637 PMCID: PMC6391575 DOI: 10.1093/database/baz030] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 02/28/2018] [Revised: 02/01/2019] [Accepted: 02/06/2019] [Indexed: 01/08/2023]
Abstract
The detection of microRNA (miRNA) mentions in scientific literature enables researchers to find relevant and appropriate literature based on queries formulated using miRNA information. Considering that most published biological studies elaborate on signal transduction pathways or genetic regulatory information in the form of figure captions, the extraction of miRNAs from both the main content and figure captions of a manuscript is useful in aggregate analysis and comparative analysis of the studies published. In this study, we present a statistical principle-based miRNA recognition and normalization method to identify miRNAs and link them to the identifiers in the Rfam database. As one of the core components in the text mining pipeline of the database miRTarBase, the proposed method combines the advantages of previous works relying on pattern, dictionary and supervised learning and provides an integrated solution for the problem of miRNA identification. Furthermore, the knowledge learned from the training data is organized in a human-interpretable manner to explain why the system considers a span of text as a miRNA mention, and the represented knowledge can be further complemented by domain experts. We studied the ambiguity level of miRNA nomenclature to connect the miRNA mentions to the Rfam database and evaluated the performance of our approach on two datasets: the BioCreative VI Bio-ID corpus and the miRNA interaction corpus, extending the latter corpus with additional Rfam normalization information. Our study highlights, and offers a better understanding of, the challenges associated with miRNA identification and normalization in scientific literature and the research gap that needs to be further explored in prospective studies.
Affiliation(s)
- Hong-Jie Dai
- Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan, ROC
- Chen-Kai Wang
- Big Data Laboratories, Chunghwa Telecom Co., Taoyuan, Taiwan, ROC
- Nai-Wen Chang
- Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Ming-Siang Huang
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Jitendra Jonnagaddala
- School of Public Health and Community Medicine, University of New South Wales, Sydney, Australia
- Feng-Duo Wang
- Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan
- Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
|
18
|
Chen T, Wu M, Li H. A general approach for improving deep learning-based medical relation extraction using a pre-trained model and fine-tuning. Database: The Journal of Biological Databases and Curation 2019; 2019:5645655. [PMID: 31800044 PMCID: PMC6892305 DOI: 10.1093/database/baz116] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 03/24/2019] [Revised: 07/16/2019] [Accepted: 09/02/2019] [Indexed: 01/07/2023]
Abstract
The automatic extraction of meaningful relations from biomedical literature or clinical records is crucial in various biomedical applications. Most of the current deep learning approaches for medical relation extraction require large-scale training data to prevent overfitting of the training model. We propose using a pre-trained model and a fine-tuning technique to improve these approaches without additional time-consuming human labeling. Firstly, we show the architecture of Bidirectional Encoder Representations from Transformers (BERT), an approach for pre-training a model on large-scale unstructured text. We then combine BERT with a one-dimensional convolutional neural network (1d-CNN) to fine-tune the pre-trained model for relation extraction. Extensive experiments on three datasets, namely the BioCreative V chemical disease relation corpus, traditional Chinese medicine literature corpus and i2b2 2012 temporal relation challenge corpus, show that the proposed approach achieves state-of-the-art results (giving a relative improvement of 22.2, 7.77, and 38.5% in F1 score, respectively, compared with a traditional 1d-CNN classifier). The source code is available at https://github.com/chentao1999/MedicalRelationExtraction.
Affiliation(s)
- Tao Chen
- Department of Computer Science and Engineering, Faculty of Intelligent Manufacturing, Wuyi University, No.22, Dongcheng village, Pengjiang district, Jiangmen City, Guangdong Province, 529020, China
- Mingfen Wu
- Department of Computer Science and Engineering, Faculty of Intelligent Manufacturing, Wuyi University, No.22, Dongcheng village, Pengjiang district, Jiangmen City, Guangdong Province, 529020, China
- Hexi Li
- Department of Computer Science and Engineering, Faculty of Intelligent Manufacturing, Wuyi University, No.22, Dongcheng village, Pengjiang district, Jiangmen City, Guangdong Province, 529020, China
|
19
|
Gupta A, Ragumani S, Sharma YK, Ahmad Y, Khurana P. Analysis of Hypoxiamir-Gene Regulatory Network Identifies Critical MiRNAs Influencing Cell-Cycle Regulation Under Hypoxic Conditions. Microrna 2019; 8:223-236. [PMID: 30806334 DOI: 10.2174/2211536608666190219094204] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 12/04/2018] [Revised: 01/14/2019] [Accepted: 02/06/2019] [Indexed: 06/09/2023]
Abstract
BACKGROUND: Hypoxia is a pathophysiological condition which arises due to low oxygen concentration in conditions like cardiovascular diseases, inflammation, ascent to higher altitude, malignancies, deep sea diving, prenatal birth, etc. A number of microRNAs (miRNAs), Transcription Factors (TFs) and genes have been studied separately for their role in hypoxic adaptation and controlling cell-cycle progression and apoptosis during this stress. OBJECTIVE: We hypothesize that miRNAs and TFs may act in conjunction to regulate a multitude of genes and play a crucial and combinatorial role during hypoxia-stress-responses and associated cell-cycle control mechanisms. METHOD: We collected a comprehensive and non-redundant list of human hypoxia-responsive miRNAs (also known as hypoxiamiRs). Their experimentally validated gene-targets were retrieved from various databases and a comprehensive hypoxiamiR-gene regulatory network was built. RESULTS: Functional characterization and pathway enrichment of genes identified phospho-proteins as enriched nodes. The phospho-proteins which were localized both in the nucleus and cytoplasm and could potentially play an important role as signaling molecules were selected; and further pathway enrichment revealed that most of them were involved in NFkB signaling. Topological analysis identified several critical hypoxiamiRs and network perturbations confirmed their importance in the network. Feed Forward Loops (FFLs) were identified in the subnetwork of enriched genes, miRNAs and TFs. Statistically significant FFLs consisted of four miRNAs (hsa-miR-182-5p, hsa-miR-146b-5p, hsa-miR-96, hsa-miR-20a) and three TFs (SMAD4, FOXO1, HIF1A) both regulating two genes (NFkB1A and CDKN1A). CONCLUSION: Detailed BioCarta pathway analysis identified that these miRNAs and TFs together play a critical and combinatorial role in regulating cell-cycle under hypoxia, by controlling mechanisms that activate cell-cycle checkpoint protein, CDKN1A. These modules work synergistically to regulate cell-proliferation, cell-growth, cell-differentiation and apoptosis during hypoxia. A detailed mechanistic molecular model of how these co-regulatory FFLs may regulate the cell-cycle transitions during hypoxic stress conditions is also put forth. These biomolecules may play a crucial and deterministic role in deciding the fate of the cell under hypoxic-stress.
Affiliation(s)
- Apoorv Gupta
- Defence Institute of Physiology and Allied Sciences (DIPAS), Defence R&D Organization (DRDO), Timarpur, Delhi-110054, India
- Sugadev Ragumani
- Defence Institute of Physiology and Allied Sciences (DIPAS), Defence R&D Organization (DRDO), Timarpur, Delhi-110054, India
- Yogendra Kumar Sharma
- Defence Institute of Physiology and Allied Sciences (DIPAS), Defence R&D Organization (DRDO), Timarpur, Delhi-110054, India
- Yasmin Ahmad
- Defence Institute of Physiology and Allied Sciences (DIPAS), Defence R&D Organization (DRDO), Timarpur, Delhi-110054, India
- Pankaj Khurana
- Defence Institute of Physiology and Allied Sciences (DIPAS), Defence R&D Organization (DRDO), Timarpur, Delhi-110054, India
|
20
|
Fereshteh Z, Schmidt SA, Al-Dossary AA, Accerbi M, Arighi C, Cowart J, Song JL, Green PJ, Choi K, Yoo S, Martin-DeLeon PA. Murine Oviductosomes (OVS) microRNA profiling during the estrous cycle: Delivery of OVS-borne microRNAs to sperm where miR-34c-5p localizes at the centrosome. Sci Rep 2018; 8:16094. [PMID: 30382141 PMCID: PMC6208369 DOI: 10.1038/s41598-018-34409-4] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 04/04/2018] [Accepted: 10/16/2018] [Indexed: 12/11/2022] Open
Abstract
Oviductosomes (OVS) are nano-sized extracellular vesicles secreted in the oviductal luminal fluid by oviductal epithelial cells and known to be involved in sperm capacitation and fertility. Although they have been shown to transfer encapsulated proteins to sperm, cargo constituents other than proteins have not been identified. Using next-generation sequencing, we demonstrate that OVS are carriers of microRNAs (miRNAs), with 272 detected throughout the estrous cycle. Of the 50 most abundant, 6 (12%) and 2 (4%) were expressed at significantly higher levels (P < 0.05) at metestrus/diestrus and proestrus/estrus. RT-qPCR showed that selected miRNAs are present in oviductal epithelial cells in significantly (P < 0.05) lower abundance than in OVS, indicating selective miRNA packaging. The majority (64%) of the top 25 OVS miRNAs are present in sperm. These miRNAs’ potential target list is enriched with transcription factors, transcription regulators, and protein kinases and there are several embryonic developmentally-related genes. Importantly, OVS can deliver to sperm miRNAs, including miR-34c-5p which is essential for the first cleavage and is solely sperm-derived in the zygote. Z-stack of confocal images of sperm co-incubated with OVS loaded with labeled miRNAs showed the intracellular location of the delivered miRNAs. Interestingly, individual miRNAs were predominantly localized in specific head compartments, with miR-34c-5p being highly concentrated at the centrosome where it is known to function. These results, for the first time, demonstrate OVS’ ability to contribute to the sperm’s miRNA repertoire (an important role for solely sperm-derived zygotic miRNAs) and the physiological relevance of an OVS-borne miRNA that is delivered to sperm.
Affiliation(s)
- Zeinab Fereshteh
- Department of Biological Sciences, University of Delaware, Newark, DE, 19716, USA
- Skye A Schmidt
- Department of Plant and Soil Sciences, Delaware Biotechnology Institute, University of Delaware, Newark, DE, 19711, USA
- Amal A Al-Dossary
- Department of Biological Sciences, University of Delaware, Newark, DE, 19716, USA
- Department of Biology, College of Medicine, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam, 31441, Saudi Arabia
- Monica Accerbi
- Department of Plant and Soil Sciences, Delaware Biotechnology Institute, University of Delaware, Newark, DE, 19711, USA
- Cecilia Arighi
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA
- Julie Cowart
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA
- Jia L Song
- Department of Biological Sciences, University of Delaware, Newark, DE, 19716, USA
- Pamela J Green
- Department of Plant and Soil Sciences, Delaware Biotechnology Institute, University of Delaware, Newark, DE, 19711, USA
- Kyungmin Choi
- A.I. DuPont Hospital for Children, 1600 Rockland Rd, Wilmington, Delaware, 19803, USA
- Soonmoon Yoo
- A.I. DuPont Hospital for Children, 1600 Rockland Rd, Wilmington, Delaware, 19803, USA
|
21
|
Hu Y, Dingerdissen H, Gupta S, Kahsay R, Shanker V, Wan Q, Yan C, Mazumder R. Identification of key differentially expressed MicroRNAs in cancer patients through pan-cancer analysis. Comput Biol Med 2018; 103:183-197. [PMID: 30384176 DOI: 10.1016/j.compbiomed.2018.10.021] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 08/14/2018] [Revised: 10/01/2018] [Accepted: 10/17/2018] [Indexed: 12/16/2022]
Abstract
microRNAs (miRNAs) functioning in gene silencing have been associated with cancer progression. However, common abnormal miRNA expression patterns and their potential roles in cancer have not yet been evaluated. To account for individual differences between patients, we retrieved miRNA sequencing data for 575 patients with both tumor and adjacent non-tumorous tissues from 14 cancer types from The Cancer Genome Atlas (TCGA). We then performed differential expression analysis using DESeq2 and edgeR. Results showed that cancer types can be grouped based on the distribution of miRNAs with different expression patterns between tumor and non-tumor samples. We found 81 significantly differentially expressed miRNAs (SDEmiRNAs) in a single cancer. We also found 21 key SDEmiRNAs (nine over-expressed and 12 under-expressed) associated with at least eight cancers each and enriched in more than 60% of patients per cancer, including four newly identified SDEmiRNAs (hsa-mir-4746, hsa-mir-3648, hsa-mir-3687, and hsa-mir-1269a). The downstream effects of these 21 SDEmiRNAs on cellular function were evaluated through enrichment and pathway analysis of 7186 protein-coding gene targets mined from literature reports of differential expression of miRNAs in cancer. This analysis enables identification of SDEmiRNA functional similarity in cell proliferation control across a wide range of cancers, and assembly of common regulatory networks over cancer-related pathways. These findings were validated by construction of a regulatory network in the PI3K pathway. This study provides evidence for the value of further analysis of SDEmiRNAs as potential biomarkers and therapeutic targets for cancer diagnosis and treatment.
Affiliation(s)
- Yu Hu
- The Department of Biochemistry & Molecular Medicine, The George Washington University Medical Center, Washington, DC, 20037, USA
- Hayley Dingerdissen
- The Department of Biochemistry & Molecular Medicine, The George Washington University Medical Center, Washington, DC, 20037, USA
- Samir Gupta
- Department of Computer and Information Science, University of Delaware, Newark, DE, 19716, USA
- Robel Kahsay
- The Department of Biochemistry & Molecular Medicine, The George Washington University Medical Center, Washington, DC, 20037, USA
- Vijay Shanker
- Department of Computer and Information Science, University of Delaware, Newark, DE, 19716, USA
- Quan Wan
- The Department of Biochemistry & Molecular Medicine, The George Washington University Medical Center, Washington, DC, 20037, USA
- Cheng Yan
- The Department of Biochemistry & Molecular Medicine, The George Washington University Medical Center, Washington, DC, 20037, USA
- Raja Mazumder
- The Department of Biochemistry & Molecular Medicine, The George Washington University Medical Center, Washington, DC, 20037, USA
- The McCormick Genomic and Proteomic Center, The George Washington University, Washington, DC, 20037, USA
|
22
|
Onye SC, Akkeleş A, Dimililer N. relSCAN - A system for extracting chemical-induced disease relation from biomedical literature. J Biomed Inform 2018; 87:79-87. [PMID: 30296491 DOI: 10.1016/j.jbi.2018.09.018] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 06/01/2018] [Revised: 09/17/2018] [Accepted: 09/30/2018] [Indexed: 11/20/2022]
Abstract
This paper proposes an effective and robust approach for Chemical-Induced Disease (CID) relation extraction from PubMed articles. The study was performed on the Chemical Disease Relation (CDR) task of BioCreative V track-3 corpus. The proposed system, named relSCAN, is an efficient CID relation extraction system with two phases to classify relation instances from the Co-occurrence and Non-Co-occurrence mention levels. We describe the case of chemical and disease mentions that occur in the same sentence as 'Co-occurrence', or as 'Non-Co-occurrence' otherwise. In the first phase, the relation instances are constructed on both mention levels. In the second phase, we employ a hybrid feature set to classify the relation instances at both of these mention levels using the combination of two Machine Learning (ML) classifiers (Support Vector Machine (SVM) and J48 Decision tree). This system is entirely corpus dependent and does not rely on information from external resources in order to boost its performance. We achieved good results, which are comparable with the other state-of-the-art CID relation extraction systems on the BioCreative V corpus. Furthermore, our system achieves the best performance on the Non-Co-occurrence mention level.
Affiliation(s)
- Stanley Chika Onye
- Department of Applied Mathematics and Computer Science, Faculty of Arts & Sciences, Eastern Mediterranean University, Famagusta, North Cyprus via Mersin 10, Turkey.
| | - Arif Akkeleş
- Department of Mathematics, Faculty of Arts & Sciences, Eastern Mediterranean University, Famagusta, North Cyprus via Mersin 10, Turkey
| | - Nazife Dimililer
- Department of Information Technology, School of Computing and Technology, Eastern Mediterranean University, Famagusta, North Cyprus via Mersin 10, Turkey
23
24
Chiu AM, Mitra M, Boymoushakian L, Coller HA. Integrative analysis of the inter-tumoral heterogeneity of triple-negative breast cancer. Sci Rep 2018; 8:11807. [PMID: 30087365 PMCID: PMC6081411 DOI: 10.1038/s41598-018-29992-5] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 03/06/2018] [Accepted: 07/18/2018] [Indexed: 02/07/2023] Open
Abstract
Triple-negative breast cancers (TNBC) lack estrogen and progesterone receptors and HER2 amplification, and are resistant to therapies that target these receptors. Tumors from TNBC patients are heterogeneous based on genetic variations, tumor histology, and clinical outcomes. We used high throughput genomic data for TNBC patients (n = 137) from TCGA to characterize inter-tumor heterogeneity. Similarity network fusion (SNF)-based integrative clustering combining gene expression, miRNA expression, and copy number variation, revealed three distinct patient clusters. Integrating multiple types of data resulted in more distinct clusters than analyses with a single datatype. Whereas most TNBCs are classified by PAM50 as basal subtype, one of the clusters was enriched in the non-basal PAM50 subtypes, exhibited more aggressive clinical features and had a distinctive signature of oncogenic mutations, miRNAs and expressed genes. Our analyses provide a new classification scheme for TNBC based on multiple omics datasets and provide insight into molecular features that underlie TNBC heterogeneity.
Affiliation(s)
- Alec M Chiu
- Bioinformatics Interdepartmental Program, University of California, Los Angeles, USA
| | - Mithun Mitra
- Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, USA; Department of Biological Chemistry, David Geffen School of Medicine, University of California, Los Angeles, USA
| | - Lari Boymoushakian
- Department of Computer Science, University of California, Los Angeles, USA
| | - Hilary A Coller
- Bioinformatics Interdepartmental Program, University of California, Los Angeles, USA; Department of Molecular, Cell, and Developmental Biology, University of California, Los Angeles, USA; Department of Biological Chemistry, David Geffen School of Medicine, University of California, Los Angeles, USA.
25
Koido M, Tani Y, Tsukahara S, Okamoto Y, Tomida A. InDePTH: detection of hub genes for developing gene expression networks under anticancer drug treatment. Oncotarget 2018; 9:29097-29111. [PMID: 30018738 PMCID: PMC6044382 DOI: 10.18632/oncotarget.25624] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 03/06/2018] [Accepted: 05/19/2018] [Indexed: 01/17/2023] Open
Abstract
It has been difficult to elucidate the structure of gene regulatory networks under anticancer drug treatment. Here, we developed an algorithm to highlight the hub genes that play a major role in creating the upstream and downstream relationships within a given set of differentially expressed genes. The directionality of the relationships between genes was defined using information from comprehensive collections of transcriptome profiles after gene knockdown and overexpression. As expected, among the drug-perturbed genes, our algorithm tended to derive plausible hub genes, such as transcription factors. Our validation experiments successfully showed the anticipated activity of a certain hub gene in establishing the gene regulatory network that was associated with cell growth inhibition. Notably, such top priority was not given to the hub gene either by ranking fold change in expression or by conventional gene set enrichment analysis of drug-induced transcriptome data. Thus, our data-driven approach can facilitate the understanding of drug-induced gene regulatory networks for finding potential functional genes.
Affiliation(s)
- Masaru Koido
- Cancer Chemotherapy Center, Japanese Foundation for Cancer Research, 3-8-31 Ariake, Koto-ku, Tokyo 135-8550, Japan
| | - Yuri Tani
- Cancer Chemotherapy Center, Japanese Foundation for Cancer Research, 3-8-31 Ariake, Koto-ku, Tokyo 135-8550, Japan
| | - Satomi Tsukahara
- Cancer Chemotherapy Center, Japanese Foundation for Cancer Research, 3-8-31 Ariake, Koto-ku, Tokyo 135-8550, Japan
| | - Yuka Okamoto
- Cancer Chemotherapy Center, Japanese Foundation for Cancer Research, 3-8-31 Ariake, Koto-ku, Tokyo 135-8550, Japan
| | - Akihiro Tomida
- Cancer Chemotherapy Center, Japanese Foundation for Cancer Research, 3-8-31 Ariake, Koto-ku, Tokyo 135-8550, Japan
26
Song M, Kim M, Kang K, Kim YH, Jeon S. Application of Public Knowledge Discovery Tool (PKDE4J) to Represent Biomedical Scientific Knowledge. Front Res Metr Anal 2018. [DOI: 10.3389/frma.2018.00007] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 01/04/2023] Open
27
Ren J, Li G, Ross K, Arighi C, McGarvey P, Rao S, Cowart J, Madhavan S, Vijay-Shanker K, Wu CH. iTextMine: integrated text-mining system for large-scale knowledge extraction from the literature. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018; 2018:5255177. [PMID: 30576489 PMCID: PMC6301332 DOI: 10.1093/database/bay128] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 03/13/2018] [Accepted: 11/09/2018] [Indexed: 02/07/2023]
Abstract
Numerous efforts have been made for developing text-mining tools to extract information from biomedical text automatically. They have assisted in many biological tasks, such as database curation and hypothesis generation. Text-mining tools are usually different from each other in terms of programming language, system dependency and input/output format. There are few previous works that concern the integration of different text-mining tools and their results from large-scale text processing. In this paper, we describe the iTextMine system with an automated workflow to run multiple text-mining tools on large-scale text for knowledge extraction. We employ parallel processing with dockerized text-mining tools with a standardized JSON output format and implement a text alignment algorithm to solve the text discrepancy for result integration. iTextMine presently integrates four relation extraction tools, which have been used to process all the Medline abstracts and PMC open access full-length articles. The website allows users to browse the text evidence and view integrated results for knowledge discovery through a network view. We demonstrate the utilities of iTextMine with two use cases involving the gene PTEN and breast cancer and the gene SATB1.
Affiliation(s)
- Jia Ren
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
| | - Gang Li
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
| | - Karen Ross
- Protein Information Resource, Georgetown University Medical Center, Washington, DC, USA
| | - Cecilia Arighi
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA; Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
| | - Peter McGarvey
- Protein Information Resource, Georgetown University Medical Center, Washington, DC, USA; Innovation Center For Biomedical Informatics, Georgetown University, Washington, DC, USA
| | - Shruti Rao
- Innovation Center For Biomedical Informatics, Georgetown University, Washington, DC, USA
| | - Julie Cowart
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
| | - Subha Madhavan
- Innovation Center For Biomedical Informatics, Georgetown University, Washington, DC, USA; Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, DC, USA
| | - K Vijay-Shanker
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
| | - Cathy H Wu
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA; Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA; Protein Information Resource, Georgetown University Medical Center, Washington, DC, USA
28
Balderas-Martínez YI, Rinaldi F, Contreras G, Solano-Lira H, Sánchez-Pérez M, Collado-Vides J, Selman M, Pardo A. Improving biocuration of microRNAs in diseases: a case study in idiopathic pulmonary fibrosis. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017; 2017:3748307. [PMID: 28605770 PMCID: PMC5467562 DOI: 10.1093/database/bax030] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 08/30/2016] [Accepted: 03/25/2017] [Indexed: 12/24/2022]
Abstract
MicroRNAs (miRNAs) are small and non-coding RNA molecules that inhibit gene expression posttranscriptionally. They play important roles in several biological processes, and in recent years there has been an interest in studying how they are related to the pathogenesis of diseases. Although there are already some databases that contain information for miRNAs and their relation with illnesses, their curation represents a significant challenge due to the amount of information that is being generated every day. In particular, respiratory diseases are poorly documented in databases, despite the fact that they are of increasing concern regarding morbidity, mortality and economic impacts. In this work, we present the results that we obtained in the BioCreative Interactive Track (IAT), using a semiautomatic approach for improving biocuration of miRNAs related to diseases. Our procedures will be useful to complement databases that contain this type of information. We adapted the OntoGene text mining pipeline and the ODIN curation system in a full-text corpus of scientific publications concerning one specific respiratory disease: idiopathic pulmonary fibrosis, the most common and aggressive of the idiopathic interstitial cases of pneumonia. We curated 823 miRNA text snippets and found a total of 246 miRNAs related to this disease based on our semiautomatic approach with the system OntoGene/ODIN. The biocuration throughput improved by a factor of 12 compared with traditional manual biocuration. A significant advantage of our semiautomatic pipeline is that it can be applied to obtain the miRNAs of all the respiratory diseases and offers the possibility to be used for other illnesses. Database URL http://odin.ccg.unam.mx/ODIN/bc2015-miRNA/.
Affiliation(s)
- Yalbi Itzel Balderas-Martínez
- Facultad de Ciencias, Departamento Biología Celular, Universidad Nacional Autónoma de México, Ciudad Universitaria, Circuito Exterior s/n, Coyoacán, CP 04510, Ciudad de México, CDMX, México; CONACYT-INER Ismael Cosío Villegas, Departamento Investigación, Calzada de Tlalpan 4502 Sección XVI, Tlalpan, CP Ciudad de México, CDMX, México
| | - Fabio Rinaldi
- Swiss Institute of Bioinformatics and Institute of Computational Linguistics, University of Zurich, Andreasstrasse 15, CH-8050 Zurich, Switzerland; Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
| | - Gabriela Contreras
- Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
| | - Hilda Solano-Lira
- Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
| | - Mishael Sánchez-Pérez
- Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
| | - Julio Collado-Vides
- Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
| | - Moisés Selman
- Instituto Nacional de Enfermedades Respiratorias Ismael Cosío Villegas, Dirección de Investigación Calzada de Tlalpan 4502 Sección XVI, Tlalpan, CP Ciudad de México, CDMX, México
| | - Annie Pardo
- Facultad de Ciencias, Departamento Biología Celular, Universidad Nacional Autónoma de México, Ciudad Universitaria, Circuito Exterior s/n, Coyoacán, CP 04510, Ciudad de México, CDMX, México
29
Habibi M, Weber L, Neves M, Wiegandt DL, Leser U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 2017; 33:i37-i48. [PMID: 28881963 PMCID: PMC5870729 DOI: 10.1093/bioinformatics/btx228] [Citation(s) in RCA: 192] [Impact Index Per Article: 27.4] [Indexed: 12/04/2022] Open
Abstract
Motivation: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. Results: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. Availability and implementation: The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/. Contact: habibima@informatik.hu-berlin.de.
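Because this abstract reports its comparison as F1-scores driven by recall, a small worked example of the metric may help. The sketch below assumes exact-match scoring of `(start, end, type)` spans, which is a common NER convention but is not stated in the abstract; the function name and the toy spans are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over sets of entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up annotations: (start offset, end offset, entity type).
gold_spans = {(0, 5, "Gene"), (10, 17, "Disease"), (20, 28, "Chemical"), (30, 34, "Gene")}
pred_spans = {(0, 5, "Gene"), (10, 17, "Disease"), (21, 28, "Chemical")}

precision, recall, f1 = precision_recall_f1(gold_spans, pred_spans)
```

Here two of the three predictions match exactly, so precision is 2/3, recall is 2/4, and F1 is 4/7. The chemical span that is off by one character counts as both a false positive and a false negative under exact matching.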
Affiliation(s)
- Maryam Habibi
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Leon Weber
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Mariana Neves
- Enterprise Platform and Integration Concepts, Hasso-Plattner-Institute, Potsdam, Germany
| | - David Luis Wiegandt
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Ulf Leser
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
30
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
| | - Obdulia Rabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
| | - Anália Lourenço
- ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain; Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain; CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
| | - Julen Oyarzabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
| | - Alfonso Valencia
- Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain; Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
31
Anekalla KR, Courneya JP, Fiorini N, Lever J, Muchow M, Busby B. PubRunner: A light-weight framework for updating text mining results. F1000Res 2017; 6:612. [PMID: 29152221 PMCID: PMC5664974 DOI: 10.12688/f1000research.11389.2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Accepted: 11/01/2017] [Indexed: 11/20/2022] Open
Abstract
Biomedical text mining promises to assist biologists in quickly navigating the combined knowledge in their domain. This would allow improved understanding of the complex interactions within biological systems and faster hypothesis generation. New biomedical research articles are published daily and text mining tools are only as good as the corpus from which they work. Many text mining tools are underused because their results are static and do not reflect the constantly expanding knowledge in the field. In order for biomedical text mining to become an indispensable tool used by researchers, this problem must be addressed. To this end, we present PubRunner, a framework for regularly running text mining tools on the latest publications. PubRunner is lightweight, simple to use, and can be integrated with an existing text mining tool. The workflow involves downloading the latest abstracts from PubMed, executing a user-defined tool, pushing the resulting data to a public FTP or Zenodo dataset, and publicizing the location of these results on the public PubRunner website. We illustrate the use of this tool by re-running the commonly used word2vec tool on the latest PubMed abstracts to generate up-to-date word vector representations for the biomedical domain. This shows a proof of concept that we hope will encourage text mining developers to build tools that truly will aid biologists in exploring the latest publications.
Affiliation(s)
| | - J P Courneya
- Health Sciences and Human Services Library, University of Maryland, Baltimore, MD, 21201, USA
| | - Nicolas Fiorini
- National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, 20894, USA
| | - Jake Lever
- Canada's Michael Smith Genome Sciences Centre, University of British Columbia, Vancouver, BC, V5Z 4S6, Canada
| | - Michael Muchow
- National Institute of Standards and Technology, Gaithersburg, MD, 20899, USA
| | - Ben Busby
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
32
Lamurias A, Clarke LA, Couto FM. Extracting microRNA-gene relations from biomedical literature using distant supervision. PLoS One 2017; 12:e0171929. [PMID: 28263989 PMCID: PMC5338769 DOI: 10.1371/journal.pone.0171929] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 09/22/2016] [Accepted: 01/29/2017] [Indexed: 11/18/2022] Open
Abstract
Many biomedical relation extraction approaches are based on supervised machine learning, requiring an annotated corpus. Distant supervision aims at training a classifier by combining a knowledge base with a corpus, reducing the amount of manual effort necessary. This is particularly useful for biomedicine because many databases and ontologies have been made available for many biological processes, while the availability of annotated corpora is still limited. We studied the extraction of microRNA-gene relations from text. MicroRNA regulation is an important biological process due to its close association with human diseases. The proposed method, IBRel, is based on distantly supervised multi-instance learning. We evaluated IBRel on three datasets, and the results were compared with a co-occurrence approach as well as a supervised machine learning algorithm. While supervised learning outperformed on two of those datasets, IBRel obtained an F-score 28.3 percentage points higher on the dataset for which there was no training set developed specifically. To demonstrate the applicability of IBRel, we used it to extract 27 miRNA-gene relations from recently published papers about cystic fibrosis. Our results demonstrate that our method can be successfully used to extract relations from literature about a biological process without an annotated corpus. The source code and data used in this study are available at https://github.com/AndreLamurias/IBRel.
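The idea of distant supervision summarized in this abstract, combining a knowledge base with an unannotated corpus to obtain training labels, can be sketched as follows. This is not IBRel and omits the multi-instance learning step: the function name, the simplified miRNA regular expression, the gene lexicon, and the toy sentences are all assumptions made for illustration.

```python
import re

# Simplified pattern for miRNA names such as "miR-21" or "miR-155a" (assumption).
MIRNA_PATTERN = re.compile(r"\bmiR-\d+[a-z]?\b")


def distant_labels(sentences, gene_lexicon, known_pairs):
    """Label (sentence, miRNA, gene) candidates by lookup in a knowledge base.

    A candidate is positive when its (miRNA, gene) pair appears in
    known_pairs. The labels are noisy, since a sentence can mention a
    known pair without actually asserting the relation.
    """
    labeled = []
    for sentence in sentences:
        mirnas = MIRNA_PATTERN.findall(sentence)
        genes = [
            gene
            for gene in sorted(gene_lexicon)
            if re.search(rf"\b{re.escape(gene)}\b", sentence)
        ]
        for mirna in mirnas:
            for gene in genes:
                labeled.append((sentence, mirna, gene, (mirna, gene) in known_pairs))
    return labeled


# Toy inputs, made up for the example.
toy_sentences = [
    "miR-21 represses PTEN in fibroblasts.",
    "miR-21 and CFTR were both measured.",
]
toy_lexicon = {"PTEN", "CFTR"}
toy_known_pairs = {("miR-21", "PTEN")}

labeled = distant_labels(toy_sentences, toy_lexicon, toy_known_pairs)
```

The resulting weakly labeled candidates would serve as training data for a classifier in place of a manually annotated corpus.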
Affiliation(s)
- Andre Lamurias
- LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
| | - Luka A. Clarke
- BioISI: Biosystems & Integrative Sciences Institute, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
33
Raja K, Patrick M, Gao Y, Madu D, Yang Y, Tsoi LC. A Review of Recent Advancement in Integrating Omics Data with Literature Mining towards Biomedical Discoveries. Int J Genomics 2017; 2017:6213474. [PMID: 28331849 PMCID: PMC5346376 DOI: 10.1155/2017/6213474] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 10/07/2016] [Accepted: 02/09/2017] [Indexed: 12/13/2022] Open
Abstract
In the past decade, the volume of "omics" data generated by the different high-throughput technologies has expanded exponentially. The managing, storing, and analyzing of this big data have been a great challenge for researchers, especially when moving towards the goal of generating testable data-driven hypotheses, which has been the promise of the high-throughput experimental techniques. Different bioinformatics approaches have been developed to streamline the downstream analyses by providing independent information to interpret and provide biological inference. Text mining (also known as literature mining) is one of the commonly used approaches for automated generation of biological knowledge from the huge number of published articles. In this review paper, we discuss the recent advancement in approaches that integrate results from omics data and information generated from text mining approaches to uncover novel biomedical information.
Affiliation(s)
- Kalpana Raja
- Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
| | - Matthew Patrick
- Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
| | - Yilin Gao
- Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
| | - Desmond Madu
- Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
| | - Yuyang Yang
- Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
| | - Lam C. Tsoi
- Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
- Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
34
Wang Q, Ross KE, Huang H, Ren J, Li G, Vijay-Shanker K, Wu CH, Arighi CN. Analysis of Protein Phosphorylation and Its Functional Impact on Protein-Protein Interactions via Text Mining of the Scientific Literature. Methods Mol Biol 2017; 1558:213-232. [PMID: 28150240 PMCID: PMC5446092 DOI: 10.1007/978-1-4939-6783-4_10] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 04/12/2023]
Abstract
Post-translational modifications (PTMs) are one of the main contributors to the diversity of proteoforms in the proteomic landscape. In particular, protein phosphorylation represents an essential regulatory mechanism that plays a role in many biological processes. Protein kinases, the enzymes catalyzing this reaction, are key participants in metabolic and signaling pathways. Their activation or inactivation dictate downstream events: what substrates are modified and their subsequent impact (e.g., activation state, localization, protein-protein interactions (PPIs)). The biomedical literature continues to be the main source of evidence for experimental information about protein phosphorylation. Automatic methods to bring together phosphorylation events and phosphorylation-dependent PPIs can help to summarize the current knowledge and to expose hidden connections. In this chapter, we demonstrate two text mining tools, RLIMS-P and eFIP, for the retrieval and extraction of kinase-substrate-site data and phosphorylation-dependent PPIs from the literature. These tools offer several advantages over a literature search in PubMed as their results are specific for phosphorylation. RLIMS-P and eFIP results can be sorted, organized, and viewed in multiple ways to answer relevant biological questions, and the protein mentions are linked to UniProt identifiers.
Affiliation(s)
- Qinghua Wang
- Center for Bioinformatics and Computational Biology, Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE, 19711, USA
- Department of Computer & Information Sciences, University of Delaware, Newark, DE, 19711, USA
| | - Karen E Ross
- Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, 20057, USA
| | - Hongzhan Huang
- Center for Bioinformatics and Computational Biology, Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE, 19711, USA
- Department of Computer & Information Sciences, University of Delaware, Newark, DE, 19711, USA
| | - Jia Ren
- Center for Bioinformatics and Computational Biology, Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE, 19711, USA
| | - Gang Li
- Department of Computer & Information Sciences, University of Delaware, Newark, DE, 19711, USA
| | - K Vijay-Shanker
- Department of Computer & Information Sciences, University of Delaware, Newark, DE, 19711, USA
| | - Cathy H Wu
- Center for Bioinformatics and Computational Biology, Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE, 19711, USA
- Department of Computer & Information Sciences, University of Delaware, Newark, DE, 19711, USA
- Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, 20057, USA
| | - Cecilia N Arighi
- Center for Bioinformatics and Computational Biology, Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE, 19711, USA.
- Department of Computer & Information Sciences, University of Delaware, Newark, DE, 19711, USA.
35
Peng Y, Wei CH, Lu Z. Improving chemical disease relation extraction with rich features and weakly labeled data. J Cheminform 2016; 8:53. [PMID: 28316651 PMCID: PMC5054544 DOI: 10.1186/s13321-016-0165-z] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.1] [Received: 03/22/2016] [Accepted: 09/28/2016] [Indexed: 01/08/2023] Open
Abstract
Background: Due to the importance of identifying relations between chemicals and diseases for new drug discovery and improving chemical safety, there has been a growing interest in developing automatic relation extraction systems for capturing these relations from the rich and rapid-growing biomedical literature. In this work we aim to build on current advances in named entity recognition and a recent BioCreative effort to further improve the state of the art in biomedical relation extraction, in particular for the chemical-induced disease (CID) relations. Results: We propose a rich-feature approach with Support Vector Machine to aid in the extraction of CIDs from PubMed articles. Our feature vector includes novel statistical features, linguistic knowledge, and domain resources. We also incorporate the output of a rule-based system as features, thus combining the advantages of rule- and machine learning-based systems. Furthermore, we augment our approach with automatically generated labeled text from an existing knowledge base to improve performance without additional cost for corpus construction. To evaluate our system, we perform experiments on the human-annotated BioCreative V benchmarking dataset and compare with previous results. When trained using only BioCreative V training and development sets, our system achieves an F-score of 57.51%, which already compares favorably to previous methods. Our system performance was further improved to 61.01% in F-score when augmented with additional automatically generated weakly labeled data. Conclusions: Our text-mining approach demonstrates state-of-the-art performance in disease-chemical relation extraction. More importantly, this work exemplifies the use of (freely available) curated document-level annotations in existing biomedical databases, which are largely overlooked in text-mining system development.
Affiliation(s)
- Yifan Peng
- National Center for Biotechnology Information, Bethesda, MD 20894 USA; Computer and Information Sciences, University of Delaware, Newark, DE 19716 USA
- Chih-Hsuan Wei
- National Center for Biotechnology Information, Bethesda, MD 20894 USA
- Zhiyong Lu
- National Center for Biotechnology Information, Bethesda, MD 20894 USA
36
Roy S, Curry BC, Madahian B, Homayouni R. Prioritization, clustering and functional annotation of MicroRNAs using latent semantic indexing of MEDLINE abstracts. BMC Bioinformatics 2016; 17:350. [PMID: 27766940 PMCID: PMC5073981 DOI: 10.1186/s12859-016-1223-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 02/07/2023] Open
Abstract
Background: The amount of scientific information about MicroRNAs (miRNAs) is growing exponentially, making it difficult for researchers to interpret experimental results. In this study, we present an automated text mining approach using Latent Semantic Indexing (LSI) for prioritization, clustering and functional annotation of miRNAs. Results: For approximately 900 human miRNAs indexed in miRBase, text documents were created by concatenating titles and abstracts of MEDLINE citations which refer to the miRNAs. The documents were parsed and a weighted term-by-miRNA frequency matrix was created, which was subsequently factorized via singular value decomposition to extract pair-wise cosine values between the term (keyword) and miRNA vectors in reduced rank semantic space. LSI enables derivation of both explicit and implicit associations between entities based on word usage patterns. Using miR2Disease as a gold standard, we found that LSI identified keyword-to-miRNA relationships with high accuracy. In addition, we demonstrate that pair-wise associations between miRNAs can be used to group them into categories which are functionally aligned. Finally, term ranking by querying the LSI space with a group of miRNAs enabled annotation of the clusters with functionally related terms. Conclusions: LSI modeling of MEDLINE abstracts provides a robust and automated method for miRNA related knowledge discovery. The latest collection of miRNA abstracts and LSI model can be accessed through the web tool miRNA Literature Network (miRLiN) at http://bioinfo.memphis.edu/mirlin.
Affiliation(s)
- Sujoy Roy
- Bioinformatics Program, University of Memphis, Memphis, 38152, USA; Center for Translational Informatics, University of Memphis, Memphis, 38152, USA
- Brandon C Curry
- Bioinformatics Program, University of Memphis, Memphis, 38152, USA
- Behrouz Madahian
- Department of Mathematical Sciences, University of Memphis, Memphis, 38152, USA
- Ramin Homayouni
- Bioinformatics Program, University of Memphis, Memphis, 38152, USA; Center for Translational Informatics, University of Memphis, Memphis, 38152, USA; Department of Biology, University of Memphis, Memphis, 38152, USA
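The LSI procedure summarized in the abstract above (weighted term-by-miRNA matrix, singular value decomposition, cosines in a reduced-rank space) can be illustrated with a minimal sketch. This is not the authors' implementation: the TF-IDF weighting, the rank of 2, the scaling of both term and miRNA vectors by the singular values, and the toy counts and names (`miR-A`, `miR-B`, `miR-C`) are all assumptions made for illustration.

```python
import numpy as np

def lsi_cosines(counts, rank):
    """Cosines between term and document vectors in a rank-`rank` LSI space.

    `counts` is a terms x documents matrix of raw term frequencies.
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]

    # Assumed weighting: smoothed TF-IDF (the abstract only says "weighted").
    doc_freq = np.count_nonzero(counts, axis=1)
    idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0
    weighted = counts * idf[:, None]

    # Factorize and keep the top `rank` singular components.
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    term_vecs = u[:, :rank] * s[:rank]    # one row per term
    doc_vecs = vt[:rank, :].T * s[:rank]  # one row per document (miRNA)

    # Normalize rows to unit length so the dot product is a cosine.
    term_unit = term_vecs / np.maximum(
        np.linalg.norm(term_vecs, axis=1, keepdims=True), 1e-12)
    doc_unit = doc_vecs / np.maximum(
        np.linalg.norm(doc_vecs, axis=1, keepdims=True), 1e-12)
    return term_unit @ doc_unit.T

# Hypothetical toy data: rows are keywords, columns are per-miRNA documents.
terms = ["apoptosis", "tumor", "proliferation", "neuron", "synapse"]
mirnas = ["miR-A", "miR-B", "miR-C"]
counts = [
    [3, 2, 0],
    [2, 3, 0],
    [1, 2, 0],
    [0, 0, 3],
    [0, 1, 2],
]

cos = lsi_cosines(counts, rank=2)
for term, row in zip(terms, cos):
    print(term, dict(zip(mirnas, np.round(row, 3))))
```

Ranking the miRNAs in each printed row by cosine gives a keyword-to-miRNA prioritization of the kind the abstract describes; a real application would build the matrix from parsed MEDLINE text rather than hand-entered counts.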
37
Stepicheva NA, Song JL. Function and regulation of microRNA-31 in development and disease. Mol Reprod Dev 2016; 83:654-74. [PMID: 27405090 PMCID: PMC6040227 DOI: 10.1002/mrd.22678] [Citation(s) in RCA: 74] [Impact Index Per Article: 9.3] [Received: 04/17/2016] [Accepted: 06/29/2016] [Indexed: 12/13/2022]
Abstract
MicroRNAs (miRNAs) are small noncoding RNAs that orchestrate numerous cellular processes both under normal physiological conditions and in disease. This review summarizes the functional roles and transcriptional regulation of the highly evolutionarily conserved miRNA, microRNA-31 (miR-31). miR-31 is an important regulator of embryonic implantation, development, bone and muscle homeostasis, and immune system function. Its own regulation is disrupted during the onset and progression of cancer and autoimmune disorders such as psoriasis and systemic lupus erythematosus. Limited studies suggest that miR-31 is transcriptionally regulated by epigenetic mechanisms, such as methylation and acetylation, as well as by a number of transcription factors. Overall, miR-31 regulates diverse cellular and developmental processes by targeting genes involved in cell proliferation, apoptosis, cell differentiation, and cell motility.
Affiliation(s)
- Jia L. Song
- Department of Biological Sciences, University of Delaware, Newark, Delaware