1
Altay G, Zapardiel-Gonzalo J, Peters B. RNA-seq preprocessing and sample size considerations for gene network inference. bioRxiv: The Preprint Server for Biology 2023:2023.01.02.522518. [PMID: 36711979 PMCID: PMC9881880 DOI: 10.1101/2023.01.02.522518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
Background: Gene network inference (GNI) methods have the potential to reveal functional relationships between different genes and their products. Most GNI algorithms have been developed for microarray gene expression datasets and their application to RNA-seq data is relatively recent. As the characteristics of RNA-seq data are different from microarray data, it is an unanswered question what preprocessing methods for RNA-seq data should be applied prior to GNI to attain optimal performance, or what the required sample size for RNA-seq data is to obtain reliable GNI estimates. Results: We ran 9144 analyses of 7 different RNA-seq datasets to evaluate 300 different preprocessing combinations that include data transformations, normalizations and association estimators. We found that there was no single best performing preprocessing combination but that there were several good ones. The performance varied widely across datasets, which emphasizes the importance of choosing an appropriate preprocessing configuration before GNI. Two preprocessing combinations appeared promising in general: first, Log-2 TPM (transcripts per million) with variance-stabilizing transformation (VST) and the Pearson correlation coefficient (PCC) association estimator; second, raw RNA-seq count data with PCC. Along with these two, we also identified 18 other good preprocessing combinations. Any of these combinations might perform best on a given dataset. Therefore, the GNI performance of these approaches should be measured on any new dataset to select the best performing one for it. In terms of the required biological sample size of RNA-seq data, we found that between 30 and 85 samples were required to generate reliable GNI estimates. Conclusions: This study provides practical recommendations on default choices for data preprocessing prior to GNI analysis of RNA-seq data to obtain optimal performance results.
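To make the kind of preprocessing combination named in this abstract concrete, here is a minimal Python sketch, not the authors' pipeline. It converts a toy count matrix to TPM, applies a log2 transform, and uses the Pearson correlation coefficient as the association estimator. The `+ 1` pseudocount, the 0.8 edge threshold, the function names, and the toy data are all assumptions made for illustration; the VST step mentioned in the abstract is not reproduced here.

```python
import numpy as np


def counts_to_tpm(counts, gene_lengths_kb):
    """Convert a genes x samples count matrix to transcripts per million."""
    rate = counts / gene_lengths_kb[:, None]          # reads per kilobase
    return rate / rate.sum(axis=0, keepdims=True) * 1e6


def pcc_network(expr, threshold=0.8):
    """Return the gene-gene PCC matrix and a thresholded adjacency matrix.

    The threshold is a hypothetical choice, not a value from the paper.
    """
    corr = np.corrcoef(expr)                          # rows are genes
    np.fill_diagonal(corr, 0.0)                       # ignore self-associations
    return corr, np.abs(corr) >= threshold


# Toy data (invented): 3 genes x 4 samples.
counts = np.array([[10, 20, 30, 40],
                   [20, 40, 60, 80],
                   [50, 40, 30, 20]], dtype=float)
lengths_kb = np.array([1.0, 2.0, 1.5])

tpm = counts_to_tpm(counts, lengths_kb)
log_tpm = np.log2(tpm + 1.0)                          # assumed pseudocount of 1
corr, adjacency = pcc_network(log_tpm, threshold=0.8)
```

In practice the threshold, or a ranking-based cutoff, would be chosen per dataset, which is in line with the abstract's point that performance should be measured on each new dataset.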
Affiliation(s)
- Gökmen Altay
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA
- Bjoern Peters
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA
2
Basu A, Sarkar A, Bandyopadhyay S, Maulik U. In silico strategies to identify protein-protein interaction modulator in cell-to-cell transmission of SARS CoV2. Transbound Emerg Dis 2022; 69:3896-3905. [PMID: 36379049 DOI: 10.1111/tbed.14760] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2021] [Revised: 07/08/2022] [Accepted: 09/15/2022] [Indexed: 11/16/2022]
Abstract
RNA sequence data from SARS CoV2 patients help to construct a gene network related to this disease. A detailed analysis of the human host response to SARS CoV2 with expression profiling by high-throughput sequencing has been accomplished with primary human lung epithelial cell lines. Using this data, the clustered gene annotation and gene network construction are performed with the help of the String database. Among the four clusters identified, only one, with 44 genes, could be annotated. Interestingly, this corresponded to basal cells with p = 1.37e-05, which is relevant for respiratory tract infection. Functional enrichment analysis of genes present in the gene network has been completed using the String database and the Network Analyst tool. Among three types of cell-cell communication, only the anchoring junction between the basal cell membrane and the basal lamina in the host cell is involved in the virus transmission. At this junction, a hemidesmosome structure plays a vital role in virus spread from one cell to the basal lamina in the respiratory tract. In this protein complex structure, different integrin protein molecules of the host cell are used to promote the spread of virus infection into the extracellular matrix. So, small molecular blockers of different anchoring junction proteins, such as integrin alpha 3 and integrin beta 1, can provide efficient protection against this deadly viral disease. ORF8 from SARS CoV2 virus can interact with both integrin proteins of the human host. Using molecular docking, a ternary complex of these three proteins is modelled. Several oligopeptides are predicted as modulators for this ternary complex. In silico analysis of these modulators is very important to develop novel therapeutics for the treatment of SARS CoV2.
Affiliation(s)
- Anamika Basu
- Department of Biochemistry, Gurudas College, Kolkata, India
- Anasua Sarkar
- Computer Science and Engineering Department, Jadavpur University, Kolkata, India
- Ujjwal Maulik
- Computer Science and Engineering Department, Jadavpur University, Kolkata, India
3
Sharma PP, Bansal M, Sethi A, Poonam, Pena L, Goel VK, Grishina M, Chaturvedi S, Kumar D, Rathi B. Computational methods directed towards drug repurposing for COVID-19: advantages and limitations. RSC Adv 2021; 11:36181-36198. [PMID: 35492747 PMCID: PMC9043418 DOI: 10.1039/d1ra05320e] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 07/10/2021] [Accepted: 10/07/2021] [Indexed: 12/19/2022] Open
Abstract
Novel coronavirus disease 2019 (COVID-19) has significantly altered the socio-economic status of countries. Although vaccines are now available against the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of COVID-19, it continues to transmit and newer variants of concern have been consistently emerging world-wide. Computational strategies involving drug repurposing offer a viable opportunity to choose a medication from a list of approved drugs against distinct diseases including COVID-19. While pandemics impede the healthcare systems, drug repurposing or repositioning represents a hopeful approach in which existing drugs can be remodeled and employed to treat newer diseases. In this review, we summarize the diverse computational approaches attempted for developing drugs through drug repurposing or repositioning against COVID-19 and discuss their advantages and limitations. To this end, we have outlined studies that utilized computational techniques such as molecular docking, molecular dynamics simulation, disease-disease association, drug-drug interaction, integrated biological network, artificial intelligence, machine learning and network medicine to accelerate creation of smart and safe drugs against COVID-19.
Affiliation(s)
- Prem Prakash Sharma
- Laboratory For Translational Chemistry and Drug Discovery, Department of Chemistry, Hansraj College, University of Delhi Delhi 110007 India
- Meenakshi Bansal
- Laboratory For Translational Chemistry and Drug Discovery, Department of Chemistry, Hansraj College, University of Delhi Delhi 110007 India
- Aaftaab Sethi
- Department of Medicinal Chemistry, National Institute of Pharmaceutical Education and Research (NIPER) Hyderabad India
- Poonam
- Department of Chemistry, Miranda House, University of Delhi Delhi 110007 India
- Lindomar Pena
- Department of Virology, Aggeu Magalhaes Institute (IAM), Oswaldo Cruz Foundation (Fiocruz) Recife 50670-420 Pernambuco Brazil
- Vijay Kumar Goel
- School of Physical Sciences, Jawaharlal Nehru University New Delhi 110067 India
- Maria Grishina
- South Ural State University, Laboratory of Computational Modelling of Drugs Pr. Lenina 76 454080 Russia
- Shubhra Chaturvedi
- Division of Cyclotron and Radiopharmaceutical Sciences, Institute of Nuclear Medicine and Allied Sciences New Delhi 110054 India
- Dhruv Kumar
- Amity Institute of Molecular Medicine & Stem Cell Research (AIMMSCR), Amity University Uttar Pradesh Noida 201313 India
- Brijesh Rathi
- Laboratory For Translational Chemistry and Drug Discovery, Department of Chemistry, Hansraj College, University of Delhi Delhi 110007 India
4
Zhang R, Hristovski D, Schutte D, Kastrin A, Fiszman M, Kilicoglu H. Drug repurposing for COVID-19 via knowledge graph completion. J Biomed Inform 2021; 115:103696. [PMID: 33571675 PMCID: PMC7869625 DOI: 10.1016/j.jbi.2021.103696] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Received: 10/19/2020] [Revised: 12/23/2020] [Accepted: 02/01/2021] [Indexed: 02/07/2023]
Abstract
OBJECTIVE: To discover candidate drugs to repurpose for COVID-19 using literature-derived knowledge and knowledge graph completion methods. METHODS: We propose a novel, integrative, and neural network-based literature-based discovery (LBD) approach to identify drug candidates from PubMed and other COVID-19-focused research literature. Our approach relies on semantic triples extracted using SemRep (via SemMedDB). We identified an informative and accurate subset of semantic triples using filtering rules and an accuracy classifier developed on a BERT variant. We used this subset to construct a knowledge graph, and applied five state-of-the-art, neural knowledge graph completion algorithms (i.e., TransE, RotatE, DistMult, ComplEx, and STELP) to predict drug repurposing candidates. The models were trained and assessed using a time slicing approach and the predicted drugs were compared with a list of drugs reported in the literature and evaluated in clinical trials. These models were complemented by a discovery pattern-based approach. RESULTS: The accuracy classifier based on PubMedBERT achieved the best performance (F1 = 0.854) in identifying accurate semantic predications. Among the five knowledge graph completion models, TransE outperformed the others (MR = 0.923, Hits@1 = 0.417). Some known drugs linked to COVID-19 in the literature were identified, as well as others that have not yet been studied. Discovery patterns enabled identification of additional candidate drugs and generation of plausible hypotheses regarding the links between the candidate drugs and COVID-19. Among them, five highly ranked and novel drugs (i.e., paclitaxel, SB 203580, alpha 2-antiplasmin, metoclopramide, and oxymatrine) and the mechanistic explanations for their potential use are further discussed. CONCLUSION: We showed that an LBD approach can be feasible not only for discovering drug candidates for COVID-19, but also for generating mechanistic explanations. Our approach can be generalized to other diseases as well as to other clinical questions. Source code and data are available at https://github.com/kilicogluh/lbd-covid.
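As a rough illustration of how a translation-based completion model such as TransE ranks candidate tails, the Python sketch below scores a triple (head, relation, tail) by the negative distance between head + relation and tail, then computes the rank of the true tail and Hits@k. It is a sketch under stated assumptions, not the authors' code (which is at the repository linked above): the 2-dimensional embeddings are invented rather than trained, and the function names are hypothetical.

```python
import numpy as np


def transe_score(head, relation, tails, norm=1):
    """TransE plausibility: higher (less negative) means more plausible."""
    return -np.linalg.norm(head + relation - tails, ord=norm, axis=-1)


def rank_of_true_tail(head, relation, tails, true_index):
    """1-based rank of the true tail among all candidate tails."""
    scores = transe_score(head, relation, tails)
    order = np.argsort(-scores)                       # best candidate first
    return int(np.where(order == true_index)[0][0]) + 1


def hits_at_k(ranks, k):
    """Fraction of test triples whose true tail is ranked within the top k."""
    return float(np.mean(np.asarray(ranks) <= k))


# Invented embeddings for illustration only.
head = np.array([0.0, 0.0])                           # e.g. a drug
relation = np.array([1.0, 0.0])                       # e.g. a TREATS-like relation
candidate_tails = np.array([[1.0, 0.0],               # index 0: the true tail
                            [0.0, 1.0],
                            [3.0, 3.0]])

rank = rank_of_true_tail(head, relation, candidate_tails, true_index=0)
```

With trained embeddings, the same ranking would be repeated over every held-out triple and the resulting ranks summarized with `hits_at_k`.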
Affiliation(s)
- Rui Zhang
- Institute for Health Informatics and Department of Pharmaceutical Care & Health Systems, University of Minnesota, MN, USA.
- Dimitar Hristovski
- Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
- Dalton Schutte
- Institute for Health Informatics and Department of Pharmaceutical Care & Health Systems, University of Minnesota, MN, USA
- Andrej Kastrin
- Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
- Marcelo Fiszman
- NITES - Núcleo de Inovação e Tecnologia Em Saúde, Pontifical Catholic University of Rio de Janeiro, Brazil
- Halil Kilicoglu
- School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, USA
5
Kilicoglu H, Rosemblat G, Fiszman M, Shin D. Broad-coverage biomedical relation extraction with SemRep. BMC Bioinformatics 2020; 21:188. [PMID: 32410573 PMCID: PMC7222583 DOI: 10.1186/s12859-020-3517-7] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Received: 02/18/2020] [Accepted: 04/29/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep's performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships. RESULTS: A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F1 score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F1 score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F1 score. The recall and the F1 score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level. CONCLUSIONS: SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis.
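The F1 scores quoted in this abstract follow from the reported precision and recall through the standard harmonic-mean formula, F1 = 2PR / (P + R). The short Python check below recomputes them from the rounded values given above; the function name and the dictionary labels are illustrative choices, not from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the abstract.
reported = {
    "strict, annotated test set": (0.55, 0.34),
    "relaxed, annotated test set": (0.69, 0.42),
    "CDR corpus": (0.90, 0.24),
    "CDR corpus, sentence-bound": (0.90, 0.35),
}

f1_by_setting = {name: round(f1_score(p, r), 2) for name, (p, r) in reported.items()}
```

Rounded to two decimals, the recomputed values agree with the abstract: 0.42, 0.52, 0.38 and 0.50.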
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
- University of Illinois at Urbana-Champaign, School of Information Sciences, 501 E Daniel Street, Champaign, 61820 IL USA
- Graciela Rosemblat
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
- Dongwook Shin
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
6
de Campos LM, Cano A, Castellano JG, Moral S. Combining gene expression data and prior knowledge for inferring gene regulatory networks via Bayesian networks using structural restrictions. Stat Appl Genet Mol Biol 2019; 18:sagmb-2018-0042. [PMID: 31042646 DOI: 10.1515/sagmb-2018-0042] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/15/2022]
Abstract
Gene Regulatory Networks (GRNs) are known as the most adequate instrument to provide clear insight into and understanding of cellular systems. One of the most successful techniques to reconstruct GRNs using gene expression data is Bayesian networks (BNs), which have proven to be an ideal approach for heterogeneous data integration in the learning process. Nevertheless, the incorporation of prior knowledge has been achieved by using prior beliefs or by using networks as a starting point in the search process. In this work, the utilization of different kinds of structural restrictions within algorithms for learning BNs from gene expression data is considered. These restrictions codify prior knowledge in such a way that a BN should satisfy them. Therefore, one aim of this work is to provide a detailed review of the use of prior knowledge and gene expression data to infer GRNs with BNs, but the major purpose of this paper is to investigate whether structural learning algorithms for BNs from expression data can achieve better outcomes by exploiting this prior knowledge through structural restrictions. In the experimental study, it is shown that this new way of incorporating prior knowledge leads to better reverse-engineered networks.
Affiliation(s)
- Luis M de Campos
- Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Andrés Cano
- Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Javier G Castellano
- Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Serafín Moral
- Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
7
Kim YH, Song M. A context-based ABC model for literature-based discovery. PLoS One 2019; 14:e0215313. [PMID: 31017923 PMCID: PMC6481912 DOI: 10.1371/journal.pone.0215313] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/28/2018] [Accepted: 03/29/2019] [Indexed: 12/13/2022] Open
Abstract
Background: In literature-based discovery, considerable research has been done based on the ABC model developed by Swanson. The ABC model hypothesizes that there is a meaningful relation between entity A extracted from document set 1 and entity C extracted from document set 2 through B entities that appear in both document sets. The results of the ABC model are relations among entities A, B, and C, which are referred to as paths. A path allows for hypothesizing the relationship between entity A and entity C, or helps discover entity B as new evidence for the relationship between entity A and entity C. The co-occurrence-based approach of the ABC model is a well-known approach to automatic hypothesis generation by creating various paths. However, the co-occurrence-based ABC model has a limitation, in that biological context is not considered. It focuses only on matching the B entity that commonly appears in relations between two entities. Therefore, the paths extracted by the co-occurrence-based ABC model tend to include many irrelevant paths, meaning that expert verification is essential. Methods: In order to overcome this limitation of the co-occurrence-based ABC model, we propose a context-based approach to connecting one entity relation to another, modifying the ABC model using biological contexts. In this study, we defined four biological context elements: cell, drug, disease, and organism. Based on these biological contexts, we propose two extended ABC models: a context-based ABC model and a context-assignment-based ABC model. In order to measure the performance of both proposed models, we examined the relevance of the B entities between the well-known relations “APOE–MAPT” and “FUS–TARDBP”. Each relation represents an interaction between proteins associated with neurodegenerative disease. The interaction between APOE and MAPT is known to play a crucial role in Alzheimer’s disease as APOE affects tau-mediated neurodegeneration. It has been shown that mutations in FUS and TARDBP are associated with amyotrophic lateral sclerosis (ALS), a motor neuron disease, by leading to neuronal cell death. Using these two relations, we compared both proposed models to the co-occurrence-based ABC model. Results: The precision of B entities extracted by the co-occurrence-based ABC model was 27.1% for “APOE–MAPT” and 22.1% for “FUS–TARDBP”. In the context-based ABC model, the precision of extracted B entities was 71.4% for “APOE–MAPT” and 77.9% for “FUS–TARDBP”. The context-assignment-based ABC model achieved 89% and 97.5% precision for the two relations, respectively. Both proposed models achieved a higher precision than the co-occurrence-based ABC model.
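The co-occurrence-based ABC baseline described in this abstract can be sketched in a few lines of Python: collect the entities that co-occur with A in document set 1, collect those that co-occur with C in document set 2, and take the intersection as candidate B entities. This is a minimal sketch, not the authors' implementation. The documents, the candidate entities, and the "relevant" set below are invented toy data, and the function names are hypothetical; the sketch does not model the biological context elements that the proposed models add.

```python
def cooccurring_entities(documents, target):
    """Entities that appear in at least one document together with `target`."""
    found = set()
    for entities in documents:
        if target in entities:
            found |= set(entities) - {target}
    return found


def abc_b_entities(docs_a, docs_c, a, c):
    """Candidate B entities linking A and C by plain co-occurrence."""
    shared = cooccurring_entities(docs_a, a) & cooccurring_entities(docs_c, c)
    return sorted(shared - {a, c})


def precision(predicted, relevant):
    """Share of predicted B entities that an expert marked as relevant."""
    if not predicted:
        return 0.0
    return len(set(predicted) & set(relevant)) / len(predicted)


# Toy entity sets per document (invented for illustration only).
docs_with_a = [{"APOE", "tau", "lipid"}, {"APOE", "microglia"}, {"BRCA1", "tau"}]
docs_with_c = [{"MAPT", "tau"}, {"MAPT", "microglia", "kinase"}]

b_entities = abc_b_entities(docs_with_a, docs_with_c, "APOE", "MAPT")
toy_precision = precision(b_entities, relevant={"tau"})
```

Because every shared entity is accepted regardless of context, a baseline like this tends to return many irrelevant B entities, which is the limitation the context-based variants are designed to address.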
Affiliation(s)
- Yong Hwan Kim
- Division of Humanities, CheongJu University, CheongJu, Korea
- Min Song
- Department of Library and Information Science, Yonsei University, Seoul, Korea
8
Ko Y, Kim J, Rodriguez-Zas SL. Markov chain Monte Carlo simulation of a Bayesian mixture model for gene network inference. Genes Genomics 2019; 41:547-555. [PMID: 30741379 DOI: 10.1007/s13258-019-00789-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/16/2018] [Accepted: 01/21/2019] [Indexed: 12/31/2022]
Abstract
BACKGROUND: Simultaneous measurement of expression levels for thousands of genes provides rich information about many different aspects of biological mechanisms. A major computational challenge is to find methods to extract new biological insights from this wealth of data. Complex biological processes are often regulated under various conditions or circumstances, and the associated gene interactions change dynamically depending on the biological context. Thus, inferring such dynamic relationships between genes while taking biological conditions into account is very challenging. METHOD: In this study, we propose a comprehensive and integrated approach to infer the dynamic relationships between genes and evaluate this approach on three distinct gene networks. RESULTS: This study demonstrates the advantage of integrating Markov chain Monte Carlo (MCMC) simulation into a Bayesian mixture model to overcome the high-dimension, low sample size (HDLSS) problem as well as to identify context-specific biological modules. Such biological modules were identified through the summarization of sampled network structures obtained from MCMC simulation. CONCLUSION: This novel approach gives a comprehensive understanding of the dynamically regulated biological modules.
Affiliation(s)
- Younhee Ko
- Division of Biomedical Engineering, Hankuk University of Foreign Studies, Gyeonggi-do, 17035, South Korea
- Jaebum Kim
- Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, South Korea.
- Sandra L Rodriguez-Zas
- Department of Animal Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, 61820, USA.
- Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, 61820, USA.
9
Chen G, Jia Y, Zhu L, Li P, Zhang L, Tao C, Jim Zheng W. Gene fingerprint model for literature based detection of the associations among complex diseases: a case study of COPD. BMC Med Inform Decis Mak 2019; 19:20. [PMID: 30700303 PMCID: PMC6354331 DOI: 10.1186/s12911-019-0738-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Disease comorbidity is very common and has significant impact on disease treatment. Revealing the associations among diseases may help to understand the mechanisms of diseases, improve the prevention and treatment of diseases, and support the discovery of new drugs or new uses of existing drugs. METHODS: In this paper, we introduced a mathematical model to represent gene-related diseases with a series of associated genes based on the overrepresentation of genes and diseases in PubMed literature. We also illustrated an efficient way to reveal the implicit connections between COPD and other diseases based on this model. RESULTS: We applied this approach to analyze the relationships between Chronic Obstructive Pulmonary Disease (COPD) and other diseases under the Lung Diseases branch of the Medical Subject Headings (MeSH) index system and detected 4 novel diseases relevant to COPD. As judged by domain experts, the F score of our approach is up to 77.6%. CONCLUSIONS: The results demonstrate the effectiveness of the gene fingerprint model for diseases on the basis of medical literature.
Affiliation(s)
- Guocai Chen
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX 77030 USA
- Yuxi Jia
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX 77030 USA
- Department of Medical Informatics, School of Public Health, Jilin University, Changchun, Jilin, 130021 China
- Lisha Zhu
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX 77030 USA
- Ping Li
- Department of Development Pediatrics, The Second Affiliated Hospital of Jilin University, Changchun, Jilin, 130041 China
- Lin Zhang
- Department of Respiratory Medicine, The Second Affiliated Hospital of Jilin University, Changchun, Jilin, 130041 China
- Cui Tao
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX 77030 USA
- W. Jim Zheng
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX 77030 USA
10
Chen G, Ramírez JC, Deng N, Qiu X, Wu C, Zheng WJ, Wu H. Restructured GEO: restructuring Gene Expression Omnibus metadata for genome dynamics analysis. Database: The Journal of Biological Databases and Curation 2019; 2019:5289627. [PMID: 30649296 PMCID: PMC6333964 DOI: 10.1093/database/bay145] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 02/24/2018] [Accepted: 12/11/2018] [Indexed: 11/14/2022]
Abstract
Motivation: Gene Expression Omnibus (GEO) and other publicly available data repositories store their metadata as unstructured English text, which is very difficult to reuse automatically. Results: We employed text mining techniques to analyze the metadata of GEO and developed the Restructured GEO database (ReGEO). ReGEO reorganizes and categorizes GEO series and makes them searchable by two new attributes extracted automatically from each series' metadata. These attributes are the number of time points tested in the experiment and the disease being investigated. ReGEO also makes series searchable by other attributes available in GEO, such as platform organism, experiment type, associated PubMed ID as well as general keywords in the study's description. Our approach greatly expands the usability of GEO data, demonstrating a credible approach to improve the utility of the vast amount of publicly available data in the era of Big Data research.
Affiliation(s)
- Guocai Chen
- School of Biomedical Informatics, University of Texas Health Science Center at Houston (UTHealth), Houston, Texas, USA
- Nan Deng
- Department of Biostatistics & Data Science, School of Public Health, University of Texas Health Science Center at Houston (UTHealth), Houston, Texas, USA
- Xing Qiu
- Department of Biostatistics and Computational Biology, School of Medicine and Dentistry, University of Rochester, Rochester, New York, USA
- Canglin Wu
- TechWave International. Inc., Houston, Texas, USA
- W Jim Zheng
- School of Biomedical Informatics, University of Texas Health Science Center at Houston (UTHealth), Houston, Texas, USA
- Hulin Wu
- Department of Biostatistics & Data Science, School of Public Health, University of Texas Health Science Center at Houston (UTHealth), Houston, Texas, USA
11
Kilicoglu H. Biomedical text mining for research rigor and integrity: tasks, challenges, directions. Brief Bioinform 2018; 19:1400-1414. [PMID: 28633401 PMCID: PMC6291799 DOI: 10.1093/bib/bbx057] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 02/07/2017] [Revised: 04/10/2017] [Indexed: 01/01/2023] Open
Abstract
An estimated quarter of a trillion US dollars is invested in the biomedical research enterprise annually. There is growing alarm that a significant portion of this investment is wasted because of problems in reproducibility of research findings and in the rigor and integrity of research conduct and reporting. Recent years have seen a flurry of activities focusing on standardization and guideline development to enhance the reproducibility and rigor of biomedical research. Research activity is primarily communicated via textual artifacts, ranging from grant applications to journal publications. These artifacts can be both the source and the manifestation of practices leading to research waste. For example, an article may describe a poorly designed experiment, or the authors may reach conclusions not supported by the evidence presented. In this article, we pose the question of whether biomedical text mining techniques can assist the stakeholders in the biomedical research enterprise in doing their part toward enhancing research integrity and rigor. In particular, we identify four key areas in which text mining techniques can make a significant contribution: plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload and accurate citation/enhanced bibliometrics. We review the existing methods and tools for specific tasks, if they exist, or discuss relevant research that can provide guidance for future work. With the exponential increase in biomedical research output and the ability of text mining approaches to perform automatic tasks at large scale, we propose that such approaches can support tools that promote responsible research practices, providing significant benefits for the biomedical research enterprise.
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, US National Library of Medicine
12
Roy S, Yun D, Madahian B, Berry MW, Deng LY, Goldowitz D, Homayouni R. Navigating the Functional Landscape of Transcription Factors via Non-Negative Tensor Factorization Analysis of MEDLINE Abstracts. Front Bioeng Biotechnol 2017; 5:48. [PMID: 28894735 PMCID: PMC5581332 DOI: 10.3389/fbioe.2017.00048] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 05/18/2017] [Accepted: 07/31/2017] [Indexed: 01/09/2023] Open
Abstract
In this study, we developed and evaluated a novel text-mining approach, using non-negative tensor factorization (NTF), to simultaneously extract and functionally annotate transcriptional modules consisting of sets of genes, transcription factors (TFs), and terms from MEDLINE abstracts. A sparse 3-mode term × gene × TF tensor was constructed that contained weighted frequencies of 106,895 terms in 26,781 abstracts shared among 7,695 genes and 994 TFs. The tensor was decomposed into sub-tensors using non-negative tensor factorization (NTF) across 16 different approximation ranks. Dominant entries of each of 2,861 sub-tensors were extracted to form term–gene–TF annotated transcriptional modules (ATMs). More than 94% of the ATMs were found to be enriched in at least one KEGG pathway or GO category, suggesting that the ATMs are functionally relevant. One advantage of this method is that it can discover potentially new gene–TF associations from the literature. Using a set of microarray and ChIP-Seq datasets as gold standard, we show that the precision of our method for predicting gene–TF associations is significantly higher than chance. In addition, we demonstrate that the terms in each ATM can be used to suggest new GO classifications to genes and TFs. Taken together, our results indicate that NTF is useful for simultaneous extraction and functional annotation of transcriptional regulatory networks from unstructured text, as well as for literature based discovery. A web tool called Transcriptional Regulatory Modules Extracted from Literature (TREMEL), available at http://binf1.memphis.edu/tremel, was built to enable browsing and searching of ATMs.
Affiliation(s)
- Sujoy Roy
- Bioinformatics Program, University of Memphis, Memphis, TN, United States; Center for Translational Informatics, University of Memphis, Memphis, TN, United States
- Daqing Yun
- Computer and Information Sciences Program, Harrisburg University of Science and Technology, Harrisburg, PA, United States
- Behrouz Madahian
- Department of Mathematical Sciences, University of Memphis, Memphis, TN, United States
- Michael W Berry
- Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, United States
- Lih-Yuan Deng
- Department of Mathematical Sciences, University of Memphis, Memphis, TN, United States
- Daniel Goldowitz
- Center for Molecular Medicine and Therapeutics, University of British Columbia, Vancouver, BC, Canada
- Ramin Homayouni
- Bioinformatics Program, University of Memphis, Memphis, TN, United States; Center for Translational Informatics, University of Memphis, Memphis, TN, United States; Department of Biological Sciences, University of Memphis, Memphis, TN, United States
|
13
|
Ding YP, Ladeiro Y, Morilla I, Bouhnik Y, Marah A, Zaag H, Cazals-Hatem D, Seksik P, Daniel F, Hugot JP, Wainrib G, Tréton X, Ogier-Denis E. Integrative Network-based Analysis of Colonic Detoxification Gene Expression in Ulcerative Colitis According to Smoking Status. J Crohns Colitis 2017; 11:474-484. [PMID: 27702825 DOI: 10.1093/ecco-jcc/jjw179] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 03/22/2016] [Accepted: 10/03/2016] [Indexed: 02/08/2023]
Abstract
BACKGROUNDS AND AIMS The effect of cigarette smoking [CS] is ambivalent since smoking improves ulcerative colitis [UC] while it worsens Crohn's disease [CD]. Although this clinical relationship between inflammatory bowel disease [IBD] and tobacco is well established, only a few experimental works have investigated the effect of smoking on the colonic barrier homeostasis focusing on xenobiotic detoxification genes. METHODS A comprehensive and integrated comparative analysis of the global xenobiotic detoxification capacity of the normal colonic mucosa of healthy smokers [n = 8] and non-smokers [n = 9] versus the non-affected colonic mucosa of UC patients [n = 19] was performed by quantitative real-time polymerase chain reaction [qRT PCR]. The detoxification gene expression profile was analysed in CD patients [n = 18], in smoking UC patients [n = 5], and in biopsies from non-smoking UC patients cultured or not with cigarette smoke extract [n = 8]. RESULTS Of the 244 detoxification genes investigated, 65 were dysregulated in UC patients in comparison with healthy controls or CD patients. The expression of ≥ 45/65 genes was inversed by CS in biopsies of smoking UC patients in remission and in colonic explants of UC patients exposed to cigarette smoke extract. We devised a network-based data analysis approach for differentially assessing changes in genetic interactions, allowing identification of unexpected regulatory detoxification genes that may play a major role in the beneficial effect of smoking on UC. CONCLUSIONS Non-inflamed colonic mucosa in UC is characterised by a specifically altered detoxification gene network, which is partially restored by tobacco. These mucosal signatures could be useful for developing new therapeutic strategies and biomarkers of drug response in UC.
Affiliation(s)
- Yong-Ping Ding
- INSERM, Research Centre of Inflammation BP 416, Paris, France; Université Paris-Diderot Sorbonne Paris-Cité, Paris, France; Laboratory of Excellence Labex INFLAMEX, Sorbonne-Paris-Cité, Paris, France
- Yannick Ladeiro
- INSERM, Research Centre of Inflammation BP 416, Paris, France; Université Paris-Diderot Sorbonne Paris-Cité, Paris, France; Laboratory of Excellence Labex INFLAMEX, Sorbonne-Paris-Cité, Paris, France
- Ian Morilla
- INSERM, Research Centre of Inflammation BP 416, Paris, France; Université Paris-Diderot Sorbonne Paris-Cité, Paris, France; Laboratory of Excellence Labex INFLAMEX, Sorbonne-Paris-Cité, Paris, France; Université Paris 13, Sorbonne Paris Cité, Villetaneuse, France
- Yoram Bouhnik
- INSERM, Research Centre of Inflammation BP 416, Paris, France; Université Paris-Diderot Sorbonne Paris-Cité, Paris, France; Laboratory of Excellence Labex INFLAMEX, Sorbonne-Paris-Cité, Paris, France; Assistance Publique Hôpitaux de Paris, Service de gastroentérologie, MICI et assistance nutritive, Hôpital Beaujon, Clichy la Garenne, France
- Assiya Marah
- INSERM, Research Centre of Inflammation BP 416, Paris, France; Université Paris-Diderot Sorbonne Paris-Cité, Paris, France; Laboratory of Excellence Labex INFLAMEX, Sorbonne-Paris-Cité, Paris, France
- Hatem Zaag
- Laboratory of Excellence Labex INFLAMEX, Sorbonne-Paris-Cité, Paris, France; Université Paris 13, Sorbonne Paris Cité, Villetaneuse, France
- Dominique Cazals-Hatem
- INSERM, Research Centre of Inflammation BP 416, Paris, France; Université Paris-Diderot Sorbonne Paris-Cité, Paris, France; Laboratory of Excellence Labex INFLAMEX, Sorbonne-Paris-Cité, Paris, France; Assistance Publique Hôpitaux de Paris, Service d'anatomopathologie, Hôpital Beaujon, Clichy la Garenne, France
- Philippe Seksik
- INSERM U1157, UMR 7203, F-7502, Paris, France; Assistance Publique Hôpitaux de Paris, Hôpital Saint-Antoine, Paris, France
- Fanny Daniel
- INSERM, Research Centre of Inflammation BP 416, Paris, France; Université Paris-Diderot Sorbonne Paris-Cité, Paris, France; Laboratory of Excellence Labex INFLAMEX, Sorbonne-Paris-Cité, Paris, France
- Jean-Pierre Hugot
- INSERM, Research Centre of Inflammation BP 416, Paris, France; Université Paris-Diderot Sorbonne Paris-Cité, Paris, France; Laboratory of Excellence Labex INFLAMEX, Sorbonne-Paris-Cité, Paris, France; Assistance Publique Hôpitaux de Paris, Hôpital Robert Debré, Paris, France
- Gilles Wainrib
- Laboratory of Excellence Labex INFLAMEX, Sorbonne-Paris-Cité, Paris, France; Département d'Informatique, Equipe DATA, Ecole Normale Supérieure, Paris, France
- Xavier Tréton
- INSERM, Research Centre of Inflammation BP 416, Paris, France; Université Paris-Diderot Sorbonne Paris-Cité, Paris, France; Laboratory of Excellence Labex INFLAMEX, Sorbonne-Paris-Cité, Paris, France; Assistance Publique Hôpitaux de Paris, Service de gastroentérologie, MICI et assistance nutritive, Hôpital Beaujon, Clichy la Garenne, France
- Eric Ogier-Denis
- INSERM, Research Centre of Inflammation BP 416, Paris, France; Université Paris-Diderot Sorbonne Paris-Cité, Paris, France; Laboratory of Excellence Labex INFLAMEX, Sorbonne-Paris-Cité, Paris, France
|
14
|
Supervised EEG Source Imaging with Graph Regularization in Transformed Domain. Brain Inform 2017. [DOI: 10.1007/978-3-319-70772-3_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
15
|
Kim YH, Beak SH, Charidimou A, Song M. Discovering New Genes in the Pathways of Common Sporadic Neurodegenerative Diseases: A Bioinformatics Approach. J Alzheimers Dis 2016; 51:293-312. [PMID: 26836166 DOI: 10.3233/jad-150769] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 01/18/2023]
Abstract
Late onset Alzheimer's disease (AD) and Parkinson's disease (PD) are mostly "sporadic" age-related neurodegenerative disorders, but with a clear genetic component. However, their genetic architecture is complex and heterogeneous and remains largely a conundrum, with only a handful of well-established genetic risk factors consistently associated with these diseases. It is possible that numerous, as yet undiscovered, AD- and PD-related genes exist. We focused on the 'gene' as a mediator to find new potential genes that might be related to both disorders, using bio-literature mining techniques. Based on Entrez Gene, we extracted genes and directional gene-gene relations from the entire set of MEDLINE records and then constructed a directional gene-gene network. We identified common genes associated with two different but related diseases by performing shortest path analysis on the network. With our approach, we were able to identify and map already known genes that have a direct relationship with PD and AD. In addition, we identified 7 genes not previously known to bridge these two disorders. We confirmed 4 genes, ROS1, FMN1, ATP8A2, and SNORD12C, through the biomedical literature and further examined 3 genes, ERVK-10, PRS, and C7orf49, that have a high likelihood of being related to both diseases. Additional experiments were performed to demonstrate the effectiveness of our proposed method. Compared with the co-occurrence approach, our approach detected 25% more candidate genes and verified 10% more genes related to both diseases.
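The shortest path analysis described in this abstract can be illustrated with a minimal sketch. It assumes the directional gene-gene network is stored as an adjacency dictionary and that "bridge" candidates are the interior nodes of shortest directed paths from genes of one disease to genes of the other. The function names, the toy network, and the gene labels are hypothetical and are not taken from the paper.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Breadth-first search on a directed graph.

    Returns one shortest path as a list of nodes, or None if the
    target cannot be reached from the source.
    """
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def bridge_genes(graph, seeds_a, seeds_b):
    """Collect interior nodes of shortest paths from seeds_a to seeds_b."""
    bridges = set()
    for a in seeds_a:
        for b in seeds_b:
            path = shortest_path(graph, a, b)
            if path and len(path) > 2:
                bridges.update(path[1:-1])
    return bridges - set(seeds_a) - set(seeds_b)


# Toy directed network; all node names are placeholders (assumption).
toy_graph = {
    "AD_gene1": ["X1"],
    "X1": ["X2", "PD_gene1"],
    "X2": ["PD_gene2"],
    "AD_gene2": ["PD_gene2"],
}
candidates = bridge_genes(
    toy_graph, ["AD_gene1", "AD_gene2"], ["PD_gene1", "PD_gene2"]
)
```

In this toy network, the direct edge from `AD_gene2` to `PD_gene2` contributes no candidate because its path has no interior node.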
Affiliation(s)
- Yong Hwan Kim
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul, Republic of Korea
- Seung Han Beak
- Institute of Convergence, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul, Republic of Korea
- Andreas Charidimou
- Department of Neurology, Massachusetts General Hospital Stroke Research Center, Harvard Medical School, Boston, MA, USA
- Min Song
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul, Republic of Korea
|
16
|
Yu C, Wang J. A Physical Mechanism and Global Quantification of Breast Cancer. PLoS One 2016; 11:e0157422. [PMID: 27410227 PMCID: PMC4943646 DOI: 10.1371/journal.pone.0157422] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 02/03/2016] [Accepted: 05/31/2016] [Indexed: 12/24/2022]
Abstract
Initiation and progression of cancer depend on many factors. Those on the genetic level are often considered crucial. To gain insight into the physical mechanisms of breast cancer, we construct a gene regulatory network (GRN) which reflects both genetic and environmental aspects of breast cancer. The construction of the GRN is based on available experimental data. Three basins of attraction, representing the normal, premalignant and cancer states respectively, were found on the phenotypic landscape. The progression of breast cancer can be seen as switching transitions between different state basins. We quantified the stabilities and kinetic paths of the three state basins to uncover the biological process of breast cancer formation. The gene expression levels at each state were obtained, which can be tested directly in experiments. Furthermore, by performing global sensitivity analysis on the landscape topography, six key genes (HER2, MDM2, TP53, BRCA1, ATM, CDK2) and four regulations (HER2⊣TP53, CDK2⊣BRCA1, ATM→MDM2, TP53→ATM) were identified as being critical for breast cancer. Interestingly, HER2 and MDM2 are the most popular targets for treating breast cancer. BRCA1 and TP53 are, respectively, the most important breast cancer oncogene and tumor suppressor gene. This further validates the feasibility of our model and the reliability of our prediction results. The regulation ATM→MDM2 has been extensively studied in DNA damage but not in breast cancer. We note the importance of ATM→MDM2 in breast cancer. Previous studies of breast cancer have often focused on individual genes, and anti-cancer drugs are mainly used to target individual genes. Our results show that the network-based strategy is more effective for treating breast cancer. The landscape approach serves as a new strategy for analyzing breast cancer on both the genetic and epigenetic levels and can help in designing network-based medicine for breast cancer.
Affiliation(s)
- Chong Yu
- State Key Laboratory of Electroanalytical Chemistry/Changchun Institute of Applied Chemistry, Chinese Academy of Sciences/Changchun, Jilin 130022, China
- Jin Wang
- State Key Laboratory of Electroanalytical Chemistry/Changchun Institute of Applied Chemistry, Chinese Academy of Sciences/Changchun, Jilin 130022, China
- College of Physics/Jilin University, Changchun, Jilin 130012, China
- Department of Chemistry, Physics & Applied Mathematics/State University of New York at Stony Brook/Stony Brook, NY 11794-3400, United States of America
|
17
|
Mayer G, Marcus K, Eisenacher M, Kohl M. Boolean modeling techniques for protein co-expression networks in systems medicine. Expert Rev Proteomics 2016; 13:555-69. [PMID: 27105325 DOI: 10.1080/14789450.2016.1181546] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Abstract
INTRODUCTION Application of systems biology/systems medicine approaches is promising for proteomics/biomedical research, but requires selection of an adequate modeling type. AREAS COVERED This article reviews the existing Boolean network modeling approaches, which offer several advantages over alternative modeling techniques for the processing of proteomics data. Application of methods for inference, reduction and validation of protein co-expression networks that are derived from quantitative high-throughput proteomics measurements is presented. It is also shown how Boolean models can be used to derive system-theoretic characteristics that describe both the dynamical behavior of such networks as a whole and the properties of different cell states (e.g. healthy or diseased cell states). Furthermore, application of methods derived from control theory is proposed in order to simulate the effects of therapeutic interventions on such networks, which is a promising approach for the computer-assisted discovery of biomarkers and drug targets. Finally, the clinical application of Boolean modeling analyses is discussed. EXPERT COMMENTARY Boolean modeling of proteomics data is still in its infancy. Progress in this field strongly depends on provision of a repository with public access to relevant reference models. Also required are community-supported standards that facilitate input of both proteomics and patient-related data (e.g. age, gender, laboratory results, etc.).
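As a rough illustration of how a Boolean model yields system-level characteristics such as stable states, the sketch below enumerates the attractors of a tiny synchronous Boolean network. The two-node mutual-inhibition network and all function names are assumptions chosen for illustration; they are not taken from the review.

```python
from itertools import product


def step(state, rules):
    """Apply every update rule synchronously to the current state."""
    return tuple(rule(state) for rule in rules)


def find_attractors(rules):
    """Enumerate all attractors by simulating from every initial state.

    Each attractor is returned as a tuple of states, rotated so that it
    starts at its smallest state (a canonical form for deduplication).
    """
    attractors = set()
    for start in product((False, True), repeat=len(rules)):
        seen = {}  # state -> index of first visit (insertion ordered)
        state = start
        while state not in seen:
            seen[state] = len(seen)
            state = step(state, rules)
        first = seen[state]  # the cycle begins at the first repeated state
        cycle = [s for s, i in seen.items() if i >= first]
        k = cycle.index(min(cycle))
        attractors.add(tuple(cycle[k:] + cycle[:k]))
    return attractors


# Hypothetical toggle switch: each node is active only if the other is not.
toggle_rules = [
    lambda s: not s[1],  # next value of node A
    lambda s: not s[0],  # next value of node B
]
attractors = find_attractors(toggle_rules)
fixed_points = {a[0] for a in attractors if len(a) == 1}
```

Under synchronous updating this toy network has two fixed points, which could be read as two alternative cell states, plus one two-state oscillation. Exhaustive enumeration is only feasible for small networks, since the state space doubles with every added node.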
Affiliation(s)
- Gerhard Mayer
- Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, Bochum, Germany
- Katrin Marcus
- Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, Bochum, Germany
- Martin Eisenacher
- Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, Bochum, Germany
- Michael Kohl
- Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, Bochum, Germany
|
18
|
Dholaniya PS, Ghosh S, Surampudi BR, Kondapi AK. A knowledge driven supervised learning approach to identify gene network of differentially up-regulated genes during neuronal senescence in Rattus norvegicus. Biosystems 2015; 135:9-14. [PMID: 26163927 DOI: 10.1016/j.biosystems.2015.07.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/20/2015] [Revised: 05/18/2015] [Accepted: 07/06/2015] [Indexed: 12/22/2022]
Abstract
Various approaches have been described to infer the gene interaction network from expression data. Several models based on computational and mathematical methods are available. The fundamental consideration in identifying gene interactions is their biological relevance. Two genes belonging to the same pathway are more likely to affect each other's expression than genes from two different pathways. In the present study, a gene interaction network is described based on genes upregulated during neuronal senescence in the cerebellar granule neurons of rat. We have adopted a supervised learning method and used it in combination with biological pathway information of the genes to develop a gene interaction network. Further modular analysis of the network has been done to identify senescence-related marker genes. Currently there is no adequate information available about the genes implicated in neuronal senescence. Thus, identifying multipath genes belonging to the pathways affected by senescence might be very useful in studying the senescence process.
Affiliation(s)
- Pankaj Singh Dholaniya
- Department of Biotechnology and Bioinformatics, School of Life Sciences, University of Hyderabad, Hyderabad 500046, Telangana, India; Cognitive Science Lab, International Institute of Information Technology (IIIT) Hyderabad, Hyderabad 500032, Telangana, India
- Soumitra Ghosh
- School of Computer and Information Sciences, University of Hyderabad, Hyderabad 500046, Telangana, India; Cognitive Science Lab, International Institute of Information Technology (IIIT) Hyderabad, Hyderabad 500032, Telangana, India
- Bapi Raju Surampudi
- School of Computer and Information Sciences, University of Hyderabad, Hyderabad 500046, Telangana, India; Cognitive Science Lab, International Institute of Information Technology (IIIT) Hyderabad, Hyderabad 500032, Telangana, India
- Anand K Kondapi
- Department of Biotechnology and Bioinformatics, School of Life Sciences, University of Hyderabad, Hyderabad 500046, Telangana, India; Cognitive Science Lab, International Institute of Information Technology (IIIT) Hyderabad, Hyderabad 500032, Telangana, India
|
19
|
Cairelli MJ, Fiszman M, Zhang H, Rindflesch TC. Networks of neuroinjury semantic predications to identify biomarkers for mild traumatic brain injury. J Biomed Semantics 2015; 6:25. [PMID: 25992264 PMCID: PMC4436163 DOI: 10.1186/s13326-015-0022-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 04/04/2014] [Accepted: 04/22/2015] [Indexed: 12/13/2022]
Abstract
Objective Mild traumatic brain injury (mTBI) has high prevalence in the military, among athletes, and in the general population worldwide (largely due to falls). Consequences can include a range of neuropsychological disorders. Unfortunately, such neural injury often goes undiagnosed due to the difficulty in identifying symptoms, so the discovery of an effective biomarker would greatly assist diagnosis; however, no single biomarker has been identified. We identify several body substances as potential components of a panel of biomarkers to support the diagnosis of mild traumatic brain injury. Methods Our approach to diagnostic biomarker discovery combines ideas and techniques from systems medicine, natural language processing, and graph theory. We create a molecular interaction network that represents neural injury and is composed of relationships automatically extracted from the literature. We retrieve citations related to neurological injury and extract relationships (semantic predications) that contain potential biomarkers. After linking all relationships together to create a network representing neural injury, we filter the network by relationship frequency and concept connectivity to reduce the set to a manageable size of higher interest substances. Results 99,437 relevant citations yielded 26,441 unique relations. 18,085 of these contained a potential biomarker as subject or object with a total of 6246 unique concepts. After filtering by graph metrics, the set was reduced to 1021 relationships with 49 unique concepts, including 17 potential biomarkers. Conclusion We created a network of relationships containing substances derived from 99,437 citations and filtered using graph metrics to provide a set of 17 potential biomarkers. We discuss the interaction of several of these (glutamate, glucose, and lactate) as the basis for more effective diagnosis than is currently possible. 
This method provides an opportunity to focus the effort of wet bench research on those substances with the highest potential as biomarkers for mTBI.
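The filtering step in the Methods (by relationship frequency and concept connectivity) can be sketched as follows. This is a minimal illustration under stated assumptions: predications are (subject, predicate, object) triples, frequency is the number of times a triple was extracted, and connectivity is approximated by node degree among the frequent relations. The thresholds, the function name, and the toy triples are hypothetical and do not reproduce the paper's actual graph metrics.

```python
from collections import Counter


def filter_network(relations, min_frequency=2, min_degree=2):
    """Keep frequent relations whose two concepts are both well connected.

    relations: list of (subject, predicate, object) triples, one entry
    per extraction, so repeated triples encode frequency.
    """
    frequency = Counter(relations)
    frequent = [r for r, n in frequency.items() if n >= min_frequency]

    # Degree of each concept, counted over the frequent relations only.
    degree = Counter()
    for subject, _, obj in frequent:
        degree[subject] += 1
        degree[obj] += 1

    return [
        (subject, predicate, obj)
        for subject, predicate, obj in frequent
        if degree[subject] >= min_degree and degree[obj] >= min_degree
    ]


# Toy extraction results; triples and predicate labels are placeholders.
extracted = (
    [("glutamate", "AFFECTS", "lactate")] * 2
    + [("glucose", "AFFECTS", "lactate")] * 2
    + [("glutamate", "INTERACTS_WITH", "glucose")] * 2
    + [("tau", "AFFECTS", "glucose")] * 2
    + [("S100B", "ASSOCIATED_WITH", "injury")]
)
kept = filter_network(extracted)
kept_concepts = {c for s, _, o in kept for c in (s, o)}
```

Here the single-occurrence triple is dropped by the frequency threshold, and the `tau` relation is dropped because `tau` has only one connection among the frequent relations.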
Affiliation(s)
- Michael J Cairelli
- National Institutes of Health, National Library of Medicine, 38A 9N912A, 8600 Rockville Pike, Bethesda, MD 20892 USA
- Marcelo Fiszman
- National Institutes of Health, National Library of Medicine, 38A 9N912A, 8600 Rockville Pike, Bethesda, MD 20892 USA
- Han Zhang
- Department of Medical Informatics, China Medical University, Shenyang, Liaoning 110001 China
- Thomas C Rindflesch
- National Institutes of Health, National Library of Medicine, 38A 9N912A, 8600 Rockville Pike, Bethesda, MD 20892 USA
|
20
|
Data Integration for Microarrays: Enhanced Inference for Gene Regulatory Networks. Microarrays 2015; 4:255-69. [PMID: 27600224 PMCID: PMC4996389 DOI: 10.3390/microarrays4020255] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/27/2015] [Accepted: 04/30/2015] [Indexed: 01/01/2023]
Abstract
Microarray technologies have been the basis of numerous important findings regarding gene expression in the last few decades. Studies have generated large amounts of data describing various processes, which, due to the existence of public databases, are widely available for further analysis. Given their lower cost and higher maturity compared to newer sequencing technologies, these data continue to be produced, even though data quality has been the subject of some debate. However, given the large volume of data generated, integration can help overcome some issues related, e.g., to noise or reduced time resolution, while providing additional insight into features not directly addressed by sequencing methods. Here, we present an integration test case based on public Drosophila melanogaster datasets (gene expression, binding site affinities, known interactions). Using an evolutionary computation framework, we show how integration can enhance the ability to recover transcriptional gene regulatory networks from these data, and we indicate which data types are more important for quantitative and qualitative network inference. Our results show a clear improvement in performance when multiple datasets are integrated, indicating that microarray data will remain a valuable and viable resource for some time to come.
|
21
|
Abstract
Background High-throughput technologies have become common tools to decipher genome-wide changes in gene expression (GE) patterns. Functional analysis of GE patterns is a daunting task, as it often requires recourse to public repositories of biological knowledge. On the other hand, in many cases a researcher's inquiry can be served by a comprehensive glimpse. The KEGG PATHWAY database is a compilation of manually verified maps of biological interactions represented by the complete set of pathways related to signal transduction and other cellular processes. Rapid mapping of differentially expressed genes to the KEGG pathways may provide an idea of the functional relevance of the gene lists corresponding to high-throughput expression data. Results Here we present a web-based graphic tool, KEGG Pathway Painter (KPP). KPP paints pathways from the KEGG database using large sets of candidate genes accompanied by "overexpressed" or "underexpressed" marks, for example, those generated by microarrays or miRNA profiling. Conclusion KPP provides fast and comprehensive visualization of global GE changes by consolidating a list of color-coded candidate genes into the KEGG pathways. KPP is freely available and can be accessed at http://web.cos.gmu.edu/~gmanyam/kegg/
|
22
|
LGscore: A method to identify disease-related genes using biological literature and Google data. J Biomed Inform 2015; 54:270-82. [DOI: 10.1016/j.jbi.2015.01.003] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 06/03/2014] [Revised: 12/23/2014] [Accepted: 01/05/2015] [Indexed: 02/05/2023]
|
23
|
Abstract
With the development of high-throughput genomic technologies, large, genome-wide datasets have been collected, and the integration of these datasets should provide large-scale, multidimensional, and insightful views of biological systems. We developed a method for gene association network construction based on gene expression data that integrates a variety of biological resources. Assuming gene expression data are from a multivariate Gaussian distribution, a graphical lasso (glasso) algorithm is able to estimate the sparse inverse covariance matrix by a lasso (L1) penalty. The entries of the inverse covariance matrix can be seen as direct correlations between gene pairs in the gene association network. In our work, instead of using a single penalty, different penalty values were applied for gene pairs based on a priori knowledge as to whether the two genes should be connected. The a priori information can be calculated or retrieved from other biological data, e.g., Gene Ontology similarity, protein-protein interaction, gene regulatory network. By incorporating prior knowledge, the weighted graphical lasso (wglasso) outperforms the original glasso both on simulations and on data from Arabidopsis. Simulation studies show that even when some prior knowledge is not correct, the overall quality of the wglasso network was still greater than when that information was not incorporated, i.e., with glasso.
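For orientation, the penalized likelihood behind this idea can be written out. The form below is the standard graphical lasso objective with the single penalty replaced by pair-specific penalties; the symbols are assumptions introduced here (S for the sample covariance matrix, Θ for the inverse covariance matrix, λ_ij for the penalty on gene pair i, j), and the exact parameterization used by the authors may differ.

```latex
\hat{\Theta}
  = \arg\max_{\Theta \succ 0}
    \Big[ \log\det\Theta
          - \operatorname{tr}(S\,\Theta)
          - \sum_{i \neq j} \lambda_{ij}\,\lvert \theta_{ij} \rvert \Big]
```

Setting every λ_ij to one common value λ recovers the ordinary glasso penalty on the off-diagonal entries. A natural reading of the abstract, stated here as an assumption, is that pairs with prior support for a connection receive a smaller λ_ij, so their entries θ_ij are less likely to be shrunk to zero.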
|